Abstract
This paper concerns statistical inference for longitudinal data with ultrahigh dimensional covariates. We first study the problem of constructing confidence intervals and hypothesis tests for a low dimensional parameter of interest. The major challenge is how to construct a powerful test statistic in the presence of high-dimensional nuisance parameters and complex within-subject correlation of longitudinal data. To deal with this challenge, we propose a new quadratic decorrelated inference function approach, which simultaneously removes the impact of nuisance parameters and incorporates the correlation to enhance the efficiency of the estimation procedure. When the parameter of interest is of fixed dimension, we prove that the proposed estimator is asymptotically normal and attains the semiparametric information bound, based on which we can construct an optimal Wald test statistic. We further extend this result and establish the limiting distribution of the estimator when the dimension of the parameter of interest grows with the sample size at a polynomial rate. Finally, we study how to control the false discovery rate (FDR) when a vector of high-dimensional regression parameters is of interest. We prove that applying Storey's (2002) procedure to the proposed test statistics for the regression parameters controls the FDR asymptotically for longitudinal data. We conduct simulation studies to assess the finite sample performance of the proposed procedures. Our simulation results imply that the newly proposed procedure controls both the Type I error for testing a low dimensional parameter of interest and the FDR in the multiple testing problem. We also apply the proposed procedure to a real data example.
Keywords: False discovery rate, generalized estimating equation, quadratic inference function
1. Introduction.
Longitudinal data are ubiquitous in many scientific studies in biology, social science, economics, and medicine. The major challenge in traditional longitudinal data analysis is how to construct more accurate estimates of regression coefficients by incorporating the within-subject correlation. Liang and Zeger (1986) proposed the generalized estimating equation (GEE) method, which improves efficiency by using a working correlation structure. Qu et al. (2000) proposed the quadratic inference function (QIF) approach to further improve the GEE method. Theoretical results for GEE and QIF have been well established by these authors for longitudinal data with fixed dimensional covariates.
In many scientific studies, such as genomic studies and neuroscience research, the dimension of covariates d can far exceed the sample size n. Due to space limitations, we present two concrete motivating examples in the supplementary material, one in which d is comparable to n and one in which d is much larger than n. Motivated by these applications, it is of great interest to develop statistical inference procedures for longitudinal data with ultra-high dimensional covariates. Variable selection and model selection for longitudinal data have been studied by Wang and Qu (2009); Xue et al. (2010) and Ma et al. (2013) under the finite dimensional setting. Wang et al. (2012) proposed penalized GEE methods under the setting of d = O(n). However, the theories developed in the aforementioned works are not applicable to the ultrahigh dimensional setting in which log d = o(n).
Some statistical inference procedures have been developed for independent and identically distributed (i.i.d.) observations with log d = o(n). van de Geer et al. (2014); Javanmard and Montanari (2013); Zhang and Zhang (2014) developed debiased estimators for i.i.d. data under linear and generalized linear models, and constructed confidence intervals for low-dimensional parameters. Ning and Liu (2017) proposed a hypothesis testing procedure based on a decorrelated score function for i.i.d. data, and Fang et al. (2017) further extended the method to the partial likelihood for survival data. These existing methods and theories are not applicable to longitudinal data under the high-dimensional setting, due to the following two challenges. First, the construction of the optimal QIF (or GEE) depends on the existence of the inverse of the sample covariance matrix of a set of high-dimensional estimating equations (Qu et al., 2000). When the number of features is greater than the sample size, this matrix is not invertible, and therefore the quadratic inference function is not well defined. Second, the existing estimation result (Wang et al., 2012) does not hold under the regime log d = o(n), so their penalized estimator cannot be used as the initial estimator for asymptotic inference. Due to these difficulties, the existing debiasing and decorrelation methods are not applicable to the quadratic inference function for ultra-high dimensional longitudinal data.
In this paper, we propose a new inference procedure for longitudinal data under the regime log d = o(n) by decorrelating the QIF. We first consider how to construct confidence intervals and hypothesis tests for a low-dimensional parameter of interest. Specifically, we start by constructing multiple decorrelated quasi-score functions following the generalized estimating equations (GEE) instead of the likelihood or partial-likelihood function developed in the literature. Each decorrelated quasi-score function aims to capture a particular correlation pattern of the repeated measurements, specified by a basis of correlation matrices. Unlike Wang et al. (2012) who estimated the nuisance parameters by penalized generalized estimating equations with unstructured correlation matrix, we estimate the nuisance parameter under the working independence assumption. This is crucial to guarantee the fast rate of convergence of a preliminary estimator under the regime log d = o(n). Then, we propose to optimally combine the multiple decorrelated quasi-score functions to improve the efficiency of the inference procedures using the generalized method of moment. The resulting loss function is a quadratic form of the decorrelated quasi-score functions and therefore we call it quadratic decorrelated inference function (QDIF). Since the dimensionality of the estimating equations is reduced by using the decorrelated quasi-score functions, its sample covariance matrix is invertible with high probability. Thus, the proposed QDIF is always well defined, whereas the QIF may not exist in high dimensions. In theory, the asymptotic properties of the estimator corresponding to QDIF are studied in the following two regimes. First, when the parameter of interest is of fixed dimension, we show that the proposed estimator is asymptotically normal and attains the semiparametric information bound, based on which we can construct an optimal Wald test statistic. Second, when the dimension of the parameter of interest grows with the sample size at a polynomial rate, we give the characterization of the limiting distribution of the proposed estimator and the associated test statistic.
To further broaden the applicability of the proposed method, we study the following multiple testing problem:
$$H_{0j} : \beta_j = 0 \quad \text{versus} \quad H_{1j} : \beta_j \neq 0,$$
for j = 1, …, d, where β* = (β1, ⋯, βd)T is the regression coefficient vector. The null hypothesis H0j is rejected if our test statistic for βj exceeds a cutoff. To guarantee that most of the rejected null hypotheses are real discoveries, we aim to control the false discovery rate (FDR) at a given level by choosing a suitable cutoff for our test statistics. Due to the correlation among repeated measurements, the test statistics for different null hypotheses H0j become correlated, which makes the FDR control challenging. While the Benjamini-Hochberg method can control the FDR if the test statistics are independent or positively dependent (Benjamini and Hochberg, 1995; Benjamini and Yekutieli, 2001), these dependence structures unfortunately do not hold for our test statistics. To control the FDR, we apply the procedure of Storey (2002), which is known to be more powerful than the Benjamini-Hochberg method. Our main result shows that the proposed method controls the FDR asymptotically even though the test statistics are dependent. The intuition is that, by decorrelating the quasi-score function, the correlation among different test statistics becomes weak, so that the correlation only contributes to higher order terms in the FDR approximation, which can be well controlled. The proof of this result relies on a moderate deviation lemma of Liu (2013), who applied the Benjamini-Hochberg procedure to control the FDR under Gaussian graphical models. While the FDR control under linear models has recently been studied by G'Sell et al. (2016); Barber and Candès (2015), the corresponding sequential procedure and the knockoff method cannot be directly extended to longitudinal data, due to the dependence structure. To the best of our knowledge, how to control the FDR in the analysis of longitudinal data under the generalized linear model remains an open problem. Finally, we note that the proposed method is a general recipe for correlated data which can be easily modified to handle family data and clustered data. To facilitate the presentation, we consider longitudinal data throughout the paper.
Paper Organization.
The rest of this paper is organized as follows. In Section 2, we propose the QDIF method and the resulting estimator. We further derive the asymptotic distribution of the estimator and construct the test statistic and confidence interval. In Section 3, we consider the FDR control problem. In Section 4, we investigate the empirical performance of the proposed methods using simulation examples and a real data example. The proofs and technical details are deferred to the Appendix. Proofs of technical lemmas are given in the supplementary material of this paper (Fang et al., 2018).
Notation.
We adopt the following notation throughout this paper. For a vector $v = (v_1, \ldots, v_d)^T$ and 1 ≤ q < ∞, we define $\|v\|_q = (\sum_{j=1}^d |v_j|^q)^{1/q}$, and let ∥v∥0 = ∣supp(v)∣, where supp(v) = {j : vj ≠ 0} and ∣A∣ is the cardinality of a set A. Denote ∥v∥∞ = max1≤j≤d ∣vj∣ and v⊗2 = vvT. For a matrix $M = (M_{jk})$, let ∥M∥max = max1≤j,k≤d ∣Mjk∣, ∥M∥1 = ∑jk ∣Mjk∣ and ∥M∥∞ = maxj ∑k ∣Mjk∣. If the matrix M is symmetric, we let λmin(M) and λmax(M) be the minimal and maximal eigenvalues of M, respectively. We denote by Id the d × d identity matrix. For S ⊆ {1, …, d}, let vS = {vj : j ∈ S}, and let Sc be the complement of S. The gradient and subgradient of a function f(x) are denoted by ∇f(x) and ∂f(x), respectively. Let ∇Sf(x) denote the gradient of f(x) with respect to xS. For two positive sequences an and bn, we write $a_n \asymp b_n$ if C ≤ an/bn ≤ C′ for some constants C, C′ > 0. Similarly, we use a ≲ b to denote a ≤ Cb for some constant C > 0. For a sequence of random variables Xn, we write $X_n \rightsquigarrow X$, for some random variable X, if Xn converges weakly to X, and we write Xn →p a, for some constant a, if Xn converges in probability to a. Given $a, b \in \mathbb{R}$, let a ∨ b and a ∧ b denote the maximum and minimum of a and b, respectively. For notational simplicity, we use C, C′ to denote generic constants, whose values may change from line to line.
2. Inference in High-Dimensional Longitudinal Data.
Let Yij denote the response variable for the j-th observation of the i-th subject, where j = 1, …, mi and i = 1, …, n. Let Xij denote the corresponding d-dimensional covariate vector. Our proposed procedure is still directly applicable to the setting in which the mi's differ from subject to subject, provided that the correlation structure, such as AR(1) or compound symmetry, remains the same; we refer to the supplementary material for further discussion. In most applications, m is relatively small compared with n and d, and we assume throughout the paper that mi = m is fixed.
Denote Yi = (Yi1, …, Yim)T and Xi = (Xi1, …, Xim)T. We further assume that (Xi, Yi), i = 1, …, n, are independent, while the within-subject observations are correlated.
2.1. Quadratic inference function in low-dimensional setting.
Under the framework of generalized linear models, we assume that the regression function satisfies $E(Y_{ij} \mid X_{ij}) = \mu_{ij}(\eta_{ij})$ with $\eta_{ij} = X_{ij}^T \beta^*$, where μij(·) is a known function and β* is the regression coefficient vector. Liang and Zeger (1986) proposed the GEE method to incorporate the within-subject correlation and improve the estimation efficiency of β*. A brief description of this method is given in Section S.2 of the supplementary material (Fang et al., 2018). The GEE yields consistent estimators for any working correlation structure, while the resulting estimator can be far less efficient when the working correlation structure is misspecified. To overcome this drawback, Qu et al. (2000) proposed an alternative approach called QIF, which avoids direct estimation of the correlation structure and provides an optimal estimator even if the working correlation structure is misspecified. Denote by $\mu_i(\beta) = \{\mu_{i1}(X_{i1}^T\beta), \ldots, \mu_{im}(X_{im}^T\beta)\}^T$ the vector of conditional means and by $\dot\mu_i(\beta) = \partial \mu_i(\beta)/\partial \beta^T \in \mathbb{R}^{m \times d}$ its derivative, and let Vi = Cov(Yi∣Xi) be the true covariance matrix of Yi. We can decompose Vi as $V_i = A_i^{1/2}(\beta^*)\, R\, A_i^{1/2}(\beta^*)$. Here, R is the corresponding correlation matrix and Ai(β) is a diagonal matrix whose (j, j)-th entry is the variance of Yij given the covariates and can be written as [Ai(β)]jj = ϕVij(μij), where ϕ is the dispersion parameter and Vij(·) is a given variance function. We further assume that $\dot\mu_{ij}(\cdot) = V_{ij}(\mu_{ij}(\cdot))$, which corresponds to the canonical link function under generalized linear models (while we do not impose the distributional assumptions of GLMs). As seen later, the quasi-score function (2.1) is proportional to the dispersion parameter ϕ, and thus the root of the quasi-score function does not depend on ϕ. For simplicity, we assume ϕ = 1 in the rest of the paper.
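As a concrete instance of this setup, consider a binary response with the canonical logit link: then μij(η) = exp(η)/{1 + exp(η)}, the variance function is Vij(μ) = μ(1 − μ), and indeed $\dot\mu_{ij}(\eta) = \mu_{ij}(\eta)\{1 - \mu_{ij}(\eta)\} = V_{ij}(\mu_{ij}(\eta))$; a Gaussian response with the identity link corresponds to μij(η) = η and Vij(μ) ≡ 1.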
In QIF, it is assumed that R−1 can be approximated by the linear space generated by some known basis matrices T1, …, TK, i.e., $R^{-1} \approx \sum_{k=1}^K a_k T_k$, where a1, …, aK are unknown parameters. Given these basis matrices, the quasi-score function of β is defined as
$$g_n(\beta) = \frac{1}{n}\sum_{i=1}^n g_i(\beta), \qquad g_i(\beta) = \begin{pmatrix} \dot\mu_i^T(\beta)\, A_i^{-1/2}(\beta)\, T_1\, A_i^{-1/2}(\beta)\,\{Y_i - \mu_i(\beta)\} \\ \vdots \\ \dot\mu_i^T(\beta)\, A_i^{-1/2}(\beta)\, T_K\, A_i^{-1/2}(\beta)\,\{Y_i - \mu_i(\beta)\} \end{pmatrix}. \tag{2.1}$$
The QIF proposed by Qu et al. (2000) is
$$Q_n(\beta) = n\, g_n^T(\beta)\, C_n^{-1}(\beta)\, g_n(\beta), \qquad \text{where } C_n(\beta) = \frac{1}{n}\sum_{i=1}^n g_i(\beta)\, g_i^T(\beta), \tag{2.2}$$
which combines the quasi-score functions in gn(β) using the generalized method of moments. Naturally, we estimate β by
$$\hat\beta_{\mathrm{QIF}} = \operatorname*{arg\,min}_{\beta}\; Q_n(\beta). \tag{2.3}$$
Qu et al. (2000) showed that $\hat\beta_{\mathrm{QIF}}$ is $\sqrt{n}$-consistent and efficient under the classical fixed dimensional regime. The “large n, diverging d” asymptotics is studied under generalized additive partial linear models by Wang et al. (2014) when d = o(n1/5). The variable selection consistency of the penalized QIF estimator is established under the same conditions.
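To make the construction concrete, the following sketch (in Python with numpy) implements the quasi-score (2.1) and the QIF objective (2.2) for the simplest identity-link case, in which μi(β) = Xiβ and Ai(β) = Im; the data layout and the function names are illustrative choices of ours rather than details taken from the paper.

```python
import numpy as np

def extended_score(beta, X, Y, T_list):
    """Per-subject stacked quasi-scores g_i(beta) of (2.1) for the identity link,
    where mu_i(beta) = X_i beta and A_i(beta) = I_m.
    X has shape (n, m, d), Y has shape (n, m)."""
    n, m, d = X.shape
    K = len(T_list)
    g = np.zeros((n, K * d))
    for i in range(n):
        resid = Y[i] - X[i] @ beta                      # m-dimensional residual
        for k, Tk in enumerate(T_list):
            g[i, k * d:(k + 1) * d] = X[i].T @ (Tk @ resid)
    return g

def qif(beta, X, Y, T_list, ridge=0.0):
    """QIF objective Q_n(beta) = n * gbar^T C_n^{-1} gbar, cf. (2.2).
    A small ridge can be supplied because C_n is singular once K*d exceeds n."""
    g = extended_score(beta, X, Y, T_list)
    gbar = g.mean(axis=0)
    Cn = g.T @ g / g.shape[0] + ridge * np.eye(g.shape[1])
    return g.shape[0] * float(gbar @ np.linalg.solve(Cn, gbar))
```

In low dimensions one can minimize qif over β with any generic optimizer; once the stacked score dimension Kd exceeds n, however, Cn is singular, which is exactly the difficulty discussed next.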
While the estimation and variable selection properties of the penalized GEE and QIF methods have been investigated under the regime d = O(nα) for some α ≤ 1, how to perform optimal estimation and inference by incorporating the unknown correlation structure remains a challenging problem under the ultra-high dimensional regime, i.e., log d = o(nα) for some α > 0. In particular, to optimally combine the quasi-score functions in the QIF, one has to compute $C_n^{-1}(\beta)$ in (2.2). However, $C_n^{-1}(\beta)$ does not exist when d > n. This is the main difficulty in extending the QIF method to high-dimensional data.
2.2. Optimal inference under high-dimensional setting.
In this section, we consider how to make inference on a low-dimensional component of the parameter β in longitudinal data. We focus on the high-dimensional regime, i.e., log d = o(nα) for some α > 0, which is a more challenging setting in comparison with existing works. For ease of presentation, we partition the vector β as β = (θT, γT)T, where θ is a d0-dimensional parameter of interest with d0 ≪ n, and γ is a high-dimensional nuisance parameter with dimension d − d0. Our goal is to construct a confidence region for θ* and to test the hypotheses H0 : θ* = 0 versus H1 : θ* ≠ 0. Similarly, we denote the corresponding partition of Xi by Xi = (Zi, Ui), where Zi collects the columns of Xi associated with θ and Ui collects those associated with γ. In this section, we assume that there exists an initial estimator which converges to the true β* sufficiently fast. Section 2.3 presents a procedure to construct such an initial estimator.
Before we propose the new procedure, we note that inference in high-dimensional problems has been studied under the linear and generalized linear models with independent data (van de Geer et al., 2014; Zhang and Zhang, 2014; Javanmard and Montanari, 2013; Ning and Liu, 2017). Their methods require the existence of a (pseudo)-likelihood function and a penalized estimator such as Lasso. One may attempt to apply their methods to the associated quasi-likelihood of longitudinal data. However, this simple approach is only feasible under the working independence assumption and in general leads to sub-optimal results as the within-subject correlation is ignored (Liang and Zeger, 1986). To increase the efficiency, one may incorporate the within-subject correlation and apply their methods to the quadratic inference function Qn in (2.2). As explained above, the matrix Cn in (2.2) is not invertible in high dimensions, and the function Qn is not well-defined. Thus, we cannot directly apply the existing methods for efficient inference in high-dimensional longitudinal data.
To address these challenges, we propose a novel quadratic decorrelated inference function (QDIF) approach. Our proposed method relies on the generalized estimating equations and is distinguished from the methods that directly correct the bias of Lasso-type estimators. Instead, we modify the decorrelation idea of Ning and Liu (2017) to construct estimating equations that are insensitive to the impact of high-dimensional nuisance parameters. As mentioned above, how to design the decorrelation step is challenging in the setting of high-dimensional longitudinal data, as a (pseudo)-likelihood function is not available. Unlike the decorrelated score function constructed from the likelihood in Ning and Liu (2017), we construct a decorrelated quasi-score function directly from the generalized estimating equations in (2.1). Borrowing the idea from the QIF method, we replace the inverse of the correlation matrix R−1 in GEE by $\sum_{k=1}^K a_k T_k$, for some unknown parameters a1, …, aK and some pre-specified basis matrices T1, …, TK. For any 1 ≤ i ≤ n and 1 ≤ k ≤ K, we define the decorrelated quasi-score function for subject i with correlation basis Tk as
(2.4) |
where the estimator of the decorrelation matrix for the k-th basis Tk is defined in (2.5) below, and the initial estimator is constructed in Section 2.3. In comparison with the corresponding component of the standard quasi-score function gn(β) in (2.1), the decorrelated quasi-score function removes the dependence between the score functions for Zi and Ui by projecting out the contribution of the nuisance part. Denote
and Hkθθ is defined similarly. Then, we define the estimator in (2.4) as
(2.5) |
where wkl is the (k, l)-th element of W, and λ′ is a tuning parameter. This estimator is introduced to estimate the true decorrelation matrix
(2.6) |
Then, we define the decorrelated quasi-score function of θ by combining the decorrelated quasi-score functions corresponding to the different basis matrices,
(2.7) |
Note that the decorrelated quasi-score function is of dimension d0K instead of dimension dK as gn(β) in (2.1). As d0K ≪ n in our setting, this decorrelated quasi-score function can be used to define an optimal quadratic inference function. In particular, given our initial estimator, we define our QDIF estimator as
(2.8) |
Here Θn is a neighborhood around the initial estimator whose size is governed by some small constant c > 0, and
(2.9) |
Since the QDIF is generally a nonconvex function of θ, there may exist multiple local solutions, especially when d0 is large. To alleviate this issue, we propose the above localized estimator, obtained by minimizing the QDIF in a small neighborhood Θn around the initial estimator. In the theoretical analysis, we show that the QDIF is strongly convex for θ ∈ Θn with probability tending to one. Thus, any off-the-shelf convex optimization algorithm is applicable to solving the problem (2.8).
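The exact form of the decorrelated quasi-scores in (2.4) involves the estimated decorrelation matrices and is not reproduced here; the sketch below only illustrates the generic GMM combination step in (2.8)–(2.9), assuming a user-supplied function returning the n × (d0K) matrix of per-subject decorrelated scores at a given θ. The box constraint is a simple stand-in for the neighborhood Θn, and all names are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def qdif(theta, score_fn):
    """QDIF objective: n * Sbar^T C^{-1} Sbar, where the rows of S = score_fn(theta)
    are the per-subject decorrelated quasi-scores (each of dimension d0 * K)."""
    S = score_fn(np.atleast_1d(theta))
    n = S.shape[0]
    Sbar = S.mean(axis=0)
    C = S.T @ S / n               # invertible with high probability since d0 * K << n
    return n * float(Sbar @ np.linalg.solve(C, Sbar))

def qdif_estimator(theta_init, score_fn, radius):
    """Minimize the QDIF over a small box around the initial estimator, cf. (2.8)."""
    theta_init = np.atleast_1d(theta_init)
    bounds = [(t - radius, t + radius) for t in theta_init]
    res = minimize(qdif, x0=theta_init, args=(score_fn,),
                   bounds=bounds, method="L-BFGS-B")
    return res.x
```

Because the objective is strongly convex on Θn with high probability, the local solver above behaves like a convex program in practice.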
2.3. An initial estimator based on working independence structure.
As seen in the previous section, the decorrelated quasi-score function (2.4) requires the knowledge of an initial estimator of β. In this subsection, we shall construct such an initial estimator in the ultra-high dimensional regime, i.e., log d = o(nα) for some α > 0.
Since the initial estimator is only used to estimate the nuisance parameter in (2.4), we allow the estimator to be less efficient in terms of incorporating the within-subject dependence. The following penalized maximum quasi-loglikelihood estimator under the working independence structure serves this purpose,
(2.10) |
where
Here the first term is the negative quasi-loglikelihood under the working independence assumption, and the second term is a penalty encouraging sparsity of the estimator, with some tuning parameter λ > 0. The penalty can be either convex, such as the Lasso penalty (Tibshirani, 1996), or nonconvex, such as SCAD (Fan and Li, 2001). Before we pursue the statistical properties of this estimator further, let us introduce some definitions.
Definition 2.1 (Sub-exponential variable and sub-exponential norm). A random variable X is called sub-exponential if there exists some positive constant K1 such that $P(|X| > t) \le \exp(1 - t/K_1)$ for all t ≥ 0. The sub-exponential norm of X is defined as $\|X\|_{\psi_1} = \sup_{p \ge 1} p^{-1} (E|X|^p)^{1/p}$.
Furthermore, denoting by s = ∥β*∥0 the sparsity of β*, we impose the following assumptions.
Assumption 2.1. Assume that the error ϵij = Yij − E(Yij ∣ Xij) is sub-exponential, i.e., ∥ϵij∥ψ1 ≤ C for some constant C > 0. The covariates are uniformly bounded, i.e., maxi,j ∥Xij∥∞ ≤ C.
The sub-exponential and bounded-covariate assumptions are standard technical conditions for applying concentration inequalities and hold in most applications. Similar assumptions are widely used in the literature, e.g., van de Geer et al. (2014) and Ning and Liu (2017), for analyzing high-dimensional generalized linear models.
Assumption 2.2. For any set where and any vector v belonging to the cone , it holds that
This assumption is known as the restricted eigenvalue condition (Bühlmann and Van De Geer, 2011), and provides the necessary curvature of the loss function within a cone. Specifically, it bounds the minimal eigenvalue of the Hessian matrix from below within the cone. Under Assumptions 2.1 and 2.2 and the technical conditions in Section 2.4, provided the sparsity s and the tuning parameter λ satisfy the appropriate rate conditions, a simple modification of Theorem 5.2 in van de Geer and Müller (2012) implies
(2.11) |
(2.12) |
The rate in (2.11) shows that even if we ignore the correlation structure, the penalized maximum quasi-loglikelihood estimator still converges to β* at the optimal rate in the high-dimensional regime. Assumption 2.2 can be further relaxed by using nonconvex penalties and more tailored statistical optimization algorithms, as discussed in Loh and Wainwright (2013); Wang et al. (2013); Fan et al. (2015); Zhao et al. (2018). It can be easily verified that Assumption 2.2 holds under the conditions in the next subsection together with a minimum eigenvalue condition on the corresponding population matrix. For ease of presentation, we assume that Assumptions 2.1 and 2.2 hold throughout our later discussion, so that the rates of convergence (2.11) and (2.12) are achieved.
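For the identity-link (linear) case, the working-independence initial estimator (2.10) with a Lasso penalty reduces to an ℓ1-penalized least-squares fit on the pooled within-subject observations. A minimal sketch using scikit-learn is given below; the function name is ours, and in practice the cross-validation folds should ideally be formed at the subject level so that the within-subject correlation does not leak across folds.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def initial_estimator(X, Y):
    """Working-independence initial estimator, cf. (2.10), for the identity link:
    pool all n*m observations, ignore the within-subject correlation, and fit a
    Lasso whose tuning parameter is chosen by cross-validation."""
    n, m, d = X.shape
    X_pooled = X.reshape(n * m, d)
    Y_pooled = Y.reshape(n * m)
    fit = LassoCV(cv=5, fit_intercept=False).fit(X_pooled, Y_pooled)
    return fit.coef_          # preliminary estimate of beta*
```

A nonconvex penalty such as SCAD can be substituted without changing the overall recipe, as noted above.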
2.4. Theoretical properties.
In this subsection, we establish the asymptotic distribution of the estimator defined in (2.8). In the analysis, we assume m and K are fixed, and n, d increase to infinity with log d = o(nα) for some α > 0. To make the proposed framework more flexible, we allow d0 to diverge together with n and d. We note that the theoretical results also hold for fixed d0.
To facilitate our discussion, we impose some technical assumptions. Let
(2.13) |
(2.14) |
be the “ideal” versions of and , respectively. Also, we let
(2.15) |
denote the population version of the gradient of S*(θ) at θ*, and let
(2.16) |
denote the population version of the matrix .
Assumption 2.3. The decorrelation matrix is column-wise sparse, i.e., , for 1 ≤ k ≤ K. and .
Assumption 2.4. The mean function μij(·) is third order continuously differentiable and satisfies
Assumption 2.5. The eigenvalues of Tk and C* are bounded, i.e., C−1 ≤ λmin(Tk) ≤ λmax(Tk) ≤ C for any 1 ≤ k ≤ K, C−1 ≤ λmin(C*) ≤ λmax(C*) ≤ C. In addition, the following eigenvalue conditions on the design matrix hold, and for some constant C > 0.
Assumption 2.3 specifies the sparsity of the decorrelation matrices and the boundedness of the covariate effect, which together ensure the fast rate of convergence of the estimated decorrelation matrices. To understand the sparsity assumption, let us consider d0 = 1. Denote and . If there exists such that
(2.17) |
we can verify that for any 1 ≤ k ≤ K. For instance, if μij(ηij) is a quadratic function (corresponding to the Gaussian response) and for 1 ≤ j ≤ m follows the Gaussian design, then (2.17) holds and the sparsity assumption on (and W* in (2.17)) is implied by the sparsity of Σ−1, which is a standard condition for high-dimensional inference in the generalized linear model (van de Geer et al., 2014).
Assumption 2.4 provides some local smoothness conditions on the μi(·)’s around the truth, and it is easy to verify that this assumption is satisfied for many commonly used regression functions μij(·). In Assumption 2.5, we require the basis matrices Tk to be positive definite. In practice, we usually choose the following matrices as the bases: T1 = Im, the identity matrix; T2, the matrix with diagonal elements equal to 0 and off-diagonal elements equal to 1; T3, the matrix with 1’s on the two main off-diagonals and 0’s elsewhere; and T4, the matrix with 1’s at the (1, 1) and (m, m) corners and 0’s elsewhere. As shown by Qu et al. (2000), the commonly used equal-correlation and AR(1) models can be written as linear combinations of the above four basis matrices. However, the matrices T2, T3 and T4 are not positive definite. To meet Assumption 2.5, we can add a multiple of the identity matrix and use Tk + λIm in place of Tk for some constant λ > 0. The eigenvalue condition on C* in Assumption 2.5, as we shall see later, guarantees the existence of the asymptotic variance, which is essential even in the low-dimensional setting. Finally, the minimum eigenvalue condition on the design is used to verify the restricted eigenvalue condition, and the maximum eigenvalue condition is used to control ∥Hkθθ∥2, especially when d0 diverges.
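The four basis matrices just described, together with the identity shift that makes the last three positive definite, can be generated as follows; the default shift λ = 2 is our own illustrative choice (it guarantees positive definiteness for any m), not a value prescribed in the paper.

```python
import numpy as np

def basis_matrices(m, lam=2.0):
    """Basis matrices of Section 2.4 for approximating R^{-1}: the identity, the
    matrix with zero diagonal and unit off-diagonal entries, the matrix with ones
    on the two main off-diagonals, and the matrix with ones at the (1, 1) and
    (m, m) corners.  The last three are shifted by lam * I to meet Assumption 2.5."""
    T1 = np.eye(m)
    T2 = np.ones((m, m)) - np.eye(m)
    T3 = np.diag(np.ones(m - 1), 1) + np.diag(np.ones(m - 1), -1)
    T4 = np.zeros((m, m))
    T4[0, 0] = T4[m - 1, m - 1] = 1.0
    shift = lam * np.eye(m)
    return [T1, T2 + shift, T3 + shift, T4 + shift]
```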
The following lemma shows the rate of convergence of the estimation and prediction errors of the estimated decorrelation matrices, where the matrix norm appearing in the bound is the maximum absolute column sum of the matrix.
Lemma 2.6. Under Assumptions 2.1, 2.2, 2.3, 2.4 and 2.5, if and , we have
where .
Based on Lemma 2.6, we first establish the rate of convergence of the decorrelated estimator in (2.8).
Theorem 2.7. Under Assumptions 2.1, 2.2, 2.3, 2.4 and 2.5, if , and as n, d → ∞,
(2.18) |
then the rate of convergence of the estimator is
(2.19) |
When the dimension of θ* diverges, the convergence rate (2.19) is comparable to Theorem 3.6 of Wang (2011), which establishes the convergence rate of the GEE estimator with a diverging number of covariates d = o(n1/2). This also agrees with our condition d0 = o(n1/2) in (2.18). When d0 is fixed, (2.19) implies that the estimator has the standard root-n rate under the sparsity condition (s ∨ s′) log d log n = o(n1/2), which agrees with the weakest possible assumption in the literature for generalized linear models up to logarithmic factors in d and n (van de Geer et al., 2014).
In order to conduct valid inference, we need to understand the asymptotic distribution of the estimator. The following main theorem of this section establishes the asymptotic normality of our estimator.
Theorem 2.8. Under Assumptions 2.1, 2.2, 2.3, 2.4 and 2.5, if , and as n, d → ∞,
(2.20) |
then the estimator satisfies, as d0 → ∞,
(2.21) |
where and g0(θ) is defined in (2.15). If d0 is fixed, we have
(2.22) |
Theorem 2.8 characterizes the asymptotic distribution of the decorrelated estimator. In particular, we note that Theorem 2.8 holds whether or not the inverse of the within-subject correlation matrix R−1 is correctly specified via the basis matrices {Tk}. Thus, similar to the classical QIF and GEE methods, our estimator is robust to the specification of the within-subject dependence structure.
When d0 diverges, (2.21) can be interpreted as a chi-square approximation with d0 degrees of freedom, which one can further approximate by N(d0, 2d0). To justify the normal approximation of the decorrelated estimator, the required condition d0 = o(n1/3) in (2.20) is stronger than that in Theorem 2.7, and again it is comparable to the condition in Theorem 3.8 of Wang (2011). When d0 is fixed, (2.22) implies that the estimator is asymptotically normal under the same sparsity condition (s ∨ s′) log d log n = o(n1/2) as in Theorem 2.7. Moreover, if R−1 is correctly specified by the basis matrices, as shown in Corollary 2.10, our estimator is semiparametrically efficient.
In order to use the above result to construct confidence regions and statistical tests, we need to estimate the asymptotic variance Σθ in (2.21). This can be accomplished by using the plug-in estimator
(2.23) |
Lemma 2.9. Under the same assumptions as in Theorem 2.8, we have as d0 → ∞,
where Σθ and are defined in (2.21) and (2.23), respectively. If d0 is fixed,
where denotes the chi-square distribution with d0 degrees of freedom.
Consider the hypothesis testing problem H0 : θ* = 0 versus H1 : θ* ≠ 0.
Based on the above result, we define the Wald-type test statistic as follows,
(2.24) |
Lemma 2.9 implies that the distribution of the test statistic can be approximated by a chi-squared distribution with d0 degrees of freedom under H0. In addition, we can obtain an asymptotic (1 − α) × 100% confidence region of θ*:
where the cutoff is the (1 − α) × 100%-th percentile of the chi-square distribution with d0 degrees of freedom.
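Given the QDIF estimate and a consistent estimate of its asymptotic covariance, normalized so that √n(θ̂ − θ*) is asymptotically N(0, Σθ), the Wald statistic and its chi-square calibration can be computed as in the sketch below; the plug-in estimator (2.23) itself is not reproduced here, and the normalization is an assumption of this illustration.

```python
import numpy as np
from scipy.stats import chi2

def wald_test(theta_hat, Sigma_hat, n, alpha=0.05):
    """Wald-type test of H0: theta* = 0 based on the asymptotic normality of the
    QDIF estimator: T_n = n * theta_hat^T Sigma_hat^{-1} theta_hat is compared
    with the chi-square distribution with d0 degrees of freedom."""
    theta_hat = np.atleast_1d(theta_hat)
    d0 = theta_hat.size
    Tn = float(n * theta_hat @ np.linalg.solve(np.atleast_2d(Sigma_hat), theta_hat))
    pvalue = chi2.sf(Tn, df=d0)
    return Tn, pvalue, pvalue < alpha
```

Inverting the acceptance region of this test at level α yields the confidence region displayed above.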
To conclude this section, we compare the efficiency of the proposed estimator with the decorrelated estimator based on the quasi-likelihood. The latter corresponds to the special case of the estimator (2.8) with K = 1 and T1 = I. Consider the case d0 = 1. The following corollary shows that our estimator is more efficient than the quasi-likelihood-based decorrelated estimator and attains the semiparametric information bound. Thus, the proposed QDIF method provides optimal inference for high-dimensional longitudinal data.
Corollary 2.10. Assume that the assumptions in Theorem 2.8 hold. Comparing with the special case K = 1 and T1 = I, the asymptotic variance of the proposed estimator is no greater than that of the quasi-likelihood-based decorrelated estimator. Moreover, if the true correlation matrix R satisfies $R^{-1} = \sum_{k=1}^K a_k T_k$ for some constants a1, …, aK and (2.17) holds, then the proposed estimator is semiparametrically efficient.
3. False Discovery Rate Control.
In the previous section, we developed valid inferential methods to test a low-dimensional parameter of interest in high-dimensional longitudinal data. However, in many practical applications, the parameter of interest may not be pre-specified. Instead, we are interested in simultaneously testing all d hypotheses with θ = βj, i.e., H0j : βj = 0 versus H1j : βj ≠ 0 for all j = 1, …, d. The knowledge of which null hypotheses are rejected can help us identify important features in the longitudinal data. When conducting multiple hypothesis tests, it is common practice to control the false discovery rate (FDR) to avoid spurious discoveries. Under the high-dimensional setting, due to the potential dependence among test statistics, how to control the FDR is a challenging problem. In this direction, Liu (2013) and Barber and Candès (2015) applied the Benjamini-Hochberg procedure and the knockoff procedure to control the FDR under the Gaussian graphical model and the linear model, respectively. Both of their methods crucially depend on the linear structure and are not directly applicable to the generalized linear model, let alone generalized estimating equations for longitudinal data. Thus, the FDR control for high-dimensional longitudinal data is still largely unexplored, despite being of substantial practical importance.
In this section, we extend the procedure discussed in the previous section to control the FDR in multiple hypothesis testing for H0j : βj = 0 versus H1j : βj ≠ 0, where j = 1, …, d. We first construct the test statistic for hypothesis H0j as in (2.24) with d0 = 1, where the corresponding variance estimator is defined as in (2.23). Then we obtain the asymptotic p-value Pj as the upper tail probability of the chi-squared distribution with one degree of freedom evaluated at the observed test statistic. Given a decision rule that rejects H0j if and only if Pj ≤ u for some cutoff u, we define the false discovery proportion (FDP) and false discovery rate (FDR) as
$$\mathrm{FDP}(u) = \frac{\#\{j \in \mathcal{H}_0 : P_j \le u\}}{\max\{\#\{j : P_j \le u\},\, 1\}}, \qquad \mathrm{FDR}(u) = E\{\mathrm{FDP}(u)\},$$
where $\mathcal{H}_0$ denotes the set of true null hypotheses. Given the desired FDR level α, we aim to find a cutoff u such that FDR(u) ≤ α. However, in practice we cannot directly compute FDP(u) and FDR(u), as the set $\mathcal{H}_0$ is unknown. To approximate FDP(u), we utilize the following procedure proposed by Storey (2002), which is known to be more powerful than the Benjamini-Hochberg procedure. Let t ∈ (0, 1) be a tuning parameter. For any u ∈ (0, 1), we define
$$\widehat{\mathrm{FDP}}(u) = \frac{d\, \pi(t)\, u}{\max\{\#\{j : P_j \le u\},\, 1\}} \tag{3.1}$$
as an approximation of FDP(u), where
$$\pi(t) = \min\Big\{\frac{\#\{j : P_j > t\}}{d(1 - t)},\ 1\Big\}.$$
Comparing (3.1) with FDP(u), the denominators are identical. For the numerator, taking expectations gives $E\big[\#\{j \in \mathcal{H}_0 : P_j \le u\}\big] \approx |\mathcal{H}_0|\, u$, as Pj ~ Uniform(0, 1) asymptotically for all $j \in \mathcal{H}_0$; the numerator of (3.1) therefore replaces the unknown $|\mathcal{H}_0|$ by dπ(t). It turns out that the quantity π(t) in (3.1) tends to slightly overestimate $|\mathcal{H}_0|/d$, the proportion of null hypotheses among all hypotheses, because the p-values Pj with $j \notin \mathcal{H}_0$ are typically close to 0 and rarely exceed t. This leads to a slightly conservative cutoff. However, we show in the proof that this overestimation is asymptotically negligible in probability in the setting of a sparse high-dimensional model. Since π(t) estimates a proportion, we force π(t) ≤ 1 by taking the minimum with 1 in its definition.
Given the approximation (3.1), we define the data-driven cutoff $\hat u$ as the largest value of u at which $\widehat{\mathrm{FDP}}(u)$ is controlled at level α:
$$\hat u = \sup\big\{u \in (0, 1) : \widehat{\mathrm{FDP}}(u) \le \alpha\big\}.$$
Then, we reject all the null hypotheses whose corresponding p-values fall below $\hat u$. It is easily seen that the well-known Benjamini-Hochberg procedure corresponds to choosing π(t) = 1 in (3.1). By introducing π(t) ≤ 1, the cutoff for the p-values is no smaller than that of the Benjamini-Hochberg procedure, and therefore the procedure makes at least as many discoveries. Thus, the method is more powerful than the Benjamini-Hochberg procedure. In the literature, the theoretical justification of Storey's (2002) procedure requires that the p-values be independent and uniformly distributed under the null hypothesis. The main technical difficulty in proving the validity of the procedure in our setting is that the p-values from the proposed test statistics are not independent and their null distribution holds only asymptotically. For 1 ≤ j ≤ d, we define
where , and denote the corresponding g0, C* and in the previous section for , and is defined in (2.21) with d0 = 1. Denote , where Ωjk is the correlation between Aij and Aik. For some constant C > 0, let denote the set of strong signals. As the main result in this section, the following theorem shows that our procedure controls the FDR and FDP asymptotically.
Theorem 3.1. Assume that Assumptions 2.1, 2.2, 2.3, 2.4 and 2.5 hold for all βj and . In addition, assume n−1/2s(log d)(log n) = o(1/log d), s/d = o(1), for some constant C > 0, and
(3.2) |
Let c > 0 be a small constant. For any t ∈ [c, 1), we have as n, d → ∞,
In the following, we comment on the conditions in the theorem. First, we require n−1/2s(log d)(log n) = o(1/log d), which is identical to the condition in Theorem 2.8 for fixed d0 up to a logarithmic factor of d. This guarantees the validity of the p-values under the null asymptotically. Since we consider the sparse model, most βj’s are 0, which implies s/d = o(1). We also require that the number of strong signals tends to infinity, which is needed to control the FDP. In addition, we require d = O(nC) in order to apply the moderate deviation lemma of Liu (2013). While we cannot allow d to grow exponentially fast in n as in Theorem 2.8, by choosing C > 1, d can still be much larger than n. Finally, (3.2) imposes conditions on the correlation of the test statistics. Recall that Ωjk is the correlation between Aij and Aik. Denote . If ∣Ωjk∣ ≤ a for some constant a < 1, then (3.2) holds under the assumption for some small δ > 0. In particular, under (2.17) and Assumption 2.3, we can show that and therefore (3.2) is true provided d is sufficiently large. Thus, the FDR and FDP control is still feasible under dependent test statistics provided their correlation satisfies (3.2).
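The cutoff rule of this section can be sketched as follows, given the vector of asymptotic p-values P1, …, Pd; the grid of candidate cutoffs is taken to be the observed p-values, and the function name and default tuning values are illustrative rather than taken from the paper.

```python
import numpy as np

def storey_fdr(pvals, alpha=0.1, t=0.5):
    """Storey (2002)-type procedure of Section 3: estimate the null proportion by
    pi(t), approximate FDP(u) as in (3.1), and reject every hypothesis whose
    p-value lies below the largest cutoff u with estimated FDP at most alpha."""
    pvals = np.asarray(pvals, dtype=float)
    d = pvals.size
    pi_t = min(np.sum(pvals > t) / (d * (1.0 - t)), 1.0)
    u_grid = np.sort(pvals)                      # candidate cutoffs
    n_rejected = np.arange(1, d + 1)             # #{j : P_j <= u} at each candidate
    fdp_hat = d * pi_t * u_grid / np.maximum(n_rejected, 1)
    feasible = np.nonzero(fdp_hat <= alpha)[0]
    if feasible.size == 0:
        return np.zeros(d, dtype=bool)           # nothing can be rejected
    u_hat = u_grid[feasible[-1]]
    return pvals <= u_hat                        # boolean vector of rejections
```

Setting pi_t = 1 in this sketch recovers the Benjamini-Hochberg procedure, in line with the discussion above.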
4. Numerical Studies.
In this section, we evaluate the numerical performance of our proposed method by Monte Carlo simulation and a real data example. We further provide more simulation results in Supplementary Material Section S.4.
4.1. Simulation studies.
We first assess the empirical performance of the proposed method using simulated data. In all settings, we randomly select s out of the d components of β* to be nonzero. We consider two settings in which the nonzero components are either all equal to 1 (in what follows, we refer to this setting as Dirac) or generated independently from a uniform distribution over [0, 2]. We consider the linear model $Y_{ij} = X_{ij}^T \beta^* + \epsilon_{ij}$, where ϵij follows a normal distribution with variance equal to 1. Note that, within each subject, the noise terms ϵij are correlated across j, as specified later. The cardinality s of the active set is set to 5, 10 or 20, and we let d = 200, 500, 1000 and n = 50, 100. In each simulation, we generate the covariate vector Xij from a multivariate normal distribution with covariance matrix Σ for each (i, j), where Σ is indexed by a correlation parameter ρ = 0.25, 0.4, 0.6 or 0.75. We set m = 3 or 5, and we take the within-subject correlation to follow either the equal-correlation model or the AR(1) model. In our method, we include a broad class of matrices as the basis for the inverse of the correlation matrix. Specifically, we generate data from either the equal-correlation model or the AR(1) model, and include the basis matrices discussed in Section 2.4. We investigate the Type I error, the false discovery rate, and the power of our method.
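One replicate of this design can be generated roughly as follows. The covariate covariance is taken here to be the AR-type Σjk = ρ^|j−k|, which is one natural choice indexed by a single parameter ρ; this specific form, the Gaussianity of the covariates, and the function name are assumptions of the sketch rather than details confirmed by the text.

```python
import numpy as np

def simulate_longitudinal(n=50, m=3, d=200, s=5, rho=0.25, r=0.5,
                          within="equal", signal="dirac", seed=0):
    """One replicate of the Section 4.1 design: linear model Y_ij = X_ij^T beta* + eps_ij,
    covariates with an assumed AR(rho) covariance across the d coordinates, and errors
    of unit variance with either equal-correlation or AR(1) structure (parameter r)."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(d)
    idx = rng.choice(d, size=s, replace=False)
    beta[idx] = 1.0 if signal == "dirac" else rng.uniform(0.0, 2.0, size=s)
    Sigma = rho ** np.abs(np.subtract.outer(np.arange(d), np.arange(d)))
    if within == "equal":
        R = np.full((m, m), r)
        np.fill_diagonal(R, 1.0)
    else:                                    # AR(1) within-subject correlation
        R = r ** np.abs(np.subtract.outer(np.arange(m), np.arange(m)))
    L_x, L_e = np.linalg.cholesky(Sigma), np.linalg.cholesky(R)
    X = rng.standard_normal((n, m, d)) @ L_x.T       # each X_ij ~ N(0, Sigma)
    eps = rng.standard_normal((n, m)) @ L_e.T        # each subject's errors ~ N(0, R)
    Y = np.einsum("imd,d->im", X, beta) + eps
    return X, Y, beta
```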
We first consider the empirical Type I error. Specifically, we apply the proposed method to test a null hypothesis of the form H0 : βj = 0 for a coefficient that is truly zero in our setting. The tuning parameters λ and λ′ are determined by five-fold cross-validation. The simulation is repeated 1,000 times. We report the empirical Type I error at the 5% significance level in Tables 1 and 2. It is clearly seen that the proposed test controls the empirical Type I error at the desired nominal level, which indicates that the asymptotic distribution of our test statistic is reasonably accurate in finite samples.
Table 1.
Empirical Type I error rate (%) under equal-correlation with correlation parameter being 0.5. The nominal level is set to be 5%.
| (n, d) | s | ρ = 0.25, Dirac | ρ = 0.25, U[0, 2] | ρ = 0.4, Dirac | ρ = 0.4, U[0, 2] | ρ = 0.6, Dirac | ρ = 0.6, U[0, 2] | ρ = 0.75, Dirac | ρ = 0.75, U[0, 2] |
|---|---|---|---|---|---|---|---|---|---|
| m = 3 | | | | | | | | | |
| (50,200) | 5 | 5.9 | 5.5 | 5.5 | 5.3 | 5.8 | 4.7 | 4.4 | 4.6 |
| | 10 | 5.9 | 5.9 | 5.2 | 4.8 | 4.8 | 5.8 | 5.1 | 4.9 |
| | 20 | 6.1 | 5.7 | 5.9 | 4.8 | 6.2 | 6.0 | 5.4 | 5.5 |
| (100,500) | 5 | 5.6 | 5.4 | 5.7 | 5.2 | 6.0 | 5.9 | 4.7 | 5.1 |
| | 10 | 5.8 | 5.2 | 5.5 | 5.3 | 5.8 | 5.3 | 3.9 | 5.4 |
| | 20 | 5.7 | 4.9 | 5.4 | 4.9 | 5.6 | 4.0 | 6.2 | 5.7 |
| (100,1000) | 5 | 5.3 | 5.7 | 5.8 | 5.5 | 5.8 | 6.3 | 5.1 | 4.5 |
| | 10 | 5.9 | 5.4 | 5.3 | 5.5 | 5.3 | 5.1 | 4.0 | 4.3 |
| | 20 | 6.1 | 5.8 | 5.6 | 5.7 | 5.5 | 4.3 | 6.8 | 6.2 |
| m = 5 | | | | | | | | | |
| (50,200) | 5 | 5.5 | 5.2 | 5.3 | 5.2 | 5.4 | 5.1 | 5.0 | 4.8 |
| | 10 | 5.3 | 5.2 | 5.3 | 5.1 | 4.9 | 5.5 | 5.3 | 5.0 |
| | 20 | 5.3 | 5.4 | 5.4 | 5.3 | 5.7 | 5.3 | 5.3 | 5.6 |
| (100,500) | 5 | 5.3 | 5.3 | 5.5 | 5.3 | 5.4 | 5.6 | 4.9 | 5.2 |
| | 10 | 5.5 | 5.4 | 5.4 | 5.1 | 5.6 | 5.5 | 4.7 | 4.8 |
| | 20 | 5.5 | 5.3 | 5.3 | 5.0 | 4.8 | 4.4 | 5.8 | 5.9 |
| (100,1000) | 5 | 5.2 | 5.5 | 5.5 | 5.4 | 5.5 | 5.8 | 5.3 | 5.4 |
| | 10 | 5.6 | 5.7 | 5.0 | 5.6 | 5.2 | 5.3 | 4.4 | 4.5 |
| | 20 | 5.9 | 5.6 | 5.3 | 5.5 | 5.4 | 4.8 | 5.7 | 6.0 |
Table 2.
Empirical Type I error rate (%) under AR-correlation structure with correlation parameter being 0.6. The nominal level is set to be 5%.
| (n, d) | s | ρ = 0.25, Dirac | ρ = 0.25, U[0, 2] | ρ = 0.4, Dirac | ρ = 0.4, U[0, 2] | ρ = 0.6, Dirac | ρ = 0.6, U[0, 2] | ρ = 0.75, Dirac | ρ = 0.75, U[0, 2] |
|---|---|---|---|---|---|---|---|---|---|
| m = 3 | | | | | | | | | |
| (50,200) | 5 | 5.4 | 5.3 | 5.4 | 5.5 | 5.2 | 5.0 | 5.3 | 5.1 |
| | 10 | 4.6 | 5.1 | 5.4 | 4.8 | 5.1 | 5.3 | 5.1 | 5.2 |
| | 20 | 5.8 | 5.7 | 5.2 | 5.5 | 5.6 | 5.2 | 4.7 | 5.4 |
| (100,500) | 5 | 5.4 | 5.1 | 5.6 | 5.3 | 5.5 | 5.3 | 5.2 | 5.3 |
| | 10 | 5.5 | 5.6 | 5.7 | 5.4 | 5.8 | 5.6 | 5.3 | 5.5 |
| | 20 | 5.6 | 4.8 | 5.8 | 5.3 | 5.3 | 4.5 | 4.6 | 4.6 |
| (100,1000) | 5 | 5.3 | 5.4 | 4.8 | 4.7 | 5.3 | 5.9 | 5.7 | 5.6 |
| | 10 | 5.4 | 5.6 | 5.3 | 5.4 | 5.5 | 5.3 | 4.6 | 4.4 |
| | 20 | 6.1 | 5.8 | 5.6 | 5.7 | 5.5 | 5.3 | 5.9 | 5.7 |
| m = 5 | | | | | | | | | |
| (50,200) | 5 | 5.2 | 5.2 | 5.3 | 5.6 | 5.1 | 5.2 | 5.4 | 5.0 |
| | 10 | 5.1 | 5.3 | 5.2 | 5.1 | 5.3 | 5.2 | 5.2 | 5.3 |
| | 20 | 5.5 | 5.6 | 5.3 | 5.4 | 5.5 | 5.1 | 4.8 | 5.3 |
| (100,500) | 5 | 5.3 | 5.3 | 5.4 | 5.5 | 5.4 | 5.2 | 5.1 | 5.2 |
| | 10 | 5.2 | 5.7 | 5.6 | 5.5 | 4.8 | 4.9 | 5.2 | 5.6 |
| | 20 | 5.5 | 5.8 | 5.5 | 5.4 | 5.6 | 5.2 | 4.9 | 4.7 |
| (100,1000) | 5 | 5.5 | 5.7 | 5.4 | 5.6 | 5.5 | 5.6 | 5.4 | 5.7 |
| | 10 | 5.6 | 5.7 | 5.2 | 5.5 | 5.6 | 5.4 | 5.2 | 5.3 |
| | 20 | 5.8 | 6.1 | 5.8 | 5.7 | 6.0 | 5.9 | 4.5 | 4.3 |
Next, we consider the empirical false discovery rate by applying the method described in Section 3. In particular, we simultaneously test all d hypotheses H0j : βj = 0 for j = 1, …, d. After obtaining the d p-values, we apply the proposed method at level α = 0.1 or 0.2. Under the same data-generating schemes as in the Type I error study, we repeat the simulation 1,000 times and report the average false discovery rate in Tables 3 and 4. We find that the empirical false discovery rates are well controlled under the different settings. Furthermore, we plot the empirical false discovery rate against the nominal false discovery rate from 0 to 1 in Figure 1 under several settings. Our approach controls the false discovery rate well across the range of desired levels. It is worth noting that in the second subfigure, the empirical FDR deviates from the nominal one because the maximum possible false discovery rate is (d − s)/d = 90% when (s, d) = (20, 200).
Table 3.
Empirical false discovery rate (%) at level α = 0.1 and 0.2 under equal-correlation structure with correlation parameter being 0.5.
| (n, d) | s | ρ = 0.25, α = 0.1 | ρ = 0.25, α = 0.2 | ρ = 0.4, α = 0.1 | ρ = 0.4, α = 0.2 | ρ = 0.6, α = 0.1 | ρ = 0.6, α = 0.2 | ρ = 0.75, α = 0.1 | ρ = 0.75, α = 0.2 |
|---|---|---|---|---|---|---|---|---|---|
| m = 3 | | | | | | | | | |
| (50,200) | 5 | 9.3 | 19.1 | 9.6 | 19.6 | 8.9 | 18.9 | 10.6 | 20.8 |
| | 10 | 8.8 | 19.3 | 9.2 | 18.9 | 9.3 | 20.7 | 10.4 | 21.0 |
| | 20 | 8.7 | 18.7 | 8.8 | 18.8 | 9.4 | 19.3 | 9.4 | 20.9 |
| (100,500) | 5 | 9.6 | 19.2 | 10.3 | 20.2 | 11.0 | 20.9 | 10.9 | 21.3 |
| | 10 | 9.7 | 20.1 | 9.5 | 21.1 | 8.7 | 20.8 | 8.8 | 20.8 |
| | 20 | 9.4 | 18.9 | 9.2 | 18.9 | 10.5 | 20.3 | 11.1 | 21.3 |
| (100,1000) | 5 | 10.4 | 20.8 | 9.5 | 20.7 | 9.2 | 21.3 | 9.4 | 20.6 |
| | 10 | 9.5 | 21.2 | 9.2 | 20.6 | 9.1 | 20.9 | 9.8 | 20.9 |
| | 20 | 9.3 | 21.8 | 8.9 | 21.5 | 8.7 | 22.0 | 12.1 | 21.4 |
| m = 5 | | | | | | | | | |
| (50,200) | 5 | 9.5 | 19.7 | 9.4 | 19.5 | 9.2 | 19.3 | 10.0 | 20.5 |
| | 10 | 9.1 | 19.5 | 9.6 | 19.2 | 9.1 | 21.3 | 10.7 | 20.4 |
| | 20 | 9.2 | 19.1 | 9.0 | 19.0 | 9.5 | 18.9 | 9.8 | 18.9 |
| (100,500) | 5 | 9.3 | 20.5 | 9.7 | 21.0 | 10.6 | 20.3 | 10.2 | 20.7 |
| | 10 | 9.5 | 19.7 | 9.6 | 21.3 | 9.8 | 20.4 | 9.5 | 21.3 |
| | 20 | 9.7 | 19.4 | 9.5 | 19.2 | 9.5 | 19.3 | 10.9 | 18.5 |
| (100,1000) | 5 | 9.8 | 20.4 | 10.8 | 20.4 | 8.9 | 19.3 | 9.6 | 19.5 |
| | 10 | 9.2 | 20.7 | 9.3 | 20.4 | 9.3 | 20.5 | 9.5 | 20.7 |
| | 20 | 9.1 | 21.0 | 10.5 | 19.2 | 9.2 | 21.7 | 11.3 | 20.6 |
Table 4.
Empirical false discovery rate (%) at level α = 0.1 and 0.2 under AR-correlation structure with correlation parameter being 0.6
| (n, d) | s | ρ = 0.25, α = 0.1 | ρ = 0.25, α = 0.2 | ρ = 0.4, α = 0.1 | ρ = 0.4, α = 0.2 | ρ = 0.6, α = 0.1 | ρ = 0.6, α = 0.2 | ρ = 0.75, α = 0.1 | ρ = 0.75, α = 0.2 |
|---|---|---|---|---|---|---|---|---|---|
| m = 3 | | | | | | | | | |
| (50,200) | 5 | 9.6 | 20.3 | 9.8 | 20.4 | 10.1 | 20.5 | 9.7 | 19.4 |
| | 10 | 9.4 | 20.3 | 9.5 | 20.8 | 10.4 | 19.5 | 9.8 | 20.9 |
| | 20 | 10.8 | 19.5 | 10.6 | 21.2 | 10.0 | 19.3 | 9.6 | 18.8 |
| (100,500) | 5 | 10.3 | 20.6 | 10.5 | 20.5 | 9.5 | 20.1 | 10.3 | 19.8 |
| | 10 | 10.4 | 19.5 | 9.6 | 20.3 | 9.5 | 20.8 | 10.4 | 21.0 |
| | 20 | 8.9 | 20.8 | 9.3 | 19.2 | 8.8 | 21.0 | 8.7 | 20.9 |
| (100,1000) | 5 | 10.5 | 20.6 | 10.6 | 20.7 | 9.3 | 19.2 | 9.2 | 20.3 |
| | 10 | 10.5 | 20.7 | 10.8 | 19.5 | 10.2 | 19.4 | 9.7 | 20.4 |
| | 20 | 8.7 | 20.5 | 11.3 | 20.7 | 10.4 | 20.1 | 10.9 | 20.6 |
| m = 5 | | | | | | | | | |
| (50,200) | 5 | 10.3 | 20.1 | 10.0 | 20.3 | 9.8 | 20.5 | 9.9 | 19.8 |
| | 10 | 9.7 | 20.3 | 9.6 | 20.8 | 10.4 | 19.4 | 9.4 | 20.7 |
| | 20 | 9.6 | 19.5 | 9.5 | 19.3 | 10.5 | 20.9 | 10.8 | 20.8 |
| (100,500) | 5 | 9.5 | 20.3 | 9.8 | 20.6 | 9.6 | 19.7 | 20.4 | 20.4 |
| | 10 | 9.4 | 20.8 | 10.1 | 20.7 | 10.9 | 20.8 | 9.3 | 20.8 |
| | 20 | 10.3 | 20.5 | 9.2 | 21.1 | 10.5 | 20.9 | 9.1 | 19.5 |
| (100,1000) | 5 | 10.7 | 20.8 | 10.5 | 19.5 | 9.8 | 20.0 | 9.6 | 19.5 |
| | 10 | 10.6 | 20.5 | 10.8 | 20.9 | 9.2 | 20.8 | 8.9 | 21.1 |
| | 20 | 11.0 | 21.2 | 9.3 | 19.6 | 9.5 | 20.3 | 8.8 | 18.7 |
Fig 1:
Empirical FDR of the proposed method in AR(1) and equal correlation models, where we take the correlation parameter as 0.75.
Finally, we investigate the empirical power of the proposed test and compare it with two other high-dimensional inference procedures: the debiased Lasso method (Zhang and Zhang, 2014; van de Geer et al., 2014) and the decorrelation method (Ning and Liu, 2017), both applied as if all observations were independent. In particular, we test H0 : β1 = 0 under the Dirac setting, where the signal β1 gradually increases from 0 to 0.7, and we record the empirical rejection rate under different settings. The results are summarized in Figure 2. As expected, our QDIF approach achieves better empirical power, especially when the signal is weak and s is relatively large. This is in line with our theoretical results.
Fig 2:
Empirical power for quadratic decorrelated inference (QDIF), debiased Lasso and decorrelation methods under AR(1) and equal correlation models, where we take the correlation parameter as 0.75.
4.2. BMI dataset.
We further evaluate our method using a BMI genomic dataset from the Framingham Heart Study (FHS). This is a long-term, ongoing cardiovascular study of residents of the town of Framingham, Massachusetts, begun in 1948 under the direction of the National Heart, Lung, and Blood Institute (NHLBI). The objective is to identify the important characteristics that contribute to cardiovascular disease. We refer to Jaquish (2007) for more details of the study. Recently, 913,854 SNPs from 24 chromosomes have been genotyped in the Offspring Cohort study. We investigate obesity as measured by the body mass index (BMI), where BMI = weight (kg)/height (m)2. Our dataset contains the BMI of 977 samples, where each sample's BMI value is collected at 26 time points. Since there are missing values in the responses of different samples, we retain n = 234 samples whose BMI values are recorded at mi time points ranging from 3 to 7. The genotypes of the nonrare SNPs from the twenty-three chromosomes are also recorded. Taking the BMI values as the response variable Y, we first screen the features by regressing the BMI values on each of the SNPs, and only keep the SNPs with marginal p-value less than 0.05. This reduces the dimension to d = 4,294. Then, for the j-th SNP we treat this covariate as Z in Section 2 and the rest of the covariates as U. We apply the proposed QDIF method to test whether the j-th SNP is associated with BMI, using the same basis matrices as in the simulations. The obtained p-value is recorded as Pj as in Section 3. We repeat this procedure for all the SNPs, which yields a sequence of p-values P1, …, Pd. When we select important SNPs based on these p-values, we need to account for the fact that we have been looking at a large number of candidate SNPs (the so-called multiple testing effect). Failure to account for the multiple testing effect causes irreproducibility of the results and may yield misleading scientific conclusions. Given the practical importance of this problem, we developed a rigorous result on FDR control. Applying our result to the data analysis, we find that the SNP at the 12289th position of the 1st chromosome, the 681st, 756th and 19880th SNPs of the 10th chromosome, and the 1189th and 12075th SNPs of the 20th chromosome are significant with the FDR controlled at 10%. Interestingly, it is known that the 10th and 20th chromosomes are related to obesity (Dong et al., 2003), which matches our finding that the significant SNPs are mostly located on the 10th and 20th chromosomes.
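The marginal screening step described above can be sketched as follows, treating the repeated BMI measurements as pooled observations for the purpose of screening; the exact screening regression used in the analysis may differ, so this is only an illustration with names of our own choosing.

```python
import numpy as np
from scipy import stats

def marginal_screen(X, y, threshold=0.05):
    """Screen covariates by regressing the response on each single covariate and
    keeping those whose marginal slope has a p-value below the threshold."""
    keep = []
    for j in range(X.shape[1]):
        res = stats.linregress(X[:, j], y)
        if res.pvalue < threshold:
            keep.append(j)
    return np.array(keep, dtype=int)
```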
5. Technical Lemmas and Proofs.
5.1. Technical lemmas.
In this subsection, we provide some technical lemmas used in our proofs in Sections 2 and 3. The proofs of these technical lemmas are given in the supplementary material (Fang et al., 2018).
The first lemma, on the rate of convergence of random matrices in the spectral norm, is derived from the matrix Bernstein inequality and is fundamental for the rest of the proof.
Lemma 5.1. Suppose that Assumptions 2.1 – 2.5 and hold. Then
(5.1) |
and
(5.2) |
Lemma 5.2. Recall that and . Under the conditions in Theorem 2.7, we have
Lemma 5.3. Recall that in (2.9), where , and in (2.16), where . Under the conditions in Theorem 2.7, we have
Lemma 5.4. Under the same conditions as in Theorem 2.7, we have,
Lemma 5.5. Recall that , where . Under the same conditions as in Theorem 2.7, we have
Lemma 5.6. Let c be a small constant. Under the conditions in Theorem 2.7, uniformly over it holds that with probability tending to one
for some positive constant C.
Lemma 5.7. Suppose Assumptions 2.1, 2.2, 2.3, 2.4 and 2.5 hold for all for j ∈ [d], and suppose that and n−1/2s(log d)(log n) = o(1/log d). Let the “ideal” version of be
(5.3) |
where , and denote corresponding g0, C* and in the previous section for , and let the ideal version of be
where σj is defined in (2.21). We have that converges to such that as n → ∞,
and
Lemma 5.8. Suppose the conditions in Theorem 3.1 holds. Let rd be any sequence such that rd → ∞ as d → ∞, and . Then we have
in probability.
5.2. Proof of main results.
Proof of Theorem 2.7. Since under condition (2.18), we claim that θ* lies in Θn with probability tending to one. By the definition of , it holds that , which further implies
By Lemma 5.6, the left hand side is lower bounded by . Thus, by the Cauchy–Schwarz inequality,
Together with Lemma 5.5, we have , which completes the proof.
Proof of Theorem 2.8. We first note that
This implies that the estimator belongs to the interior of Θn. By the first-order optimality condition, it satisfies
By the mean-value theorem for vector valued functions, for each component of , say there exists for some ζj ∈ [0, 1] such that . For notational simplicity, we suppress the subscript j in , and write it as
Thus, we have
Define
Then, it holds that . Putting together Lemmas 5.2, 5.3 and 5.4, we can show that
with some tedious algebraic manipulation similar to the proof of Lemma 5.6. In addition, we can show that
Combining the above results, we have
To show the limiting distribution of , we first note that
by Theorem 2.7 and the assumed conditions. Thus, it suffices to show the limiting distribution of . Note that . Theorem 1.1 in Bentkus (2003) implies
where is the set of all Euclidean balls in , C is a positive constant and are d0 independent N(0, 1). Finally, we obtain for any ,
where Φ(·) is the c.d.f. of a standard normal distribution and C′ is an absolute constant from the standard Berry-Esseen bound. The same probability can be lower bounded by using the same argument. Thus, as d0, n → ∞, converges weakly to N(0, 1). When d0 is fixed, the Lyapunov condition for holds for any and thus . Finally, we obtain (2.22) by applying the Cramér–Wold device.
Proof of Theorem 3.1. For notational simplicity, we suppress the dependence of on t. By the definition of FDP(u) and , we have
(5.5) |
We first show that in probability. We prove a moderate deviation result for this quantity. By assumption, we have . Let rd be a sequence such that rd → ∞ as d → ∞, and . We first prove that as n → ∞. Note that
Hence, for any , we have
(5.6) |
Recall that . By extending the proof of Theorem 2.8, we get
(5.7) |
For any , we have
Therefore, we have that
where in the last inequality we used the condition that for and . If , hence for large enough d. Moreover, by (5.7)
Hence we get . Therefore, by (5.6), we conclude that , uniformly over . Therefore, we have
which implies that in L1 and in probability. Hence, we have
As , we conclude that in probability, and hence by the definition of , we obtain
(5.8) |
Hence, by (5.8) and Lemma 5.8, we conclude that in probability. Finally, by the definition of π(t), we have
If we have
We have , and in probability. Moreover, given any t ∈ [c, 1), for d large enough, we have t ≥ rd/d. Hence by Lemma 5.8, we have in probability. Hence we conclude that in probability. On the other hand, if , we have .
Putting together the above results and by (5.5), we get and similarly in probability, hence concluding the proof.
Supplementary Material
Acknowledgment.
The authors are grateful to the AE and reviewers for their constructive comments, which led to a significant improvement over the earlier version of this paper. Li is the corresponding author, and he also received partial travel support from NNSFC grants 11690014 and 11690015 during his visits to the Chinese Academy of Sciences and Nankai University.
* Supported by NIH grant P50 DA039838 and NSF grant DMS 1820702.
† Supported by NSF grant DMS 1854637.
‡ Supported by NSF grant DMS 1820702, NIH grants P50 DA039838 and T32 LM012415.
References.
- Barber RF and Candès EJ (2015). Controlling the false discovery rate via knockoffs. Ann. Statist., 43 2055–2085.
- Benjamini Y and Hochberg Y (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B 289–300.
- Benjamini Y and Yekutieli D (2001). The control of the false discovery rate in multiple testing under dependency. Ann. Statist. 1165–1188.
- Bentkus V (2003). On the dependence of the Berry-Esseen bound on dimension. Journal of Statistical Planning and Inference, 113 385–402.
- Bühlmann P and van de Geer S (2011). Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Science & Business Media.
- Dong C, Wang S, Li W-D, Li D, Zhao H and Price RA (2003). Interacting genetic loci on chromosomes 20 and 10 influence extreme human obesity. Amer. J. Hum. Genet., 72 115–124.
- Fan J and Li R (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc., 96 1348–1360.
- Fan J, Liu H, Sun Q and Zhang T (2015). TAC for sparse learning: Simultaneous control of algorithmic complexity and statistical error. arXiv preprint arXiv:1507.01037.
- Fang EX, Ning Y and Li R (2018). Supplement to “Test of significance for high-dimensional longitudinal data”.
- Fang EX, Ning Y and Liu H (2017). Testing and confidence intervals for high dimensional proportional hazards models. J. R. Stat. Soc. Ser. B, 79 1415–1437.
- G’Sell MG, Wager S, Chouldechova A and Tibshirani R (2016). Sequential selection procedures and false discovery rate control. J. R. Stat. Soc. Ser. B, 78 423–444.
- Jaquish CE (2007). The Framingham Heart Study, on its way to becoming the gold standard for cardiovascular genetic epidemiology? BMC Med. Genet., 8 63.
- Javanmard A and Montanari A (2013). Confidence intervals and hypothesis testing for high-dimensional statistical models. In Advances in Neural Information Processing Systems. 1187–1195.
- Liang K-Y and Zeger SL (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73 13–22.
- Liu W (2013). Gaussian graphical model estimation with false discovery rate control. Ann. Statist., 41 2948–2978.
- Loh P-L and Wainwright MJ (2013). Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. In Advances in Neural Information Processing Systems. 476–484.
- Ma S, Song Q and Wang L (2013). Simultaneous variable selection and estimation in semiparametric modeling of longitudinal/clustered data. Bernoulli, 19 252–274.
- Ning Y and Liu H (2017). A general theory of hypothesis tests and confidence regions for sparse high dimensional models. Ann. Statist., 45 158–195.
- Qu A, Lindsay BG and Li B (2000). Improving generalised estimating equations using quadratic inference functions. Biometrika, 87 823–836.
- Storey JD (2002). A direct approach to false discovery rates. J. R. Stat. Soc. Ser. B, 64 479–498.
- Tibshirani R (1996). Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 267–288.
- van de Geer S, Bühlmann P, Ritov Y and Dezeure R (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist., 42 1166–1202.
- van de Geer S and Müller P (2012). Quasi-likelihood and/or robust estimation in high dimensions. Stat. Sci. 469–480.
- Wang L (2011). GEE analysis of clustered binary data with diverging number of covariates. Ann. Statist., 39 389–417.
- Wang L, Kim Y and Li R (2013). Calibrating non-convex penalized regression in ultra-high dimension. Ann. Statist., 41 2505.
- Wang L and Qu A (2009). Consistent model selection and data-driven smooth tests for longitudinal data in the estimating equations approach. J. R. Stat. Soc. Ser. B, 71 177–190.
- Wang L, Xue L, Qu A and Liang H (2014). Estimation and model selection in generalized additive partial linear models for correlated data with diverging number of covariates. Ann. Statist., 42 592–624.
- Wang L, Zhou J and Qu A (2012). Penalized generalized estimating equations for high-dimensional longitudinal data analysis. Biometrics, 68 353–360.
- Xue L, Qu A and Zhou J (2010). Consistent model selection for marginal generalized additive model for correlated data. J. Amer. Statist. Assoc., 105 1518–1530.
- Zhang C-H and Zhang SS (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Stat. Soc. Ser. B, 76 217–242.
- Zhao T, Liu H and Zhang T (2018). Pathwise coordinate optimization for sparse learning: Algorithm and theory. Ann. Statist., 46 180–218.