Statistical significance in high-dimensional linear mixed models

Lina Lin; Mathias Drton; Ali Shojaie

doi:10.1145/3412815.3416883

Author manuscript; available in PMC: 2022 Apr 29.

Published in final edited form as: FODS 20 (2020). 2020 Oct 18;2020:171–181. doi: 10.1145/3412815.3416883


PMCID: PMC9053448 NIHMSID: NIHMS1735582 PMID: 35497571

Abstract

This paper concerns the development of an inferential framework for high-dimensional linear mixed effect models. These are suitable models, for instance, when we have n repeated measurements for M subjects. We consider a scenario where the number of fixed effects p is large (and may be larger than M), but the number of random effects q is small. Our framework is inspired by a recent line of work that proposes de-biasing penalized estimators to perform inference for high-dimensional linear models with fixed effects only. In particular, we demonstrate how to correct a ‘naive’ ridge estimator in extension of work by Bühlmann (2013) to build asymptotically valid confidence intervals for mixed effect models. We validate our theoretical results with numerical experiments, in which we show our method outperforms those that fail to account for correlation induced by the random effects. For a practical demonstration we consider a riboflavin production dataset that exhibits group structure, and show that conclusions drawn using our method are consistent with those obtained on a similar dataset without group structure.

1. Introduction

Modern statistical problems are increasingly high-dimensional, with the number of covariates p potentially vastly exceeding the sample size N. This is due in part to technological advances that facilitate data collection. For instance, we are now able to measure the expression of many genes in a given specimen at little cost. However, it often remains expensive to have many replicates/species to experiment on, resulting in N ≪ p.

Fortunately, significant progress has been made in developing rigorous statistical tools for tackling such problems. While earlier work largely targeted point estimation and/or variable selection, recent years have seen a number of proposals on how to also assign uncertainty, statistical significance and confidence in high-dimensional models. This is of great practical importance, particularly when interpretation of parameters and variables is of key priority.

Early attempts are highly varied in their approach. Stability selection was proposed by Meinshausen and Bühlmann (2010) as a generic method for controlling the expected number of false positive selections, with improvements given by Shah and Samworth (2013). Sample splitting, where a first subsample is used to screen and a second subsample is used to perform inference (Wasserman and Roeder, 2009; Meinshausen et al., 2009), has also been explored. Taking an alternative approach, Lockhart et al. (2014), Tibshirani et al. (2014) and Lee et al. (2016) build a framework for conditional inference for high-dimensional linear models, i.e., they conduct inference given that some covariates have been selected.

In this paper, we propose an unconditional inferential framework for high-dimensional linear mixed effect models, with the goal of testing null hypotheses of the form

H_{0, G} : β_{j}^{*} = 0 for all j \in G

(1.1)

where $β^{*} \in R^{p}$ is the vector of fixed effect regression coefficients, and G may be any subset of {1,…,p}. Of particular interest is the case G = {j}, i.e., testing if a single fixed effect coefficient $β_{j}^{*}$ is zero. A related goal is to construct confidence intervals for $β_{j}^{*}$ , j = 1,…,p. This problem arises naturally in many settings, as observations are rarely independent. A prime example is the analysis of longitudinal data, which is highly prevalent in clinical studies. In such settings mixed effect models are a natural extension of linear models for modeling data exhibiting group-structured dependence.

Our framework is inspired by a recent line of work that proposes de-biasing penalized estimators as an approach to inference for high-dimensional linear models with fixed effects only. There, the limiting distribution of the modified estimator is tractable and, thus, can be used to construct approximate p-values and confidence intervals. For example, in high-dimensional linear regression, Zhang and Zhang (2014), van de Geer et al. (2014), and Javanmard and Montanari (2014) suggest de-sparsifying the lasso: starting with the biased lasso estimator, the authors ‘invert’ the corresponding Karush-Kuhn-Tucker (KKT) optimality conditions to form an estimator that is approximately unbiased for β* and normally distributed. By construction, the de-biased estimator can then be used to derive confidence intervals and p-values. Ning and Liu (2017) extended this strategy by developing a score test for inference in penalized M-estimators.

Our proposed method bears strongest resemblance to Bühlmann (2013). Developed for high-dimensional linear models, the framework of Bühlmann (2013) is similar to those put forth by Zhang and Zhang (2014), van de Geer et al. (2014), and Javanmard and Montanari (2014), except it uses ridge estimation as a starting point. While the overall framework we consider is similar, there are important differences in the specifics of how to correct (or rather, approximately correct) for the bias in the ridge estimator, and how to compute an approximation of the limiting distribution of the de-biased estimator in order to construct p-values and confidence intervals for elements of β*. As will be evident later, these differences are direct results of having to cope with dependencies induced by the random effects in the linear mixed effect model. The naive treatment of ignoring the dependencies, as we demonstrate in numerical examples, leads to poor practical performance; in particular, the confidence intervals derived from the resulting estimator have insufficient coverage. We address this issue by introducing a two-stage procedure that yields consistent estimates of the parameters that determine these dependencies. While we describe a ridge-based framework, the methodology could be extended to make use of other high-dimensional estimators as the starting point for constructing a de-biased estimator.

Our decision to use a ridge estimator is based on simulation findings for standard linear models showing that, while asymptotically optimal, confidence intervals from ℓ₁-based de-biasing (Zhang and Zhang, 2014; van de Geer et al., 2014; Javanmard and Montanari, 2014) tend to have coverage problems in finite samples. Yu et al. (2018) similarly noticed that confidence intervals based on a de-biased ℓ₁-estimator for the high-dimensional Cox model had coverage below the theoretical level in practice. Although its theoretical justification is similar, the ridge-based method of Bühlmann (2013) yields better finite-sample error control.

Our paper is organized as follows. The remainder of this section provides a brief overview of the subsequent notation. Section 2 makes explicit the form of the high-dimensional linear mixed effect model we are working with. In Section 3, we describe the details of our method: specifically, how it builds upon Bühlmann (2013) to accommodate dependence within groups induced by the random effects. We also present theory, along with the required assumptions, to justify it. Numerical experiments can be found in Section 4, followed by a practical application of the method in Section 5. We conclude with a discussion and elaborate on potential extensions in Section 6. Proofs are collected in the Appendix.

Notation

Matrices are written in upper-case bold-face and their entries in corresponding lower-case. So a_jk is the (j, k)th entry of matrix $A \in R^{n_{1} \times n_{2}}$ . For j ∈ {1,…,n₂} and J ⊆ {1,…,n₂}, a_j and A_J denote the jth column of A and the column-wise concatenation of columns in A indexed by the set J, respectively. The ith row of A is denoted a⁽ⁱ⁾. For r ∈ [1, ∞], the ℓ_r norm of a vector $u \in R^{n}$ is $‖ u ‖_{r} = (\sum_{i = 1}^{n} ∣ u_{i} ∣^{r})^{1 ∕ r}$ , with the usual convention $‖ u ‖_{\infty} = \max_{i} ∣ u_{i} ∣$ , and the induced norm of a matrix $A \in R^{n_{1} \times n_{2}}$ is ${⫴ A ⫴}_{r} = sup {‖ A x ‖_{r} : x \in R^{n_{2}}, ‖ x ‖_{r} = 1}$ . With this notation, ⫴A⫴₂ is the spectral norm, ⫴A⫴₁ the maximum absolute column sum of the matrix, and ⫴A⫴_∞ the maximum absolute row sum of the matrix. We use ∥A∥_r to denote the ℓ_r norm of the vectorization of A.
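As a small illustration of the matrix norms just described, the following NumPy sketch evaluates them on an arbitrary 2 × 2 matrix. The matrix and the variable names are chosen purely for illustration and are not part of the paper.

```python
import numpy as np

# Arbitrary example matrix (illustrative only).
A = np.array([[1.0, -2.0],
              [3.0, 4.0]])

spectral = np.linalg.norm(A, 2)            # induced 2-norm: largest singular value
max_col_sum = np.linalg.norm(A, 1)         # induced 1-norm: maximum absolute column sum
max_row_sum = np.linalg.norm(A, np.inf)    # induced infinity-norm: maximum absolute row sum
entrywise_l1 = np.abs(A).sum()             # l_1 norm of the vectorization of A
```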

The projection of $R^{n_{1}}$ onto the linear space generated by the columns of A is denoted P_A = A(A^TA)⁻A^T, where A⁻ is the Moore-Penrose inverse of A; accordingly, P_{A^T} projects $R^{n_{2}}$ onto the linear space generated by the rows of A. For square matrices A₁ and A₂ of the same dimensions, A₁ ≤ A₂ indicates that A₂ – A₁ is positive semi-definite.

For real-valued functions g₁(x) and g₂(x) defined on (0, ∞), we write g₁(x) ≲ g₂(x) if there is a constant c ∈ (0, ∞) such that g₁(x) ≤ cg₂(x), and g₁(x) ≳ g₂(x) if instead g₁(x) ≥ cg₂(x). We write g₁(x) ≍ g₂(x) if both g₁(x) ≲ g₂(x) and g₁(x) ≳ g₂(x). Then, g₁(x) = o(g₂(x)) if g₁(x)/g₂(x) → 0 as x → ∞, and g₁(x) = O(g₂(x)) if there is a c ∈ (0, ∞) such that ∣g₁(x)∣ ≤ cg₂(x) for all x large enough. The latter relations also apply when x is a vector, where x → ∞ is interpreted elementwise. Finally, if $X \in R$ is a random variable and $a \in R$ is some constant, we write ∣X – a∣ = o_P(1) if X converges to a in probability, i.e., X →_p a.

2. The linear mixed effect model

Consider M groups of observations of sizes n₁,…,n_M. Let m = 1,…, M be group indices, and let i = 1,…,n_m index the observations within group m. Let N be the total number of observations, so $N = \sum_{m = 1}^{M} n_{m}$ . We may later assume, without loss of generality, that n_m = n for all groups, so that N = nM. The proposed framework allows for non-uniform group sizes with minor adjustments, so long as the group sizes are of the same order.

For group m ∈ {1,…,M}, we observe the response vector $y_{m} \in R^{n}$ , generated as

y_{m} = X_{m} β^{*} + Z_{m} v_{m} + ϵ_{m}, m = 1, \dots, M

(2.1)

with

$β^{*} \in R^{p}$ , an unknown vector of fixed regression coefficients;
$v_{m} \in R^{q}$ , m = 1,…,M vectors of group-specific random effects, with $v_{m} \underset{i . i . d .}{\sim} N (0, Ψ^{*})$ , Ψ* an unknown q × q positive definite covariance matrix;
errors $ϵ_{m} \underset{i . i . d .}{\sim} N (0, σ^{* 2} I_{n \times n})$ for unknown σ*², which are independent of v₁,…,v_M; and
$X_{m} \in R^{n \times p}$ and $Z_{m} \in R^{n \times q}$ known design matrices.

By construction, β* represents effects shared across groups while v_m, m = 1,…,M, represent group-specific deviations. It will be convenient to write the model more compactly. Define vectors $y = [y_{1}^{T}, \dots, y_{M}^{T}]^{T}$ , $v = [v_{1}^{T}, \dots, v_{M}^{T}]^{T}$ , $ϵ = [ϵ_{1}^{T}, \dots, ϵ_{M}^{T}]^{T}$ , a stacked matrix $X = [X_{1}^{T}, \dots, X_{M}^{T}]^{T}$ , and Z = diag(Z₁,…,Z_M). Then we can write (2.1) as

y = X β^{*} + Z v + ϵ .

(2.2)

Marginalizing out the random effects yields

y \sim N (X β^{*}, V (σ^{* 2}, Ψ^{*})) with V (σ^{* 2}, Ψ^{*}) = σ^{* 2} I_{N \times N} + Z Ψ^{* (B)} Z^{T},

(2.3)

where Ψ*^(B) = I_M×M ⊗ Ψ*. This implies that V(σ*², Ψ*) is block-diagonal and observations belonging to different groups are independent. Thus, the inclusion of random effects only induces dependencies between observations belonging to the same group. We will be primarily working with the marginal form (2.3) in subsequent sections.
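To make the block structure of the marginal covariance concrete, here is a minimal NumPy sketch that draws data from (2.1)-(2.2) and assembles V(σ*², Ψ*) as in (2.3). The function names `simulate_lmm` and `marginal_cov`, the Gaussian design matrices, and the absence of column standardization are assumptions made for this illustration only.

```python
import numpy as np

def simulate_lmm(M, n, p, q, beta, sigma2, Psi, rng):
    """Draw (y, X, Z) from y = X beta + Z v + eps with M groups of size n.

    Gaussian designs are an assumption of this sketch, not of the model.
    """
    N = n * M
    X = rng.standard_normal((N, p))
    Z = np.zeros((N, q * M))
    for m in range(M):
        # Z = diag(Z_1, ..., Z_M): group m only loads on its own random effects.
        Z[m * n:(m + 1) * n, m * q:(m + 1) * q] = rng.standard_normal((n, q))
    # Stack v_1, ..., v_M, each drawn from N(0, Psi).
    v = rng.multivariate_normal(np.zeros(q), Psi, size=M).reshape(-1)
    eps = np.sqrt(sigma2) * rng.standard_normal(N)
    return X @ beta + Z @ v + eps, X, Z

def marginal_cov(Z, sigma2, Psi, M):
    """V = sigma2 * I + Z (I_M kron Psi) Z^T, the covariance in (2.3)."""
    Psi_B = np.kron(np.eye(M), Psi)
    return sigma2 * np.eye(Z.shape[0]) + Z @ Psi_B @ Z.T
```

Because Z is block-diagonal, the off-diagonal blocks of the resulting V vanish, which is the independence across groups noted above.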

We study the presented model under the following assumptions:

High dimensions: We allow p, the number of fixed regression coefficients, to be possibly much larger than N. On the other hand, q, the number of random effect variables, is assumed to be of constant order, or at least smaller than n.
Sparsity of β*: We assume β* to be sparse in the sense that most of its elements are zero: a more precise specification on the level of sparsity required is detailed in Section 3.2.
Structure of Ψ*: Our paper primarily considers the scenario of Ψ* = τ*²I_q×q. However, our method, and corresponding theoretical results, can be extended to accommodate the more general scenario of Ψ* = D* where D* is a diagonal q × q matrix.
Standardization of design matrices: The design matrices X and Z are assumed fixed and standardized with $‖ x_{j} ‖_{2}^{2} = N$ for j ∈ {1,…,p} and $‖ z_{j} ‖_{2}^{2} = n$ for j ∈ {1,…,qM}.

3. A ridge-based inferential framework

We would like to test null hypotheses of the form (1.1), i.e., $H_{0, G} : β_{j}^{*} = 0$ for all j ∈ G, for subsets G ⊂ {1,…,p}, and construct confidence intervals for $β_{j}^{*}$ . This section formally introduces our inferential framework. We first describe its foundation, the de-biased ridge estimator, and show how it can be used to accomplish these tasks. We then detail how to assemble the components needed to construct this de-biased ridge estimator and approximate its limiting distribution. Theoretical justification of our approach is provided along the way.

3.1. A de-biased ridge estimator

As in Bühlmann (2013), our starting point is the ridge estimator given by

\hat{β} = arg min_{β \in R^{p}} ‖ y - X β ‖_{2}^{2} ∕ N + λ ‖ β ‖_{2}^{2} .

(3.1)

This estimator is natural in models with homoscedastic and uncorrelated errors, but in the linear mixed effect model the random effects induce correlation among observations. We thus refer to $\hat{β}$ from (3.1) as the ‘naive’ ridge estimator. The estimator has a simple closed form expression,

\hat{β} = N^{- 1} {(\hat{Σ} + λ I_{p \times p})}^{- 1} X^{T} y,

(3.2)

where $\hat{Σ} = X^{T} X ∕ N$ . It is straightforward to show that the ridge estimator is normally distributed and that its covariance matrix, multiplied by a factor of N, is

Ω^{*} = (\hat{Σ} + λ I_{p \times p})^{- 1} X^{T} V (σ^{* 2}, τ^{* 2}) X (\hat{Σ} + λ I_{p \times p})^{- 1} ∕ N .

(3.3)
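The closed form (3.2) and the matrix in (3.3) translate directly into a few lines of NumPy. The sketch below is only an illustration; the function names `naive_ridge` and `ridge_omega` are hypothetical, and V is passed in as a known matrix although in practice it depends on unknown variance parameters.

```python
import numpy as np

def naive_ridge(X, y, lam):
    """Ridge estimator (3.2): N^{-1} (Sigma_hat + lam I)^{-1} X^T y."""
    N, p = X.shape
    Sigma_hat = X.T @ X / N
    return np.linalg.solve(Sigma_hat + lam * np.eye(p), X.T @ y) / N

def ridge_omega(X, V, lam):
    """Omega from (3.3) for a given marginal covariance V of y."""
    N, p = X.shape
    Sigma_hat = X.T @ X / N
    # A = (Sigma_hat + lam I)^{-1} X^T; the inverse is symmetric, so Omega = A V A^T / N.
    A = np.linalg.solve(Sigma_hat + lam * np.eye(p), X.T)
    return A @ V @ A.T / N
```

Solving the linear system rather than forming the inverse explicitly is a numerical design choice of this sketch; for λ > 0 the system is well posed even when p > N.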

As in Bühlmann (2013), we assume that the diagonal entries of $Ω^{*} = (ω_{j k}^{*})$ satisfy

ω_{min}^{*} \equiv min_{j \in {1, \dots, p}} ω_{j j}^{*} > 0 .

(3.4)

Likewise, we do not require (3.4) to be bounded away from 0 as a function of N or p. This condition, in fact, is fairly mild; it is only violated under special kinds of design matrices. To illustrate, define R ≡ rank(X) and let X = QDΓ^T be the singular value decomposition with left singular vectors $Q \in R^{N \times N}$ satisfying Q^TQ = I_N×N, $D \in R^{N \times N}$ a diagonal matrix with entries s₁ ≥ … ≥ s_N (i.e., singular values of X), and right singular vectors $Γ \in R^{p \times N}$ satisfying Γ^TΓ = I_N×N. Let ν_min(A) and ν_max(A) be the smallest and largest eigenvalue of any square matrix A, respectively. We can then show the following.

Lemma 1. Condition (3.4) holds if and only if X ≠ 0 and

min_{j \in {1, \dots, p}} \max_{k \in {1, \dots, N}, s_{k} \neq 0} Γ_{j k}^{2} > 0 .

(3.5)

In the high-dimensional case with R ≤ N < p, the parameter β* is not identifiable: many vectors $θ \in R^{p}$ satisfy Xβ* = Xθ. A natural parameter to consider, as noted in Shao and Deng (2012), is θ* = P_X^Tβ* = X^T(XX^T)⁻Xβ* = ΓΓ^Tβ*, the projection of β* onto the linear space generated by the rows of X. As it turns out, under condition (3.4), or equivalently (3.5), the ridge estimator $\hat{β}$ is a reasonable proxy for θ* when λ is sufficiently small.
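The projection defining θ* can be checked numerically. The sketch below, with the hypothetical helper `row_space_projection`, uses the identity that the Moore-Penrose pseudoinverse X⁺ satisfies X⁺X = X^T(XX^T)⁻X; it is an illustration under that standard fact, not code from the paper.

```python
import numpy as np

def row_space_projection(X):
    """Projector onto the row space of X, X^T (X X^T)^- X, computed as pinv(X) @ X."""
    return np.linalg.pinv(X) @ X

# Illustrative high-dimensional design with N = 5 < p = 12.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 12))
beta_star = rng.standard_normal(12)
P = row_space_projection(X)
theta_star = P @ beta_star   # projected parameter; it gives the same fitted values as beta_star
```

Here `X @ theta_star` and `X @ beta_star` agree up to rounding, which is the non-identifiability described above.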

Proposition 2. Suppose that λ > 0 and (3.4), or equivalently, (3.5), holds. Then, under our linear mixed effect model from Section 2, the ridge estimator (3.2) satisfies

max_{j \in {1, \dots, p}} ∣ E [{\hat{β}}_{j}] - θ_{j}^{*} ∣ \leq λ ‖ θ^{*} ‖_{2} ν_{min, +} {(\hat{Σ})}^{- 1}, \quad min_{j \in {1, \dots, p}} Var [{\hat{β}}_{j}] \geq N^{- 1} ω_{min}^{*},

where $ν_{min, +} (\hat{Σ})$ refers to the smallest non-zero eigenvalue of $\hat{Σ}$ .

Proposition 2, which is proven in the Appendix, implies that the bias in estimating θ* with $\hat{β}$ is small when λ > 0 is sufficiently small. We explicitly quantify how small λ needs to be for the estimation bias to be smaller than the standard error of $\hat{β}$ .

Corollary 3. Suppose that the ridge penalty parameter λ > 0 is chosen such that $\frac{λ}{\sqrt{ω_{min}^{*}}} \leq \frac{ν_{min, +} (\hat{Σ})}{\sqrt{N} ‖ θ^{*} ‖_{2}}$ , and that condition (3.4), or equivalently, (3.5) holds. Then,

max_{j \in {1, \dots, p}} ∣ E [{\hat{β}}_{j}] - θ_{j}^{*} ∣ \leq min_{j \in {1, \dots, p}} \sqrt{Var [{\hat{β}}_{j}]} .
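For readers who want the intermediate step, the corollary can be obtained by chaining the bias bound of Proposition 2 with the condition on λ, using that (3.3) gives Var[β̂_j] = ω*_jj / N. The display below is a short worked sketch of that chain in the notation of this section.

```latex
\max_{j} \left| E[\hat{\beta}_j] - \theta_j^{*} \right|
  \;\le\; \frac{\lambda \, \| \theta^{*} \|_2}{\nu_{\min,+}(\hat{\Sigma})}
  \;\le\; \sqrt{\frac{\omega_{\min}^{*}}{N}}
  \;=\; \min_{j} \sqrt{\frac{\omega_{jj}^{*}}{N}}
  \;=\; \min_{j} \sqrt{\operatorname{Var}[\hat{\beta}_j]} .
```

The first inequality is the bias bound, and the second is the assumed bound on λ after multiplying both sides by $\sqrt{ω_{min}^{*}} ‖ θ^{*} ‖_{2} ∕ ν_{min, +} (\hat{Σ})$ .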

Our interest, however, lies in β*, not θ*. Thus, for $\hat{β}$ to be useful, we need to adjust $\hat{β}$ for the projection bias $B_{j} = θ_{j}^{*} - β_{j}^{*}$ . By definition of θ*, one observes that

B_{j} = (P_{X^{T}} β^{*})_{j} - β_{j}^{*} = (P_{X^{T}})_{j j} β_{j}^{*} - β_{j}^{*} + \sum_{k \neq j} (P_{X^{T}})_{j k} β_{k}^{*},

(3.6)

which, under the null hypothesis H_0,j : $β_{j}^{*} = 0$ , becomes,

B_{H_{0}, j} = \sum_{k \neq j} (P_{X^{T}})_{j k} β_{k}^{*} .

(3.7)

The quantity can be approximated by

{\hat{B}}_{H_{0}, j} = \sum_{k \neq j} (P_{X^{T}})_{j k} {\hat{β}}_{k}^{init} .

(3.8)

where ${\hat{β}}^{init}$ is a consistent initial estimator of β* (and consistency occurs under additional assumptions). Consider then the corrected ridge estimator ${\hat{β}}_{j}^{corr}$ as a statistic for testing H_0,j:

{\hat{β}}_{j}^{corr} = {\hat{β}}_{j} - {\hat{B}}_{H_{0}, j} = {\hat{β}}_{j} - \sum_{k \neq j} (P_{X^{T}})_{j k} {\hat{β}}_{k}^{init} .

(3.9)

Assuming that $ω_{min}^{*} > 0$ , we can write

{\hat{β}}_{j}^{corr} = W_{j} + γ_{j},

where $W_{j} = {\hat{β}}_{j} - E [{\hat{β}}_{j}]$ is the centered ridge estimator and

γ_{j} = (P_{X^{T}})_{j j} β_{j}^{*} - \sum_{k \neq j} (P_{X^{T}})_{j k} ({\hat{β}}_{k}^{init} - β_{k}^{*}) + δ_{j}, δ_{j} = δ_{j} (λ) = E [{\hat{β}}_{j}] - θ_{j}^{*} .

A rearrangement of the above set of equations yields

\frac{{\hat{β}}_{j}^{corr}}{(P_{X^{T}})_{j j}} - β_{j}^{*} = \frac{W_{j}}{(P_{X^{T}})_{j j}} - \sum_{k \neq j} \frac{(P_{X^{T}})_{j k}}{(P_{X^{T}})_{j j}} ({\hat{β}}_{k}^{init} - β_{k}^{*}) + \frac{δ_{j}}{(P_{X^{T}})_{j j}} .

(3.10)

Then, from model (2.3), it follows that

W_{1}, \dots, W_{p} \sim N (0, Ω^{*} ∕ N) .

(3.11)

The normalizing factors needed to bring the W_j to N(0, 1) scale are given by $κ_{j} = κ_{j} (N, p) = \sqrt{N ∕ ω_{j j}^{*}}$ . We then obtain the following result, whose proof is straightforward.

Theorem 4. Suppose we choose the ridge penalty parameter λ > 0 such that

λ {(ω_{min}^{*})}^{- 1 ∕ 2} = o (ν_{min, +} (\hat{Σ}) ∕ (N^{1 ∕ 2} ‖ θ^{*} ‖_{2})), (N, p \to \infty),

(3.12)

and assume that for our choice of ${\hat{β}}^{init}$ , there exist constants C_j = C_j(N, p) such that

P [⋂_{j = 1}^{p} {∣ κ_{j} (N, p) \sum_{k \neq j} (P_{X^{T}})_{j k} ({\hat{β}}_{k}^{init} - β_{k}^{*}) ∣ \leq C_{j} (N, p)}] \to 1 (N, p \to \infty) .

(3.13)

Then, under the null hypothesis, H_0,j, for all w > 0,

\underset{N, p \to \infty}{lim sup} (P [∣ κ_{j} {\hat{β}}_{j}^{corr} ∣ > w] - P [∣ \tilde{W} ∣ + C_{j} > w]) \leq 0,

(3.14)

where $\tilde{W} \sim N (0, 1)$ . In addition, for any sequence of subsets G_p ⊆ {1,…,p}, if H_{0,G_p} is true, then for any w > 0,

\underset{N, p \to \infty}{lim sup} (P [\max_{j \in G_{p}} ∣ κ_{j} {\hat{β}}_{j}^{corr} ∣ > w] - P [\max_{j \in G_{p}} (∣ \tilde{W} ∣ + C_{j}) > w]) \leq 0 .

(3.15)

In subsequent sections, we identify specific scalings of N and p such that Theorem 4 becomes applicable. Based on the asymptotic distributions in Theorem 4, we can construct p-values for testing H_0,G, G ⊆ {1,…,p}. For testing the individual null hypothesis H_0,j, we define the p-value for the two-sided alternative as

ϱ_{j} = 2 (1 - Φ ((κ_{j} ∣ {\hat{β}}_{j}^{corr} ∣ - C_{j})_{+})),

(3.16)

where Φ is the standard normal distribution function. For testing the group null hypothesis H_0,G, ∣G∣ > 1, we define the p-value as

ϱ_{G} = 1 - P [\max_{j \in G} (κ_{j} ∣ W_{j} ∣ + C_{j}) \leq \max_{j \in G} κ_{j} ∣ {\hat{β}}_{j}^{corr} ∣],

(3.17)

where W₁,…,W_p are as in (3.11). From Theorem 4, we can derive the following corollary.

Corollary 5. Under the conditions in Theorem 4, for any α ∈ (0, 1), the following statements hold:

\underset{N, p \to \infty}{lim sup} P [ϱ_{j} \leq α] - α \leq 0 if H_{0, j} is true,

\underset{N, p \to \infty}{lim sup} P [ϱ_{G} \leq α] - α \leq 0 if H_{0, G} is true.
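The corrected estimator (3.9) and the individual p-value (3.16) can be sketched in NumPy as follows. The function names are hypothetical, and the inputs (the ridge estimate, an initial estimate, the projection matrix P_{X^T}, the diagonal of Ω*, and the constants C_j) are assumed to be available from the other steps of the procedure.

```python
import math
import numpy as np

def corrected_ridge(beta_ridge, beta_init, P):
    """(3.9): subtract sum_{k != j} P_jk * beta_init_k from each ridge coordinate."""
    off_diagonal_part = P @ beta_init - np.diag(P) * beta_init
    return beta_ridge - off_diagonal_part

def individual_p_values(beta_corr, omega_diag, C, N):
    """(3.16): 2 * (1 - Phi((kappa_j |beta_corr_j| - C_j)_+)) with kappa_j = sqrt(N / omega_jj)."""
    kappa = np.sqrt(N / omega_diag)
    stat = np.maximum(kappa * np.abs(beta_corr) - C, 0.0)
    # For the standard normal, 2 * (1 - Phi(t)) equals erfc(t / sqrt(2)).
    return np.array([math.erfc(t / math.sqrt(2.0)) for t in stat])
```

Using `math.erfc` keeps the sketch free of additional dependencies; any standard normal distribution function would serve equally well.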

3.2. Consistent estimation of variance parameters

As presented, the de-biased ridge framework depends on the values of the unknown parameters σ*² and τ*². We employ a two-step approach to consistent estimation of these parameters.

Step 1. Let $S = {j : β_{j}^{*} \neq 0}$ be the support of β*, with cardinality d = ∣S∣. We use the Lasso estimator ${\hat{β}}^{L} = arg min_{β \in R^{p}} ‖ y - X β ‖_{2}^{2} ∕ N + 2 λ_{L} ‖ β ‖_{1}$ with an appropriate choice of tuning parameter λ_L to identify an initial guess of the elements (i.e., indices) in S. We define $\hat{S} = {j : {\hat{β}}_{j}^{L} \neq 0}$ as our guess for the support S. By properties of the Lasso, $∣ \hat{S} ∣ \leq N$ , although, in general, $\hat{S}$ may not be a good estimate of S.
Step 2. Working with the (potentially misspecified) random effects model
$y = X_{\hat{S}} β_{\hat{S}}^{*} + Z v + ϵ,$ (3.18)

we apply Henderson’s Method III (Henderson, 1953) to form estimates ${\hat{σ}}^{2}$ and ${\hat{τ}}^{2}$ . Henderson’s Method III is particularly tractable theoretically and enables us to study consistency in the scenario where (3.18) is actually misspecified, i.e., $∣ S ∖ \hat{S} ∣ > 0$ . For a discussion of Henderson’s methods and the appeal of Method III, see Searle (1968).

In recent years Henderson’s methods have largely been supplanted by alternatives such as restricted maximum likelihood (REML) for variance component estimation (Harville, 1977); it is customary to refer to variances of random effects as variance components. We thus provide a brief overview of what Henderson’s Method III entails. Consider, first, the low-dimensional model (2.3) with p < N. To simplify the notation in the following explanation, we momentarily define $\tilde{X} = [X Z]$ . The idea behind Henderson’s methods is to treat fixed and random effects alike when fitting, and to match the differences in the reductions in the sum of squares between sub-models of (2.3) to their expected values, not unlike a method-of-moments approach. To elaborate, in fitting (2.3) to data y, the reduction in the sum of squares is

R (β, v) = y^{T} P_{\tilde{X}} y .

(3.19)

Likewise, the decrease in the sum of squares due to fitting the reduced model y = Xβ + ϵ is

R (β) = y^{T} P_{X} y .

(3.20)

The expected difference in the reductions $R (v ∣ β) \equiv R (β, v) - R (β)$ is

E [R (v ∣ β)] = τ^{* 2} tr (Z^{T} [I_{N \times N} - P_{X}] Z) + σ^{* 2} [rank (\tilde{X}) - rank (X)] .

(3.21)

Moreover,

E [y^{T} y - R (β, v)] = σ^{* 2} [N - rank (\tilde{X})] .

(3.22)

Together, (3.21) and (3.22), when matching theoretical expectations to empirical averages, form a triangular system of linear equations, from which we derive ${\hat{σ}}^{2}$ and ${\hat{τ}}^{2}$ . We find

{\hat{σ}}^{2} = \frac{y^{T} (I_{N \times N} - P_{\tilde{X}}) y}{N - rank (\tilde{X})},

(3.23)

{\hat{τ}}^{2} = \frac{y^{T} (P_{\tilde{X}} - P_{X}) y - {\hat{σ}}^{2} [rank (\tilde{X}) - rank (X)]}{tr (Z^{T} (I_{N \times N} - P_{X}) Z)} .

(3.24)

It is straightforward to see that the ${\hat{σ}}^{2}$ and ${\hat{τ}}^{2}$ generated from (3.23) and (3.24) are unbiased, presuming that the true model is y = Xβ + Zv + ϵ. For consistency, some additional assumptions are needed, which we will discuss later in this section.
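The estimators (3.23) and (3.24) can be sketched in NumPy as below. This is an illustration for the low-dimensional case only; the names `proj` and `henderson_method3` are hypothetical, and projectors are formed through the pseudoinverse for simplicity rather than efficiency.

```python
import numpy as np

def proj(A):
    """Orthogonal projector onto the column space of A, i.e. A (A^T A)^- A^T."""
    return A @ np.linalg.pinv(A)

def henderson_method3(y, X, Z):
    """Variance component estimates (sigma2_hat, tau2_hat) from (3.23) and (3.24)."""
    N = y.shape[0]
    X_tilde = np.hstack([X, Z])
    P_full, P_X = proj(X_tilde), proj(X)
    r_full = np.linalg.matrix_rank(X_tilde)
    r_X = np.linalg.matrix_rank(X)
    # (3.23): residual sum of squares of the full fit over its degrees of freedom.
    sigma2_hat = y @ (np.eye(N) - P_full) @ y / (N - r_full)
    # (3.24): extra reduction due to Z, corrected for the noise contribution.
    trace_term = np.trace(Z.T @ (np.eye(N) - P_X) @ Z)
    tau2_hat = (y @ (P_full - P_X) @ y - sigma2_hat * (r_full - r_X)) / trace_term
    return sigma2_hat, tau2_hat
```

Note that, as with other moment-based estimators, nothing in this construction forces `tau2_hat` to be non-negative in a given sample.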

Returning to our two-step procedure and high-dimensional setup, Step 1 identifies a candidate low-dimensional sub-model, which is used in Step 2 to obtain variance component estimates. We do not require the candidate model to encompass the truth; however, λ_L should be such that $\hat{S}$ , from Step 1, reliably captures the indices of the ‘strong’ signals in β*. The idea is that missing ‘weak’ signals only negligibly affect the accuracy of ${\hat{σ}}^{2}$ and ${\hat{τ}}^{2}$ in Step 2. We now show that this two-step procedure yields consistent estimators ${\hat{σ}}^{2}$ and ${\hat{τ}}^{2}$ in the setting where N → ∞ (specifically, n is fixed, but the number of groups M → ∞) and d² log p/M = o(1), provided some additional technical assumptions hold. From here on, this will also be the scaling assumed for Theorem 4, as well as Corollary 5. We first present the assumptions necessary for consistency and then formally state the theorem.

For ξ > 1, define the cone

C (ξ, S) = {u \in R^{p} : ‖ u_{S^{c}} ‖_{1} \leq ξ ‖ u_{S} ‖_{1}} .

(3.25)

Assumption 1. For some constant ξ > 1,

ζ \equiv inf {\frac{‖ \hat{Σ} u ‖_{\infty}}{‖ u_{A} ‖_{\infty}} : u \in C_{-} (ξ, S), ∣ A ∖ S ∣ \leq p} ≳ 1

(3.26)

with $C_{-} (ξ, S) \equiv {u : u \in C (ξ, S), u_{j} {\hat{Σ}}_{j, \cdot} u \leq 0 \forall j \notin S}$ the sign-restricted version of (3.25), where ${\hat{Σ}}_{j, \cdot}$ denotes the jth row of $\hat{Σ}$ .

The quantity ζ in (3.26) is defined more generally in Ye and Zhang (2010), where it is termed a sign-restricted cone invertibility factor (SCIF). We have the following lemma.

Lemma 6. Suppose Assumption 1 holds, and let λ_L be defined by (A.2) (or (3.28)) for some small ε > 0 and ξ as in Assumption 1. If u* ≤ λ_L(ξ − 1)/(ξ + 1), then

‖ {\hat{β}}^{L} - β^{*} ‖_{\infty} \leq \frac{λ_{L} + u^{*}}{ζ} \leq \frac{2 ξ λ_{L}}{(ξ + 1) ζ} .

(3.27)

In the proof of Lemma 6 (provided in the Appendix), SCIF naturally appears when deriving an upper bound for $‖ {\hat{β}}^{L} - β^{*} ‖_{\infty}$ . Lemma 6 assumes that Assumption 1 is satisfied, and that λ_L in Step 1 is chosen such that

λ_{L} = \frac{(ξ + 1)}{(ξ - 1)} \sqrt{\frac{2 (σ^{* 2} + τ^{* 2} q n) (\log p - \log (ε ∕ 2))}{N}} ≍ \sqrt{\frac{q \log p}{M}} = o (1),

(3.28)

with ξ as in Assumption 1. It then establishes that

‖ {\hat{β}}^{L} - β^{*} ‖_{\infty} \leq 2 ξ λ_{L} ∕ (ζ (ξ + 1)) = o (1),

with probability exceeding 1 – ε, where ε > 0 can be taken arbitrarily small. A direct implication is that if the lemma’s conditions are satisfied, then with probability close to one $S ∖ \hat{S}$ only includes indices corresponding to ‘weak’ signals in β* of magnitude less than 4ξλ_L/(ζ(ξ + 1)) = o(1), which is part of what Step 1 sets out to achieve.
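For concreteness, the theoretical tuning parameter in (3.28) can be evaluated as in the sketch below. The function name `lasso_lambda` is hypothetical, and the variance parameters are treated as known, which is an assumption of this illustration only since they are unknown in practice.

```python
import math

def lasso_lambda(sigma2, tau2, q, n, N, p, eps, xi):
    """Evaluate lambda_L from (3.28); requires xi > 1, eps > 0 and p >= 1."""
    scale = (xi + 1.0) / (xi - 1.0)
    noise_level = sigma2 + tau2 * q * n
    log_term = math.log(p) - math.log(eps / 2.0)
    return scale * math.sqrt(2.0 * noise_level * log_term / N)
```

With all other inputs held fixed, the value shrinks at rate N^{-1/2}, so quadrupling N halves λ_L.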

Assumption 2. There exists an integer N′ ≲ d such that for the same constant ξ > 1 as in Assumption 1,

\frac{d ξ^{2}}{ψ^{2} (ξ, S)} < \frac{N^{'}}{ψ_{+} (N^{'}, S)},

(3.29)

where

ψ (ξ, S) = min {\frac{d^{1 ∕ 2} ‖ X u ‖_{2}}{N^{1 ∕ 2} ‖ u_{S} ‖_{1}} : u \in C (ξ, S), u \neq 0}

(3.30)

and $ψ_{+} (N^{'}, S) = \max_{A \cap S = \emptyset, ∣ A ∣ \leq N^{'}} ν_{max} (\frac{X_{A}^{T} X_{A}}{N})$ is the sparse upper eigenvalue of models disjoint from S.

Assumption 2 is needed to control the number of false positive selections in $\hat{S}$ from Step 1. In particular, we have

Lemma 7. Suppose that Assumption 2 holds, and λ_L is defined according to (3.28). In the event that u* ≤ λ_L(ξ − 1)/(ξ + 1), $∣ \hat{S} ∖ S ∣ < N^{'}$ .

Put simply, Lemma 7 claims that under Assumption 2 and our choice of λ_L from (3.28), the total number of false selections in Step 1 is bounded by N′, with probability exceeding 1 – ε. The proof is provided in the Appendix.

Assumption 3. Let $\overset{ˇ}{X}$ be formed by joining any N′ columns in X with $β_{j}^{*} = 0$ to the d support columns in X. For the same N′ as in Assumption 2,

rank ([I_{N \times N} - P_{\overset{ˇ}{X}}] Z) = rank (Z) = q M,

(3.31)

Z^{T} [I_{N \times N} - P_{\overset{ˇ}{X}}] Z ≳ I_{q M \times q M},

(3.32)

and the qM singular values of $[I_{N \times N} - P_{\overset{ˇ}{X}}] Z, s_{1}, \dots, s_{q M}$ , satisfy

\frac{∣ {i : s_{i} \neq 0} ∣}{{(\sum_{i = 1}^{q M} s_{i}^{2})}^{2}} = o (1) .

(3.33)

By (3.31) in Assumption 3, the fixed data matrix Z has full column rank, and no column vector of Z can be represented as a linear combination of the column vectors of any ‘feasible’ $X_{\hat{S}}$ , assuming that λ_L is chosen according to (3.28). After all, N′ + d is the upper bound on the number of selected fixed effects with probability exceeding 1 – ε (Lemma 7). Additionally, by (3.32), the sum of the squared perpendicular distances between each column vector in Z and its projection onto the linear subspace spanned by the column vectors of feasible $X_{\hat{S}}$ matrices is at least on the order of qM (substantial, given there are qM columns in Z). The latter half of Assumption 3 requires that all columns of $(I_{N \times N} - P_{\overset{ˇ}{X}}) Z$ be ‘close’ to being linearly independent of one another and ‘contribute equally’ to its rank. In particular, note that (3.33) is satisfied if

c_{1} < \frac{s_{j}}{s_{k}} < c_{2} for j \neq k and some constants c_{1}, c_{2} > 0 .

(3.34)

It is thus clear that (3.31) and (3.32) imply that random effects must not be confounded with any ‘feasible’ set of fixed effects (from Step 1), while (3.33) implies that the random effects are not confounded with one another. Analogous conditions were shown to be necessary to prove consistency of REML estimators in Jiang (1996).

Assumption 4. For any j ∈ S such that $∣ β_{j}^{*} ∣ < 4 ξ λ_{L} ∕ (ζ (ξ + 1))$ , with λ_L defined as in (3.28), $‖ Γ_{\tilde{X}} x_{j} ‖_{\infty} ≍ 1$ . Here, $Γ_{\tilde{X}} D_{\tilde{X}} Γ_{\tilde{X}}^{T}$ is the eigen-decomposition of $\tilde{X} {({\tilde{X}}^{T} \tilde{X})}^{-} {\tilde{X}}^{T}$ (defined for this Assumption) with $\tilde{X} = [\overset{ˇ}{X} Z]$ , where $\overset{ˇ}{X}$ is formed by joining any N′ columns in X with $β_{j}^{*} = 0$ to the d – 1 support (excluding j) columns in X. The N′ referenced here is the same as in Assumptions 2 and 3.

Assumption 4 requires that covariates corresponding to weak (but non-zero) signals in β* (for which we cannot bound the probability of inclusion in $X_{\hat{S}}$ ) are not too strongly correlated with the covariates in $X_{\hat{S}}$ or with the covariates associated with the random effects. This somewhat resembles the irrepresentability conditions needed for model selection consistency of the Lasso; see, e.g., Zhao and Yu (2006). However, the two assumptions are very different: aside from differences in the quantities involved, a key difference is that the irrepresentability condition imposes a very stringent upper bound on the confounding between fixed effects, whereas Assumption 4 only requires boundedness. As shown in the numerical experiments in the Appendix, as the number of covariates and sparsity of the model vary, Assumption 4 is very likely to be satisfied with even small bounds, whereas the irrepresentability condition is increasingly less likely to hold.

We can now state our main result on consistency of variance component estimators, which validates our two-step procedure.

Theorem 8. Consider N, p → ∞ with n fixed and M → ∞, and suppose that d²q log p/M = o(1). Suppose Assumptions 1-4 are satisfied and λ_L is chosen according to (3.28) with ε ∝ 1/p. Then, ${\hat{σ}}^{2}$ and ${\hat{τ}}^{2}$ are consistent for σ*² and τ*², respectively, i.e.,

∣ {\hat{σ}}^{2} - σ^{* 2} ∣ = o_{P} (1) and ∣ {\hat{τ}}^{2} - τ^{* 2} ∣ = o_{P} (1) (N, p \to \infty) .

(3.35)

Because $∣ {\hat{σ}}^{2} - σ^{* 2} ∣$ and $∣ {\hat{τ}}^{2} - τ^{* 2} ∣$ are both o_P(1), we can use ${\hat{σ}}^{2}$ and ${\hat{τ}}^{2}$ as plug-in values for σ*² and τ*², respectively. From there, we can form a consistent estimator of Ω* and normalizing constants κ_j.

For practical applications, REML can be used as a substitute for Henderson’s Method III for Step 2. Theory for REML would be a possible avenue for further explorations.

3.3. An initial estimator for β* and our choice of C_j

To form ${\hat{β}}^{init}$ , we consider the ordinary least-squares (OLS) fit restricted to $\hat{S}$ , i.e.,

{\hat{β}}^{init} = arg min_{β \in R^{p} : β_{{\hat{S}}^{c}} = 0} ‖ y - X β ‖_{2}^{2} .

(3.36)
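A minimal NumPy sketch of the restricted least-squares fit in (3.36) is given below. The function name `restricted_ols` and the representation of the selected set as a Python collection of column indices are assumptions made for this illustration.

```python
import numpy as np

def restricted_ols(X, y, support):
    """OLS on the columns in `support`; all other coefficients are fixed at zero, as in (3.36)."""
    p = X.shape[1]
    beta_init = np.zeros(p)
    idx = np.asarray(sorted(support), dtype=int)
    if idx.size > 0:
        # lstsq returns the minimum-norm solution if the selected columns are collinear.
        coef, *_ = np.linalg.lstsq(X[:, idx], y, rcond=None)
        beta_init[idx] = coef
    return beta_init
```

For example, `restricted_ols(X, y, {1, 4})` fits only the second and fifth columns of `X` and leaves every other entry of the returned vector at zero.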

We proceed to demonstrate that the error ${\hat{β}}^{init} - β^{*}$ is o(1) in ℓ₁ norm.

Assumption 5. For the same N′ as in Assumptions 2, 3, 4, the sparse lower eigenvalue for models containing S of cardinality smaller than d + N′ is constant and greater than 0,

ψ_{-} (N^{'}, S) = min_{A \supset S, ∣ A ∖ S ∣ \leq N^{'}} ν_{min} (\frac{X_{A}^{T} X_{A}}{N}) ≳ 1,

Assumption 5, in conjunction with the previous assumptions and the choice of λ_L in (3.28), can be used to control the ℓ₁ norm of the estimation error ${\hat{β}}^{init} - β^{*}$ .

Theorem 9. Suppose Assumptions 1-5 hold. Under the same conditions as in Theorem 8, for some universal constant C > 0,

‖ {\hat{β}}^{init} - β^{*} ‖_{1} \leq C d \sqrt{\frac{q \log p}{M}}

(3.37)

with probability converging to 1 as N, p → ∞.

Theorem 9 implies that we have the following crude bound, based on Hölder’s inequality,

∣ κ_{j} \sum_{k \neq j} (P_{X^{T}})_{j k} ({\hat{β}}_{k}^{init} - β_{k}^{*}) ∣ \leq κ_{j} \max_{k \neq j} ∣ (P_{X^{T}})_{j k} ∣ ‖ {\hat{β}}^{init} - β^{*} ‖_{1} \leq κ_{j} \max_{k \neq j} ∣ (P_{X^{T}})_{j k} ∣ C d λ_{L} .

(3.38)

The following corollary is a direct consequence of the crude bound (3.38).

Corollary 10. Suppose the conditions in Theorem 9 are satisfied, and that d, the sparsity of β*, satisfies d ≤ C⁻¹ (M/(q log p))^η, with C as in Theorem 9 and η ∈ (0, 1/2). Then,

C_{j} = \max_{k \neq j} ∣ κ_{j} (P_{X^{T}})_{j k} ∣ {(\frac{q \log p}{M})}^{1 ∕ 2 - η}

(3.39)

satisfies condition (3.13) in Theorem 4.
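
The constants in (3.39) are simple to evaluate once the projection matrix $P_{X^{T}}$ and the normalizing constants κ_j are available. The sketch below assumes both are supplied by the caller (the function name `c_constants` and the nested-list layout are illustrative choices, not taken from the paper) and only performs the arithmetic of Corollary 10.

```python
import math

def c_constants(P, kappa, q, M, eta):
    """Evaluate C_j = max_{k != j} |kappa_j * P[j][k]| * (q log p / M)^(1/2 - eta).

    P is the p x p matrix P_{X^T} as nested lists and kappa the list of
    normalizing constants; both are assumed to be computed elsewhere.
    """
    p = len(P)
    if not 0.0 < eta < 0.5:
        raise ValueError("eta must lie in (0, 1/2)")
    rate = (q * math.log(p) / M) ** (0.5 - eta)
    C = []
    for j, row in enumerate(P):
        # Largest off-diagonal entry of row j, scaled by kappa_j.
        off_diag = max(abs(kappa[j] * v) for k, v in enumerate(row) if k != j)
        C.append(off_diag * rate)
    return C
```

Smaller values of η shrink the rate factor toward $\sqrt{q \log p ∕ M}$ and hence give smaller C_j, at the price of a stricter sparsity requirement on d.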

4. Numerical experiments

4.1. A practical choice for λ_L

In practical applications, we run into the issue of not being able to set λ_L according to (3.28), as it involves knowing τ* and σ*. However, we can derive a (slightly ad-hoc) approximation of what λ_L should be. Upon closer examination of the proof of Lemma 11, we can substitute the term σ*² + τ*²qn with ν_max(V(σ*, τ*)) = σ*² + τ*²ν_max(Z^TZ). The latter can be approximated according to the following procedure, assuming that the ratio τ*/σ* is not too small:

Apply scaled lasso (Sun and Zhang, 2012) to obtain an initial ‘average’ noise estimate. The solution to the scaled lasso problem is characterized by
$({\hat{β}}^{scaled}, {\hat{σ}}^{scaled}) \in arg min_{β, σ} \frac{‖ y - X β ‖_{2}^{2}}{2 N σ} + \frac{σ}{2} + λ_{univ} ‖ β ‖_{1}$ (4.1)

with $λ_{univ} = \sqrt{2 \log p ∕ N}$ .
Take $λ_{L} = {\hat{σ}}^{scaled} λ_{univ} ρ_{Z}$ with

ρ_{Z} = \sqrt{\frac{ν_{\max} (Z^{T} Z)}{tr (Z^{T} Z) ∕ N}} .

(4.2)

We provide a heuristic justification. Ignoring the finer details involved in the theory, for the scaled lasso, $({\hat{σ}}^{scaled})^{2}$ serves as a good approximation for $‖ ϵ^{*} ‖_{2}^{2} ∕ N$ , where we have defined ϵ* = y – Xβ*. In linear models, ϵ* holds i.i.d. observations drawn from a N(0, σ*²) distribution. By the law of large numbers, $‖ ϵ^{*} ‖_{2}^{2} ∕ N$ converges to σ*² for large N. Under a heteroskedastic error model, with ϵ* independent and $ϵ_{i}^{*} \sim N (0, σ_{i}^{* 2})$ , we can match $‖ ϵ^{*} ‖_{2}^{2} ∕ N$ to its expectation, which is given by $\sum_{i = 1}^{N} σ_{i}^{* 2} ∕ N$ , so $({\hat{σ}}^{scaled})^{2}$ can be used to approximate the ‘average’ noise level. If ϵ* ~ N (0, V(σ*, τ*)), then using a similar expectation matching argument, we can expect $({\hat{σ}}^{scaled})^{2}$ to act as a surrogate for

σ^{* 2} + \frac{τ^{* 2} tr (Z Z^{T})}{N},

(4.3)

which follows from the fact that ∥Γϵ*∥₂ = ∥ϵ*∥₂ for any N × N orthogonal matrix Γ (overloading Γ from (3.5)). What we actually need is σ*² + τ*²ν_max(Z^TZ). Then in the scenario where ratio τ*²/σ*² is not too small, ρ_Z from (4.2) should give us a choice of λ_L that is close to the desired one from (3.28). Our choice of λ_L is constructed according to the above procedure for all subsequent numerical experiments.
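
The procedure above can be sketched in a few lines of Python. The sketch assumes that the scaled lasso noise estimate ${\hat{σ}}^{scaled}$ has already been obtained from some solver and is passed in as `sigma_scaled`; the function names and the use of plain power iteration for ν_max(Z^TZ) are choices made for this illustration only.

```python
import math

def largest_eigenvalue(A, iters=500):
    """Power iteration for the largest eigenvalue of a symmetric PSD matrix A.

    Assumes the uniform starting vector is not orthogonal to the leading
    eigenvector; a random start would be safer in general.
    """
    k = len(A)
    v = [1.0 / math.sqrt(k)] * k
    lam = 0.0
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
        lam = norm
    return lam

def practical_lambda_L(sigma_scaled, Z, p):
    """lambda_L = sigma_scaled * lambda_univ * rho_Z, following (4.1)-(4.2)."""
    N = len(Z)
    cols = len(Z[0])
    ZtZ = [[sum(Z[i][a] * Z[i][b] for i in range(N)) for b in range(cols)]
           for a in range(cols)]
    nu_max = largest_eigenvalue(ZtZ)
    trace = sum(ZtZ[a][a] for a in range(cols))
    rho_Z = math.sqrt(nu_max / (trace / N))
    lam_univ = math.sqrt(2.0 * math.log(p) / N)
    return sigma_scaled * lam_univ * rho_Z
```

For a balanced random intercept design with n observations per group and columns of ones, Z^TZ = n I, so ρ_Z reduces to $\sqrt{n}$ and the universal penalty level is inflated by that factor.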

4.2. A look into p-values

Denote the ‘unblocked’ version of Z as Z_u; i.e., Z_u is an N × q matrix formed by row-wise concatenating the M diagonal blocks in Z. We generate data from model (2.1) according to the following schemes, setting M = 25 and n = 6:

(M1) For p ∈ {300, 600}, q ∈ {1, 2}, we construct [X Z_u] from N i.i.d. realizations from a $N (0, Φ^{*})$ distribution, where $Φ^{*} = \{ϕ_{j k}^{*}\}$ is a (p + q) × (p + q) matrix with $ϕ_{j k}^{*} = {0.2}^{∣ j - k ∣}$ . X and Z (the ‘blocked’ version) are then normalized such that $‖ x_{j} ‖_{2}^{2} = N$ and $‖ z_{j} ‖_{2}^{2} = n$ for all j. For b ∈ {0.5, 1}, we set the p-dimensional vector of fixed regression coefficients to
$β = [b, \dots, b, 0, \dots, 0],$

where the first d ∈ {5, 10} entries of β are nonzero. The variance parameters σ* and τ* are set to 0.5 and 1, respectively.
(M2) Same as (M1) except with Φ* = I_(p+q)×(p+q).
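
A minimal sketch of how the design matrices in (M1) and (M2) might be simulated is given below. It relies on the fact that a stationary AR(1) recursion with coefficient ρ has covariance ρ^{|j − k|}, so no matrix factorization is needed; setting `rho=0` gives (M2). The function name, the seed handling and the list-based layout are assumptions of this example, not the authors' code.

```python
import math
import random

def simulate_design(N, p, q, n, rho=0.2, seed=0):
    """Draw [X Z_u] with rows i.i.d. N(0, Phi), Phi_jk = rho**|j-k|, then normalize.

    Returns X (N x p) with squared column norms N and Z_u (N x q) whose
    within-group columns have squared norm n. Requires N to be a multiple of n.
    """
    if N % n != 0:
        raise ValueError("N must equal M * n for an integer number of groups M")
    rng = random.Random(seed)
    rows = []
    for _ in range(N):
        # AR(1) recursion: unit variances and Cov(w_j, w_k) = rho**|j-k|.
        w = [rng.gauss(0.0, 1.0)]
        for _ in range(p + q - 1):
            w.append(rho * w[-1] + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0))
        rows.append(w)
    X = [r[:p] for r in rows]
    Zu = [r[p:] for r in rows]
    # Normalize columns of X so that ||x_j||^2 = N.
    for j in range(p):
        s = math.sqrt(sum(r[j] ** 2 for r in X) / N)
        for r in X:
            r[j] /= s
    # Normalize each group's block so every column of the blocked Z has ||z_j||^2 = n.
    for m in range(N // n):
        block = Zu[m * n:(m + 1) * n]
        for k in range(q):
            s = math.sqrt(sum(r[k] ** 2 for r in block) / n)
            for r in block:
                r[k] /= s
    return X, Zu
```

With M = 25 and n = 6 as in the text, one would call `simulate_design(N=150, p=300, q=1, n=6)`; the response can then be formed from model (2.1) with the stated β, σ* and τ*.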

The numerical experiments are set up similarly to those in Bühlmann (2013) and Schelldorfer et al. (2011). We set the ridge penalty parameter λ to 1/N for all experiments. Additionally, we set C_j according to Corollary 10 with η = 0.005.

We first consider null hypotheses of the form

H_{0, j} : β_{j} = 0 .

(4.4)

We consider decision rules based on a significance level α = 0.05, i.e., we reject H_0,j if the event E_j = {ϱ_j ≤ 0.05} occurs, where ϱ_j is as defined in (3.16). Following Bühlmann (2013), we evaluate the performance of the tests based on the type I error, averaged over the non-support indices,

Avg. type I error = (p - d)^{- 1} \sum_{j \in S^{c}} \hat{P} (E_{j}),

(4.5)

and the power, averaged over the support indices,

Avg. power = d^{- 1} \sum_{j \in S} \hat{P} (E_{j}),

(4.6)

where $\hat{P}$ denotes the empirical probability over 1000 simulations. The results, presented in Figure 1, suggest that type I error is well-controlled for all combinations of p, q, b and d for the two different models. Power is high in most scenarios but varies with the aforementioned quantities; in particular, it is noticeably lower for the smaller signal strength b, which is to be expected.
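
The two summaries (4.5) and (4.6) can be computed directly from a table of simulated p-values. The sketch below assumes the p-values are stored as one list per simulation run; the function name and argument layout are hypothetical.

```python
def average_error_rates(pvalues, support, alpha=0.05):
    """Average type I error over S^c and average power over S.

    pvalues: list of simulation runs, each a list of p marginal p-values.
    support: indices j with a nonzero true coefficient.
    """
    n_sims = len(pvalues)
    p = len(pvalues[0])
    # Empirical rejection probability of H_{0,j} for every coordinate j.
    reject_freq = [sum(run[j] <= alpha for run in pvalues) / n_sims
                   for j in range(p)]
    S = set(support)
    null_idx = [j for j in range(p) if j not in S]
    type1 = sum(reject_freq[j] for j in null_idx) / len(null_idx)
    power = sum(reject_freq[j] for j in S) / len(S)
    return type1, power
```

Both quantities are averages of per-coordinate rejection frequencies, so a well-calibrated test should give an average type I error near or below α while the average power depends on the signal strength b.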

We also consider null hypotheses of the form

H_{0, G} : β_{j} = 0 for all j \in G .

(4.7)

with G taken either to be {1,…,100} (G1), or {101,…,200} (G2). By construction, G1 contains the support of β while G2 does not, so the hypothesis H_0,G1 should be rejected while H_0,G2 should not. We consider decision rules based on a significance level α = 0.05 and reject H_0,G if the event E_G = {ϱ_G ≤ 0.05} occurs, with ϱ_G defined in (3.17). To evaluate the performance of these tests, we consider type I error and power, which can be represented by $\hat{P} (E_{G 2})$ and $\hat{P} (E_{G 1})$ , respectively, where again, $\hat{P}$ denotes the empirical probability over 1000 simulations. Figure 2 visualizes the results.

Figure 2: Average power vs. average type I error for testing groups of coefficients under the two models for different combinations of p, q, b and d.

4.3. Comparisons with existing methods

In this section, we conduct a short numerical example to examine whether one could ‘naively’ apply inferential procedures for high-dimensional linear models to obtain inference for parameters in mixed models.

Consider Model (M1) from Section 4.2 in the instance of p = 300 and q = 1. Let β* = [0.05, 2, 4, 3, 0.1, 0,…, 0]. We compare our method against

ridge-based inference procedure of Bühlmann (2013), which is an analogue of our method developed for high-dimensional linear models;
lasso-based inference procedure of van de Geer et al. (2014), which entails de-sparsifying a lasso estimator.

The differences are fairly evident when comparing confidence interval coverage. For any α ∈ (0, 1), define $Q_{α} [W_{j}]$ as the α-th quantile of the distribution of W_j. Under the conditions of Theorem 4, if the assumed model is correct, (3.11) suggests that confidence intervals of the form

[\frac{{\hat{β}}_{j}^{corr}}{(P_{X^{T}})_{j j}} - \frac{Q_{1 - α ∕ 2} [W_{j}] + C_{j}}{(P_{X^{T}})_{j j}}, \frac{{\hat{β}}_{j}^{corr}}{(P_{X^{T}})_{j j}} + \frac{Q_{1 - α ∕ 2} [W_{j}] + C_{j}}{(P_{X^{T}})_{j j}}]

should guarantee coverage of at least 100(1 – α)%. Rather than setting the C_j according to Corollary 10, we set them to be the same as the ‘C_j-analogues’ from Bühlmann (2013), to make the two methods comparable. Our choices of C_j are larger than theirs, so if anything, this ad-hoc decision gives the method of Bühlmann (2013) an unfair advantage. In Figure 3, we examine 95% confidence interval coverage for the three methods, based on the above modifications.
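
For concreteness, the interval above amounts to the following arithmetic. The sketch takes the corrected estimate, the diagonal entry of $P_{X^{T}}$, the quantile $Q_{1 - α ∕ 2} [W_{j}]$ and C_j as given inputs; how each is computed is described earlier in the paper and is not reproduced here, and the function name is illustrative.

```python
def confidence_interval(beta_corr_j, P_jj, Q_upper, C_j):
    """Interval centered at beta_corr_j / P_jj with half-width (Q_upper + C_j) / P_jj.

    Q_upper stands for the (1 - alpha/2) quantile of W_j; P_jj is assumed positive.
    """
    if P_jj <= 0.0:
        raise ValueError("P_jj must be positive")
    center = beta_corr_j / P_jj
    half_width = (Q_upper + C_j) / P_jj
    return center - half_width, center + half_width
```

The constant C_j widens the interval beyond the pure quantile term, which is why a larger C_j can only increase coverage.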

Figure 3: Confidence interval coverage for $β_{j}^{*}$ , j = 1,…,p; target coverage is 95% (with 1000 simulations, the standard deviation is ~0.69%). Here, lmm () refers to our method; ridge () to the method of Bühlmann (2013); and lasso () to the method of van de Geer et al. (2014).

Overall, our method, which accounts for random effects, performs best at attaining the target guaranteed coverage across all $β_{j}^{*}$ , compared to the methods proposed in Bühlmann (2013) and van de Geer et al. (2014). While the method of Bühlmann (2013) does come close, coverage falls short at 16 indices: the minimum coverage achieved was 92.9% (with 1000 simulations, this is a statistically significant difference from 95%). At first glance it appears that the lasso-based method from van de Geer et al. (2014) performs quite well; however, a closer examination of the results reveals otherwise. Specifically, the lasso-based method does very poorly over some of the active coefficients, as made evident in Table 1.

Table 1:

Confidence interval coverage for signals $β_{j}^{*}$ , j = 1,…, 5; target coverage is 95%.

	Our method	Bühlmann (2013)	van de Geer et al. (2014)
$β_{1}^{*}$	0.977	0.974	0.994
$β_{2}^{*}$	0.973	0.963	0.865
$β_{3}^{*}$	0.969	0.971	0.782
$β_{4}^{*}$	0.971	0.972	0.886
$β_{5}^{*}$	0.983	0.993	1.000
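
The quoted standard deviation of roughly 0.69% and the significance of the 92.9% minimum coverage can be checked with a binomial standard-error calculation. The sketch below is only a sanity check under the assumption that the 1000 simulation runs are independent; the dictionary keys are labels invented for this example, while the numbers are copied from the text and from the last column of Table 1.

```python
import math

def coverage_z_scores(coverage, target=0.95, n_sims=1000):
    """Binomial standard deviation of an empirical coverage and z-scores against target."""
    sd = math.sqrt(target * (1.0 - target) / n_sims)
    z = {name: (value - target) / sd for name, value in coverage.items()}
    return sd, z

# Values copied from the text and from the last column of Table 1.
observed = {
    "ridge_minimum": 0.929,
    "lasso_beta2": 0.865,
    "lasso_beta3": 0.782,
    "lasso_beta4": 0.886,
}
sd, z = coverage_z_scores(observed)
print(f"sd = {sd:.4f}")
for name, score in z.items():
    print(f"{name}: z = {score:.1f}")
```

A shortfall of about three standard deviations, as for the ridge minimum, is modest but detectable, whereas the lasso coverages for the strong signals lie far outside the range explainable by simulation noise.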


5. An application to riboflavin production data

In this section, we apply our proposed methodology to data on riboflavin (vitamin B₂) production by Bacillus subtilis. The data were made publicly available by Bühlmann et al. (2014); the original data were provided by DSM (Switzerland). The dataset, referenced as riboflavinGrouped, has M = 28 specimens measured at two to six time points, resulting in N = 111 observations in total. For each specimen at each time point, we record a single real-valued response variable, the log-transformed riboflavin production rate, as well as the expression levels of p = 4088 genes. We are interested in identifying which genes are significantly associated with riboflavin production.

To account for correlations induced by repeated measurements, a natural model to consider is the random intercept model, in which we assume that

y_{m} = X_{m} β^{*} + v_{m} + ϵ_{m},

(5.1)

with v_m, m = 1,…, M i.i.d. with v_m ~ N(0, τ*²), and ϵ_m, m = 1,…, M, independent with ϵ_m ~ N(0,σ*²I_{n_m×n_m}), and generated independently of v₁,…, v_M. Note that (5.1) can be represented by (2.1) with the Z_m’s taken to be column vectors of 1s of lengths n_m. Most of the theoretical results assume the n_m’s are equal, but it is straightforward to show the results hold so long as the n_m are of the same order of magnitude, as they are here.

We apply our proposed framework and compute the marginal p-values for testing $β_{j}^{*} = 0$ . Controlling the family-wise error rate (FWER) at 5%, via a simple Bonferroni correction, we find a single significant gene in riboflavin production: YXLD-at. This result matches previous findings by Javanmard and Montanari (2014) and Meinshausen et al. (2009) using a homogeneous dataset with N = 71 samples provided by the same source (riboflavin in Bühlmann et al. (2014)). Like us, Meinshausen et al. (2009) make a single discovery, YXLD-at, while Javanmard and Montanari (2014) also label YXLE-at as significant. The method of Bühlmann (2013), on the other hand, makes no discoveries.
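
The Bonferroni step is a one-line threshold: with p tests and FWER level α, a hypothesis is rejected when its marginal p-value is at most α/p (about 1.2 × 10⁻⁵ for p = 4088 and α = 0.05). A minimal sketch follows; the function name and the name-to-p-value mapping are assumptions of this example, and no actual p-values from the riboflavin analysis are reproduced.

```python
def bonferroni_discoveries(pvalues, alpha=0.05):
    """Return the names whose marginal p-value is at most alpha / (number of tests).

    pvalues: mapping from a variable name to its marginal p-value (hypothetical layout).
    """
    threshold = alpha / len(pvalues)
    return [name for name, pv in pvalues.items() if pv <= threshold]
```

Bonferroni is conservative when the test statistics are strongly dependent, as gene expression levels often are, which is one reason a resampling-based adjustment can be preferable.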

6. Discussion

We presented a new framework for constructing asymptotically valid p-values and confidence intervals for the fixed effects in high-dimensional linear mixed effect models. It entails de-biasing a ‘naive’ ridge estimator, whose asymptotic distribution we can approximate sufficiently well if the number of independent groups of observations M scales at least with d²q log p. Simulation studies in high-dimensional settings suggest that our method provides good control of type I error. It also provides good results for a riboflavin dataset with group structure, where we confirmed results obtained in earlier work based on a homogeneous dataset from the same source (Javanmard and Montanari, 2014; Meinshausen et al., 2009).

Several extensions to our methodology would be of interest for future work. First, our proposal for selecting the tuning parameter λ_L relies on the assumption that τ*²/σ*² is not too small. Although it appears to work well in practice, one could also consider an iterative scheme that repeatedly updates λ_L based on the resultant estimates of σ*² and τ*²: this can be readily implemented in practice but may be difficult to validate theoretically. Second, here we required the number of random effects q to be quite small (treated as constant in the theory). This assumption can be relaxed by, e.g., taking Ψ* to be a general diagonal matrix, i.e., $Ψ^{*} = diag (τ_{1}^{* 2}, \dots, τ_{q}^{* 2})$ , and assuming that only a small number of the $τ_{j}^{* 2}$ are nonzero, i.e., the cardinality of $T \equiv \{ j : τ_{j}^{* 2} \neq 0 \}$ is small, less than n. Then, instead of screening for fixed effects in Step 1, we can screen for both fixed and random effects by incorporating a double penalization scheme as in Li et al. (2018). This way, in Step 2, both $∣ \hat{S} ∣$ and $∣ \hat{T} ∣$ are small, and we can apply Henderson’s Method III as before.

A few other details should also be discussed for completeness. First, multiple testing can be handled using the Westfall-Young procedure of Bühlmann (2013). This multiple testing adjustment, which strongly controls the family-wise error rate, can directly be used in conjunction with our method for generating p-values for the individual hypothesis tests. Second, the ridge-based framework of Bühlmann (2013), which is a basis for our method, is known to not have optimal power. Bühlmann (2013) shows that the detection rate may be larger than N^−1/2, whereas, under certain conditions, the detection limit for the de-biased lasso approach of Zhang and Zhang (2014) is in the N^−1/2 range. A possible extension of our work is to build a lasso-based inferential framework for high-dimensional linear mixed effect models. In fact, as suggested in the Introduction, our methods can be adapted to other high-dimensional estimators; and ridge is just an example. From van de Geer et al. (2014), we can obtain asymptotically optimal inference for linear fixed effect models—i.e., for y = Xβ* + ϵ with N observations and ϵ_i i.i.d N(0, σ*²)—by leveraging the fact that the Lasso estimator with non-negative penalty parameter λ, $\hat{β} (λ)$ , can be rewritten as

\hat{β} (λ) - β^{*} + λ \hat{Θ} \hat{ı} = \hat{Θ} X^{T} ϵ ∕ N - Δ ∕ \sqrt{N}, where Δ ≔ \sqrt{N} (\hat{Θ} \hat{Σ} - I_{p \times p}) (\hat{β} (λ) - β^{*})

by inverting the KKT conditions, with $\hat{ı}$ arising from the subdifferential of ∥β∥₁. Taking $\hat{Θ}$ to be a reasonably good approximation of an inverse of $\hat{Σ}$ , the Δ term becomes asymptotically negligible, and we can use the normality of ϵ to develop asymptotically valid tests and confidence intervals for β*. (The scaled lasso furnishes a consistent estimator of σ*².) Extending this approach to the linear mixed-effect setup (per Section 2) requires meeting the challenge that the ϵ_i are no longer i.i.d., which could be addressed using the methods of Section 3.2.

A. Appendix

A.1. Proof of Results in Section 3

A.1.1. Proof of Lemma 1

It is straightforward to show that Ω* can be lower bounded as

Ω^{*} \geq c (\hat{Σ} + λ I_{p \times p})^{- 1} \hat{Σ} (\hat{Σ} + λ I_{p \times p})^{- 1} \equiv {\tilde{Ω}}^{*},

for some c satisfying 0 < c < ν_min (V(σ*², τ*²)). Since σ*² is positive, ν_min (V(σ*², τ*²)) > 0. Note that ${\tilde{Ω}}^{*}$ can alternatively be written as

{\tilde{Ω}}^{*} = Γ diag (\frac{s_{1}^{2}}{(s_{1}^{2} + λ)^{2}}, \dots, \frac{s_{N}^{2}}{(s_{N}^{2} + λ)^{2}}) Γ^{T},

which, in turn, implies that

{\tilde{ω}}_{min}^{*} = min_{j \in {1, \dots, p}} \sum_{k = 1}^{N} \frac{s_{k}^{2}}{(s_{k}^{2} + λ)^{2}} Γ_{j k}^{2},

and the claim follows.

A.1.2. Proof of Proposition 2

This was proven in Shao and Deng (2012) (see the proof of their Theorem 1). Define Γ′ = [Γ Γ_⊥], so that Γ′ is orthogonal, i.e., Γ′^TΓ′ = Γ′Γ′^T = I_p×p . By definition (3.2), we have

E [\hat{β}] - θ^{*} = \frac{1}{N} (\hat{Σ} + λ I_{p \times p})^{- 1} X^{T} X θ^{*} - θ^{*} = - (λ^{- 1} N^{- 1} X^{T} X + I_{p \times p})^{- 1} θ^{*} = - Γ^{'} (λ^{- 1} N^{- 1} Γ^{' T} X^{T} X Γ^{'} + I_{p \times p})^{- 1} Γ^{' T} Γ Γ^{T} θ^{*} = - Γ (λ^{- 1} N^{- 1} D^{2} + I_{R \times R})^{- 1} Γ^{T} θ^{*} .

Observing that the diagonal entries to D are positive, one obtains

(λ^{- 1} N^{- 1} D^{2} + I_{R \times R})^{- 1} \preceq \frac{λ ∕ ν_{min, +} (\hat{Σ})}{1 + λ ∕ ν_{min, +} (\hat{Σ})} I_{R \times R},

(A.1)

Combining this with the fact that Γ^TΓ = I_R×R, we obtain

\max_{j \in {1, \dots, p}} ∣ E [{\hat{β}}_{j}] - θ_{j}^{*} ∣ \leq λ ‖ θ^{*} ‖_{2} ν_{min, +} (\hat{Σ})^{- 1},

as desired. The bound on the variance follows directly from (3.3).

A.2. Proof of Theorems 4, 8 and 9

We first establish Theorem 4, which follows directly from Proposition 2.

A.2.1. Proof of Theorem 4

It follows from Proposition 2 that

\max_{j} κ_{j} ∣ δ_{j} ∣ = \max_{j} κ_{j} ∣ E [{\hat{β}}_{j}] - θ_{j}^{*} ∣ \leq \frac{λ ‖ θ^{*} ‖_{2} ν_{min, +} {(\hat{Σ})}^{- 1}}{N^{- 1 ∕ 2} ω_{j j}^{* 1 ∕ 2}} \leq \frac{λ ‖ θ^{*} ‖_{2} ν_{min, +} {(\hat{Σ})}^{- 1}}{N^{- 1 ∕ 2} ω_{min}^{* 1 ∕ 2}},

which, due to our choice of the ridge penalty parameter λ > 0 in (3.12), is o(1) as N, p → ∞. The claim now follows from (3.10) and the assumption given by (3.13).

Because there is an overlap in the lemmas used to prove Theorems 8 and 9, we present them together. Define u* = ∥X^T(y – Xβ*)∥_∞/N.

Lemma 11. Let

λ_{L} = \frac{(ξ + 1)}{(ξ - 1)} \sqrt{\frac{2 (σ^{* 2} + τ^{* 2} q n) (\log (p) - \log (ε ∕ 2))}{N}} .

(A.2)

Under the model given by (2.3), the event u* ≤ λ_L(ξ – 1)/(ξ + 1) occurs with probability greater than 1 – ε.

Proof. Define $u_{j} = x_{j}^{T} (y - X β^{*}) ∕ N$ . Then u* = max_j∣u_j∣. Under model (2.3), we observe that,

u_{j} \sim N (0, N^{- 2} x_{j}^{T} V (σ^{* 2}, τ^{* 2}) x_{j}) .

It follows from the Gaussianity of u_j (in fact, sub-Gaussianity would suffice) that

P [∣ u_{j} ∣ > λ_{L} (ξ - 1) ∕ (ξ + 1)] \leq 2 exp {- \frac{N^{2} λ_{L}^{2} (ξ - 1)^{2} ∕ (ξ + 1)^{2}}{2 x_{j}^{T} V (σ^{* 2}, τ^{* 2}) x_{j}}} \leq 2 exp {- \frac{N λ_{L}^{2} (ξ - 1)^{2} ∕ (ξ + 1)^{2}}{2 ν_{\max} (V (σ^{* 2}, τ^{* 2}))}} \leq \frac{ε}{p} .

(A.3)

The second inequality follows from the fact that the columns of X are standardized such that $‖ x_{j} ‖_{2}^{2} = N \forall j$ . For the third inequality, recall that the columns of Z are standardized such that $‖ z_{j} ‖_{2}^{2} = n \forall j$ , which implies that the largest eigenvalue of V(σ*², τ*²), the true covariance of y, satisfies ν_max(V(σ*, τ*)) ≤ σ*² + τ*²qn. The third inequality in (A.3) is obtained by plugging in our choice of λ_L (3.28). Employing a union bound, we then have

P [u^{*} \leq λ_{L} (ξ - 1) ∕ (ξ + 1)] \geq 1 - \sum_{j = 1}^{p} P [∣ u_{j} ∣ > λ_{L} (ξ - 1) ∕ (ξ + 1)] \geq 1 - ε,

which is our desired result.
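
The role of the constants in (A.2) can be checked numerically. Since u_j is a linear form in y − Xβ* scaled by 1/N and $‖ x_{j} ‖_{2}^{2} = N$, its variance is at most ν_max(V)/N ≤ (σ*² + τ*²qn)/N; plugging λ_L into the Gaussian tail bound then gives exactly ε/p per coordinate. The sketch below, with hypothetical function names, evaluates both quantities under that worst-case variance.

```python
import math

def lasso_penalty(xi, sigma2, tau2, q, n, p, eps, N):
    """lambda_L from (A.2); requires xi > 1 and 0 < eps < 1."""
    scale = (xi + 1.0) / (xi - 1.0)
    noise = sigma2 + tau2 * q * n  # upper bound on nu_max(V)
    return scale * math.sqrt(2.0 * noise * (math.log(p) - math.log(eps / 2.0)) / N)

def per_coordinate_tail_bound(lam_L, xi, sigma2, tau2, q, n, N):
    """Gaussian tail bound 2 exp(-t^2 / (2 v)) at t = lam_L (xi - 1) / (xi + 1).

    v = (sigma2 + tau2 * q * n) / N is the worst-case variance of u_j.
    """
    t = lam_L * (xi - 1.0) / (xi + 1.0)
    var_bound = (sigma2 + tau2 * q * n) / N
    return 2.0 * math.exp(-t ** 2 / (2.0 * var_bound))
```

Multiplying the per-coordinate bound by p recovers the total failure probability ε used in the union bound.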

A.2.2. Proof of Lemma 6

We use arguments similar to those employed in the proof of Theorem 3 in Ye and Zhang (2010). Suppose that u* ≤ λ_L. Define $h = {\hat{β}}^{L} - β^{*}$ . The Karush-Kuhn-Tucker (KKT) optimality conditions for the Lasso are given by

\begin{cases} \frac{x_{j}^{T} (y - X {\hat{β}}^{L})}{N} = λ_{L} sign ({\hat{β}}_{j}^{L}), & {\hat{β}}_{j}^{L} \neq 0, \\ \frac{x_{j}^{T} (y - X {\hat{β}}^{L})}{N} \in λ_{L} [- 1, + 1], & {\hat{β}}_{j}^{L} = 0 . \end{cases}

With some rearrangement, the KKT conditions can be rewritten as

\frac{X^{T} (y - X β^{*})}{N} - \hat{Σ} h = λ_{L} \hat{ı}

(A.4)

with $\hat{ı} \in R^{p}$ , where ${\hat{ı}}_{j} = sign ({\hat{β}}_{j}^{L})$ if $j \in \hat{S}$ and ${\hat{ı}}_{j} \in [- 1, + 1]$ otherwise; that is, $\hat{ı}$ is an element of the subdifferential of ∥β∥₁. Rearranging (A.4) and observing that $sign ({\hat{β}}_{j}^{L}) = sign (h_{j})$ for j ∉ S yields

h^{' T} \hat{Σ} h \leq (u^{*} + λ_{L}) ‖ h_{S}^{'} ‖_{1} + (u^{*} - λ_{L}) ‖ h_{S^{c}}^{'} ‖_{1}

for all vectors h′ with $sign (h_{S^{c}}^{'}) = sign (h_{S^{c}})$ . If we take h′ = h, one can see that $h \in C (ξ, S)$ :

0 \leq h^{T} \hat{Σ} h \leq (u^{*} + λ_{L}) ‖ h_{S}^{'} ‖_{1} + (u^{*} - λ_{L}) ‖ h_{S^{c}}^{'} ‖_{1} \Rightarrow ‖ h_{S^{c}}^{'} ‖_{1} \leq \frac{(u^{*} + λ_{L})}{(λ_{L} - u^{*})} ‖ h_{S}^{'} ‖_{1} \leq ξ ‖ h_{S}^{'} ‖_{1} .

On the other hand, setting h′ to be any vector so that for some j ∈ S^c, $h_{j}^{'} = h_{j}$ and 0 elsewhere gives

h_{j} {\hat{Σ}}_{j, \cdot} h \leq (u^{*} - λ_{L}) ∣ h_{j} ∣ \leq 0,

which implies that $h \in C_{-} (ξ, S)$ . The KKT conditions (A.4) also tell us that

‖ \hat{Σ} h ‖_{\infty} \leq u^{*} + λ_{L},

which, when combined with the definition of ζ (3.26) yields

‖ h ‖_{\infty} \leq \frac{u^{*} + λ_{L}}{ζ},

which is the desired result. In the event that u* ≤ λ_L(ξ – 1)/(ξ + 1), we have

‖ h ‖_{\infty} \leq \frac{2 ξ λ_{L}}{ζ (ξ + 1)} .

A.2.3. Proof of Lemma 7

The proof is adapted from that of Theorem 3 in Sun and Zhang (2012). By construction, ${\hat{β}}^{L}$ satisfies the KKT conditions from (A.4) which implies that

\frac{∣ x_{j}^{T} X ({\hat{β}}^{L} - β^{*}) ∣}{N} = \frac{∣ x_{j}^{T} (y - X {\hat{β}}^{L} - ϵ) ∣}{N} \geq \frac{∣ x_{j}^{T} (y - X {\hat{β}}^{L}) ∣}{N} - \frac{∣ x_{j}^{T} ϵ ∣}{N} \geq λ_{L} - u^{*} .

For $A \subseteq \hat{S} ∖ S$ , such that $∣ A ∣ \leq N^{'}$ , the previous inequality implies

(λ_{L} - u^{*})^{2} ∣ A ∣ \leq \frac{\sum_{j \in A} ∣ x_{j}^{T} X ({\hat{β}}^{L} - β^{*}) ∣^{2}}{N^{2}} = \frac{\sum_{j \in A} (X h)^{T} x_{j} x_{j}^{T} (X h)}{N^{2}} \leq \frac{κ_{+} (N^{'}, S) ‖ X h ‖_{2}^{2}}{N} .

(A.5)

Going back to the KKT conditions in (A.4), we have, for arbitrary $h^{'} \in R^{p}$ ,

\frac{(X {\hat{β}}^{L} - X h^{'})^{T} X h}{N} \leq λ_{L} (‖ h^{'} ‖_{1} - ‖ {\hat{β}}^{L} ‖_{1}) + u^{*} ‖ h^{'} - {\hat{β}}^{L} ‖_{1},

which, when combined with the fact that

2 (X {\hat{β}}^{L} - X h^{'})^{T} X h = ‖ X {\hat{β}}^{L} - X h^{'} ‖_{2}^{2} + ‖ X h ‖_{2}^{2} - ‖ X β^{*} - X h^{'} ‖_{2}^{2}

gives the inequality

\frac{‖ X h ‖_{2}^{2}}{N} \leq λ_{L} (‖ β^{*} ‖_{1} - ‖ {\hat{β}}^{L} ‖_{1}) + u^{*} ‖ h ‖_{1} \leq (λ_{L} + u^{*}) ‖ h_{S} ‖_{1} .

(A.6)

Thus, h lies in the cone in (3.25) in the event that u* ≤ λ_L(ξ – 1)/(ξ + 1) (by noting that the left-hand side is lower bounded by 0). By definition of κ(ξ, S) from (3.30),

\frac{‖ X h ‖_{2}^{2}}{N} \leq \frac{(λ_{L} + u^{*})^{2} d}{κ^{2} (ξ, S)},

which, when combined with (A.5) implies

∣ A ∣ \leq \frac{κ_{+} (N^{'}, S) ξ^{2} d}{κ^{2} (ξ, S)} < N^{'},

by Assumption 2.

A.2.4. Proof of Theorem 8

Suppose that u* ≤ λ_L(ξ − 1)/(ξ + 1). Then by Lemmas 6 and 7 and the referenced assumptions within, we have

‖ {\hat{β}}^{L} - β^{*} ‖_{\infty} \leq \frac{2 ξ λ_{L}}{(ξ + 1) ζ} \Rightarrow ∣ β_{j}^{*} ∣ \leq \frac{4 ξ λ_{L}}{(ξ + 1) ζ} for all j \in S ∖ \hat{S},

(A.7)

∣ \hat{S} ∖ S ∣ \leq N^{'} \Rightarrow ∣ \hat{S} ∣ \leq N^{'} + d ≲ d .

(A.8)

Denote $\hat{X} = [X_{\hat{S}} Z]$ . Under candidate model (3.18), our variance component estimators (via Henderson’s Method III) are given by

{\hat{σ}}^{2} = \frac{y^{T} (I_{N \times N} - P_{\hat{X}}) y}{N - rank (\hat{X})}, {\hat{τ}}^{2} = \frac{y^{T} (P_{\hat{X}} - P_{X_{\hat{S}}}) y - {\hat{σ}}^{2} [rank (\hat{X}) - rank (X_{\hat{S}})]}{tr [Z^{T} (I_{N \times N} - P_{X_{\hat{S}}}) Z]} .

See (3.23) and (3.24).

Consider the more interesting scenario where $∣ S ∖ \hat{S} ∣ > 0$ . If S is contained within $\hat{S}$ , then it is straightforward to show that variance component estimators are consistent (the true model is a submodel of the proposed one). We first prove, under the given assumptions, that $∣ {\hat{σ}}^{2} - σ^{* 2} ∣ = o_{P} (1)$ . Write $S_{O} = S ∖ \hat{S}$ , ‘O’ for omitted. Then,

{\hat{σ}}^{2} = \frac{(\hat{X} β_{\hat{S}}^{*} + X_{S_{O}} β_{S_{O}}^{*} + ϵ)^{T} (I_{N \times N} - P_{\hat{X}}) (\hat{X} β_{\hat{S}}^{*} + X_{S_{O}} β_{S_{O}}^{*} + ϵ)}{N - rank (\hat{X})} = \frac{β_{S_{O}}^{* T} X_{S_{O}}^{T} (I_{N \times N} - P_{\hat{X}}) X_{S_{O}} β_{S_{O}}^{*}}{N - rank (\hat{X})} + \frac{2 β_{S_{O}}^{* T} X_{S_{O}}^{T} (I_{N \times N} - P_{\hat{X}}) ϵ}{N - rank (\hat{X})} + \frac{ϵ^{T} (I_{N \times N} - P_{\hat{X}}) ϵ}{N - rank (\hat{X})} .

(A.9)

We proceed to show that the three parts to (A.9) satisfy

∣ \frac{ϵ^{T} (I_{N \times N} - P_{\hat{X}}) ϵ}{N - rank (\hat{X})} - σ^{* 2} ∣ = o_{P} (1),

(A.10)

∣ \frac{2 β_{S_{O}}^{* T} X_{S_{O}}^{T} (I_{N \times N} - P_{\hat{X}}) ϵ}{N - rank (\hat{X})} ∣ = o_{P} (1),

(A.11)

and \frac{β_{S_{O}}^{* T} X_{S_{O}}^{T} (I_{N \times N} - P_{\hat{X}}) X_{S_{O}} β_{S_{O}}^{*}}{N - rank (\hat{X})} = o (1),

(A.12)

which would suggest that ${\hat{σ}}^{2}$ is indeed consistent for σ*². Note that the term in (A.12) is $Bias ({\hat{σ}}^{2}) = E [{\hat{σ}}^{2}] - σ^{* 2}$ .

Proving (A.11): Let $Γ_{{\hat{X}}_{⊥}} D_{{\hat{X}}_{⊥}} Γ_{{\hat{X}}_{⊥}}^{T}$ represent the eigendecomposition of $I_{N \times N} - P_{\hat{X}}$ . We note that the latter is idempotent, implying that the diagonal matrix $D_{{\hat{X}}_{⊥}}$ , which is of rank $N - rank (\hat{X})$ , has only 0s and 1s on its diagonal. It is straightforward to show that

E [\frac{2 β_{S_{O}}^{* T} X_{S_{O}}^{T} (I_{N \times N} - P_{\hat{X}}) ϵ}{N - rank (\hat{X})}] = 0 and Var [\frac{2 β_{S_{O}}^{* T} X_{S_{O}}^{T} (I_{N \times N} - P_{\hat{X}}) ϵ}{N - rank (\hat{X})}] = \frac{4 σ^{* 2} β_{S_{O}}^{* T} X_{S_{O}}^{T} Γ_{{\hat{X}}_{⊥}} D_{{\hat{X}}_{⊥}} Γ_{{\hat{X}}_{⊥}}^{T} X_{S_{O}} β_{S_{O}}^{*}}{[N - rank (\hat{X})]^{2}} \leq \frac{4 σ^{* 2} ‖ Γ_{{\hat{X}}_{⊥}}^{T} X_{S_{O}} β_{S_{O}}^{*} ‖_{\infty}^{2}}{N - rank (\hat{X})} ≲ \frac{4 σ^{* 2} ‖ Γ_{{\hat{X}}_{⊥}}^{T} X_{S_{O}} ‖_{\infty}^{2}}{N - rank (\hat{X})} \times \frac{d^{2} q \log p}{M} = o (1) .

Statement (A.11) then follows from Chebyshev’s inequality.

Proving (A.10): Let $χ_{i}^{2} (1)$ , $i = 1, \dots, N - rank (\hat{X})$ , be i.i.d random variables following a χ² distribution with 1 degree of freedom. Observe that,
$\frac{ϵ^{T} [I_{N \times N} - P_{\hat{X}}] ϵ}{N - rank (\hat{X})} =_{d} \frac{ϵ^{T} D_{{\hat{X}}_{⊥}} ϵ}{N - rank (\hat{X})} =_{d} \frac{σ^{* 2} \sum_{i = 1}^{N - rank (\hat{X})} χ_{i}^{2} (1)}{N - rank (\hat{X})}$ (A.13)

and
$∣ \frac{σ^{* 2} \sum_{i = 1}^{N - rank (\hat{X})} χ_{i}^{2} (1)}{N - rank (\hat{X})} - σ^{* 2} ∣ = o_{P} (1) .$ (A.14)

Here, (A.14) follows from Lemma 7 and the fact that N ≳ N′ + d + qM, which implies that $N - rank (\hat{X}) \to \infty$ as M → ∞. Applying the Strong Law of Large Numbers (SLLN) for i.i.d. random variables, we arrive at (A.10).
Proving (A.12): We observe that
$∣ Bias ({\hat{σ}}^{2}) ∣ \leq ‖ Γ_{{\hat{X}}_{⊥}}^{T} X_{S_{O}} ‖_{\infty}^{2} \times ‖ β_{S_{O}}^{*} ‖_{\infty}^{2} \times d^{2} ≲ \frac{d^{2} q \log (p)}{M} = o (1),$

and we have completed our proof that ${\hat{σ}}^{2}$ is consistent under the stated assumptions.

We now demonstrate that the same claim holds for ${\hat{τ}}^{2}$ . Expanding out y, we obtain, after some algebraic manipulation,

{\hat{τ}}^{2} = \frac{β_{S_{O}}^{* T} X_{S_{O}}^{T} (P_{\hat{X}} - P_{X_{\hat{S}}}) X_{S_{O}} β_{S_{O}}^{*}}{tr (Z^{' T} Z^{'})} + \frac{2 β_{S_{O}}^{* T} X_{S_{O}}^{T} (P_{\hat{X}} - P_{X_{\hat{S}}}) ϵ}{tr (Z^{' T} Z^{'})} + \frac{ϵ^{T} (P_{\hat{X}} - P_{X_{\hat{S}}}) ϵ}{tr (Z^{' T} Z^{'})} + \frac{2 β_{S_{O}}^{* T} X_{S_{O}}^{T} (I_{N \times N} - P_{X_{\hat{S}}}) Z v}{tr (Z^{' T} Z^{'})} + \frac{v^{T} Z^{' T} Z^{'} v}{tr (Z^{' T} Z^{'})} + \frac{2 v^{T} Z^{T} (I_{N \times N} - P_{X_{\hat{S}}}) ϵ}{tr (Z^{' T} Z^{'})} - \frac{σ^{* 2} [rank (\hat{X}) - rank (X_{\hat{S}})]}{tr (Z^{' T} Z^{'})} - \frac{Bias ({\hat{σ}}^{2}) [rank (\hat{X}) - rank (X_{\hat{S}})]}{tr (Z^{' T} Z^{'})},

(A.15)

where we have defined $Z^{'} = (I_{N \times N} - P_{X_{\hat{S}}}) Z$ . We set out to prove that the terms in (A.15) satisfy

∣ \frac{2 β_{S_{O}}^{* T} X_{S_{O}}^{T} (P_{\hat{X}} - P_{X_{\hat{S}}}) ϵ}{tr (Z^{' T} Z^{'})} ∣ = o_{P} (1),

(A.16)

∣ \frac{ϵ^{T} (P_{\hat{X}} - P_{X_{\hat{S}}}) ϵ}{tr (Z^{' T} Z^{'})} - \frac{σ^{* 2} [rank (\hat{X}) - rank (X_{\hat{S}})]}{tr [Z^{' T} Z^{'}]} ∣ = o_{P} (1),

(A.17)

∣ \frac{2 β_{S_{O}}^{* T} X_{S_{O}}^{T} (I_{N \times N} - P_{X_{\hat{S}}}) Z v}{tr (Z^{' T} Z^{'})} ∣ = o_{P} (1),

(A.18)

∣ \frac{v^{T} Z^{T} (I_{N \times N} - P_{X_{\hat{S}}}) Z v}{tr (Z^{' T} Z^{'})} - τ^{* 2} ∣ = o_{P} (1),

(A.19)

∣ \frac{2 v^{T} Z^{T} (I_{N \times N} - P_{X_{\hat{S}}}) ϵ}{tr (Z^{' T} Z^{'})} ∣ = o_{P} (1),

(A.20)

and

\frac{β_{S_{O}}^{* T} X_{S_{O}}^{T} (P_{\hat{X}} - P_{X_{\hat{S}}}) X_{S_{O}} β_{S_{O}}^{*}}{tr (Z^{' T} Z^{'})} - \frac{Bias ({\hat{σ}}^{2}) [rank (\hat{X}) - rank (X_{\hat{S}})]}{tr (Z^{' T} Z^{'})} = o (1) .

(A.21)

Let $Q_{Z^{'}} D_{Z^{'}} Γ_{Z^{'}}^{T}$ represent the singular value decomposition of Z′, with Q_Z′ and Γ_Z′ of dimensions N × qM and qM × qM, respectively. Additionally, write the eigendecompositions of $P_{\hat{X}} - P_{X_{\hat{S}}}$ and $I_{N \times N} - P_{X_{\hat{S}}}$ as $Γ_{\hat{X} ⊥ X_{\hat{S}}} D_{\hat{X} ⊥ X_{\hat{S}}} Γ_{\hat{X} ⊥ X_{\hat{S}}}^{T}$ and $Γ_{X_{\hat{S}} ⊥} D_{X_{\hat{S}} ⊥} Γ_{X_{\hat{S}} ⊥}^{T}$ , respectively. Note that (A.19) and (A.21) make up $Bias [{\hat{τ}}^{2}] = E [{\hat{τ}}^{2}] - τ^{* 2}$ .

To avoid repetition, some of the proofs are presented in abbreviated form.

Proving (A.16): Clearly,

E [\frac{2 β_{S_{O}}^{* T} X_{S_{O}}^{T} (P_{\hat{X}} - P_{X_{\hat{S}}}) ϵ}{tr (Z^{' T} Z^{'})}] = 0 and Var [\frac{2 β_{S_{O}}^{* T} X_{S_{O}}^{T} (P_{\hat{X}} - P_{X_{\hat{S}}}) ϵ}{tr (Z^{' T} Z^{'})}] = \frac{4 β_{S_{O}}^{* T} X_{S_{O}}^{T} (P_{\hat{X}} - P_{X_{\hat{S}}}) X_{S_{O}} β_{S_{O}}^{*}}{tr [Z^{' T} Z^{'}]^{2}} = \frac{4 β_{S_{O}}^{* T} X_{S_{O}}^{T} Γ_{\hat{X} ⊥ X_{\hat{S}}} D_{\hat{X} ⊥ X_{\hat{S}}} Γ_{\hat{X} ⊥ X_{\hat{S}}}^{T} X_{S_{O}} β_{S_{O}}^{*}}{tr [Z^{' T} Z^{'}]^{2}} \leq \frac{4 \times d^{2} \times q M \times ‖ Γ_{\hat{X} ⊥ X_{\hat{S}}} X_{S_{O}} ‖_{\infty} \times ‖ β_{S_{O}}^{*} ‖_{\infty}^{2}}{tr [Z^{' T} Z^{'}]^{2}} = o (1),

following from (3.32) in Assumption 3 and Assumption 4.

Proving (A.17): Orthogonality of $Γ_{\hat{X} ⊥ X_{\hat{S}}}$ implies that $ϵ^{T} (P_{\hat{X}} - P_{X_{\hat{S}}}) ϵ =_{d} ϵ^{T} D_{\hat{X} ⊥ X_{\hat{S}}} ϵ$ , so
$E [\frac{ϵ^{T} (P_{\hat{X}} - P_{X_{\hat{S}}}) ϵ}{tr (Z^{' T} Z^{'})}] = E [\frac{ϵ^{T} D_{\hat{X} ⊥ X_{\hat{S}}} ϵ}{tr (Z^{' T} Z^{'})}] = \frac{σ^{* 2} [rank (\hat{X}) - rank (X_{\hat{S}})]}{tr [Z^{' T} Z^{'}]}$

and, using properties of quadratic forms, we have
$Var [\frac{ϵ^{T} (P_{\hat{X}} - P_{X_{\hat{S}}}) ϵ}{tr (Z^{' T} Z^{'})}] = Var [\frac{ϵ^{T} D_{\hat{X} ⊥ X_{\hat{S}}} ϵ}{tr (Z^{' T} Z^{'})}] = \frac{2 σ^{* 4} rank (D_{\hat{X} ⊥ X_{\hat{S}}})}{tr (Z^{' T} Z^{'})^{2}} = \frac{2 σ^{* 4} rank (Z^{'})}{tr (Z^{' T} Z^{'})^{2}} ≲ \frac{1}{M} = o (1),$
the latter relation following from (3.33) in Assumption 3. This proves (A.17).

Proving (A.18): Proof is similar to that of (A.16), as we note that

E [\frac{2 β_{S_{O}}^{* T} X_{S_{O}}^{T} (I_{N \times N} - P_{X_{\hat{S}}}) Z v}{tr (Z^{' T} Z^{'})}] = 0, Var [\frac{2 β_{S_{O}}^{* T} X_{S_{O}}^{T} (I_{N \times N} - P_{X_{\hat{S}}}) Z v}{tr (Z^{' T} Z^{'})}] = \frac{4 τ^{* 2} β_{S_{O}}^{* T} X_{S_{O}}^{T} Q_{Z^{'}} D_{Z^{'}}^{2} Q_{Z^{'}}^{T} X_{S_{O}} β_{S_{O}}^{*}}{tr (Z^{' T} Z^{'})^{2}} \leq \frac{4 τ^{* 2} tr (Z^{' T} Z) ‖ Q_{Z^{'}}^{T} X_{S_{O}} β_{S_{O}}^{*} ‖_{\infty}^{2}}{tr (Z^{' T} Z^{'})^{2}} = o (1),

having applied Assumptions 3 and 4 here.

Proving (A.19): As for (A.17), using again properties of quadratic forms, we can show that
$E [\frac{v^{T} Z^{T} (I_{N \times N} - P_{X_{\hat{S}}}) Z v}{tr (Z^{' T} Z^{'})}] = τ^{* 2}, Var [\frac{v^{T} Z^{T} (I_{N \times N} - P_{X_{\hat{S}}}) Z v}{tr (Z^{' T} Z^{'})}] = \frac{2 τ^{* 4} tr (D_{Z^{'}}^{4})}{tr (D_{Z^{'}}^{2})^{2}} = o (1),$
the last relation the result of (3.33) from Assumption 3.
Proving (A.20): We can rewrite
$\frac{2 v^{T} Z^{' T} ϵ}{tr (Z^{' T} Z^{'})} = \frac{2 v^{T} Γ_{Z^{'}} D_{Z^{'}} Q_{Z^{'}}^{T} ϵ}{tr (Z^{' T} Z^{'})} =_{d} \frac{2 σ^{*} τ^{*} \sum_{i = 1}^{q M} s_{i} B_{i}}{\sum_{i = 1}^{q M} s_{i}^{2}},$
where B_i, i = 1,…, rank(Z′), are random variables each formed as the product of two independent N(0, 1) random variables. Since the B_i are independent with mean zero and unit variance,
$Var (\sum_{i = 1}^{q M} s_{i} B_{i}) = \sum_{i = 1}^{q M} s_{i}^{2},$
which implies that
$Var (\frac{2 σ^{*} τ^{*} \sum_{i = 1}^{q M} s_{i} B_{i}}{\sum_{i = 1}^{q M} s_{i}^{2}}) = \frac{4 σ^{* 2} τ^{* 2}}{\sum_{i = 1}^{q M} s_{i}^{2}}$

and since $\sum_{i = 1}^{q M} s_{i}^{2} ≍ q M$ by (3.32) in Assumption 3, we have proven our claim (A.20).
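The variance identity for the weighted sum of the B_i can be illustrated with a short simulation. This is a sketch under assumed values: the vector `s` below is a hypothetical set of singular values, not taken from any data set in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical singular values s_1, ..., s_k (illustrative only).
s = rng.uniform(0.5, 2.0, size=10)

# Each B_i is the product of two independent N(0, 1) variables,
# so E[B_i] = 0 and Var(B_i) = 1.
n_draws = 200_000
B = rng.standard_normal((n_draws, s.size)) * rng.standard_normal((n_draws, s.size))

weighted_sum = B @ s
var_theory = np.sum(s**2)
print("empirical variance:", weighted_sum.var(), "theory:", var_theory)
```

After normalizing by the sum of the s_i², the variance is proportional to the reciprocal of that sum, which vanishes when the sum grows at the rate qM.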

Proving (A.21): By the definition of $Bias [{\hat{τ}}^{2}]$, it is clear that

∣ Bias ({\hat{τ}}^{2}) ∣ \leq \frac{β_{S_{O}}^{* T} X_{S_{O}}^{T} (P_{\hat{X}} - P_{X_{\hat{S}}}) X_{S_{O}} β_{S_{O}}^{*}}{tr (Z^{' T} Z^{'})} + \frac{∣ Bias ({\hat{σ}}^{2}) ∣ [rank (\hat{X}) - rank (X_{\hat{S}})]}{tr (Z^{' T} Z^{'})} \leq \frac{β_{S_{O}}^{* T} X_{S_{O}}^{T} P_{Z} X_{S_{O}} β_{S_{O}}^{*}}{tr (Z^{' T} Z^{'})} + \frac{q M \times ∣ Bias ({\hat{σ}}^{2}) ∣}{tr (Z^{' T} Z^{'})} \leq \frac{q M \times d \times ‖ Γ_{Z} X_{S_{O}} ‖_{\infty}^{2} \times ‖ β_{S_{O}}^{*} ‖_{\infty}^{2}}{tr (Z^{' T} Z^{'})} + \frac{q M \times ∣ Bias ({\hat{σ}}^{2}) ∣}{tr (Z^{' T} Z^{'})} ≲ \frac{d^{2} q \log (p)}{M} = o (1),

where the second-to-last relation follows from the proven claim that $Bias ({\hat{σ}}^{2})$ is o(1), together with (3.32) in Assumption 3 and Assumption 4.

Since the event u* ≤ λ_L(ξ − 1)/(ξ + 1) occurs with probability greater than 1 – 1/p → 1 as p → ∞,

∣ {\hat{σ}}^{2} - σ^{* 2} ∣ = o_{P} (1), ∣ {\hat{τ}}^{2} - τ^{* 2} ∣ = o_{P} (1)

as claimed.

A.2.5. Proof of Theorem 9

Suppose that u* ≤ λ_L(ξ − 1)/(ξ + 1), and write $S_{O} = S ∖ \hat{S}$ . The OLS fit ${\hat{β}}^{init}$ has a simple closed-form expression:

{\hat{β}}_{\hat{S}}^{init} = (X_{\hat{S}}^{T} X_{\hat{S}})^{- 1} X_{\hat{S}}^{T} y = β_{\hat{S}}^{*} + (X_{\hat{S}}^{T} X_{\hat{S}})^{- 1} X_{\hat{S}}^{T} X_{S_{O}} β_{S_{O}}^{*} + (X_{\hat{S}}^{T} X_{\hat{S}})^{- 1} X_{\hat{S}}^{T} (y - X β^{*}),

and ${\hat{β}}_{{\hat{S}}^{c}}^{init} = 0$ . Thus, by the triangle inequality,

‖ {\hat{β}}_{\hat{S}}^{init} - β_{\hat{S}}^{*} ‖_{1} \leq ‖ (X_{\hat{S}}^{T} X_{\hat{S}})^{- 1} X_{\hat{S}}^{T} X_{S_{O}} β_{S_{O}}^{*} ‖_{1} + ‖ (X_{\hat{S}}^{T} X_{\hat{S}})^{- 1} X_{\hat{S}}^{T} (y - X β^{*}) ‖_{1} .

(A.22)

We proceed by first bounding the first term on the right-hand side of (A.22). By Assumption 5,

‖ (X_{\hat{S}}^{T} X_{\hat{S}})^{- 1} X_{\hat{S}}^{T} ‖_{2} \leq \sqrt{ν_{\max} [(X_{\hat{S}}^{T} X_{\hat{S}})^{- 1} X_{\hat{S}}^{T} X_{\hat{S}} (X_{\hat{S}}^{T} X_{\hat{S}})^{- 1}]} = \sqrt{ν_{\max} [(X_{\hat{S}}^{T} X_{\hat{S}})^{- 1}]} = \sqrt{\frac{1}{N} ν_{\max} [(X_{\hat{S}}^{T} X_{\hat{S}} ∕ N)^{- 1}]} = \sqrt{\frac{1}{N} ∕ ν_{min} (X_{\hat{S}}^{T} X_{\hat{S}} ∕ N)} \leq \frac{1}{\sqrt{N ψ_{-} (N^{'}, S)}} ≲ \frac{1}{\sqrt{N}} .

This, in turn, implies that

‖ (X_{\hat{S}}^{T} X_{\hat{S}})^{- 1} X_{\hat{S}}^{T} X_{S_{O}} β_{S_{O}}^{*} ‖_{1} \leq \sqrt{N^{'} + d} ‖ (X_{\hat{S}}^{T} X_{\hat{S}})^{- 1} X_{\hat{S}}^{T} X_{S_{O}} β_{S_{O}}^{*} ‖_{2} \leq \sqrt{N^{'} + d} ‖ (X_{\hat{S}}^{T} X_{\hat{S}})^{- 1} X_{\hat{S}}^{T} ‖_{2} ‖ X_{S_{O}} β_{S_{O}}^{*} ‖_{2} \leq \sqrt{N^{'} + d} ‖ (X_{\hat{S}}^{T} X_{\hat{S}})^{- 1} X_{\hat{S}}^{T} ‖_{2} ‖ X_{S_{O}} ‖_{F} ‖ β_{S_{O}}^{*} ‖_{\infty} \leq \sqrt{\frac{(N^{'} + ∣ S ∣) d}{ψ_{-} (N^{'}, S)}} \frac{2 ξ λ_{L}}{(ξ + 1) ζ} ≲ \sqrt{\frac{d^{2} q \log (p)}{M}} = o (1),

where the last relation follows from Assumption 2.
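The spectral-norm identity behind the first display of this argument, namely that the operator norm of $(X_{\hat{S}}^{T} X_{\hat{S}})^{- 1} X_{\hat{S}}^{T}$ equals one over the square root of N times the smallest eigenvalue of $X_{\hat{S}}^{T} X_{\hat{S}} ∕ N$, can be verified numerically. The sketch below uses a random Gaussian matrix `X_sel` as a hypothetical stand-in for $X_{\hat{S}}$; the sizes are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical sizes; X_sel plays the role of the selected design columns.
N, k = 200, 8
X_sel = rng.standard_normal((N, k))

gram = X_sel.T @ X_sel
A = np.linalg.solve(gram, X_sel.T)        # (X^T X)^{-1} X^T, shape k x N

op_norm = np.linalg.norm(A, 2)            # largest singular value of A
nu_min = np.linalg.eigvalsh(gram / N)[0]  # smallest eigenvalue of X^T X / N
closed_form = 1.0 / np.sqrt(N * nu_min)
print(op_norm, closed_form)
```

When the smallest eigenvalue is bounded away from zero, as the proof assumes, the norm therefore decays at the rate 1/√N.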

We proceed to bound the second component on the right-hand side of (A.22). We observe that

‖ (X_{\hat{S}}^{T} X_{\hat{S}})^{- 1} X_{\hat{S}}^{T} (y - X β^{*}) ‖_{1} \leq \sqrt{N^{'} + d} ‖ (X_{\hat{S}}^{T} X_{\hat{S}})^{- 1} X_{\hat{S}}^{T} (y - X β^{*}) ‖_{2} \leq \sqrt{N^{'} + d} {‖ (X_{\hat{S}}^{T} X_{\hat{S}} ∕ N)^{- 1} \frac{X_{\hat{S}}^{T} (y - X β^{*})}{N} ‖}_{2} \leq \sqrt{N^{'} + d} {‖ ∣ (X_{\hat{S}}^{T} X_{\hat{S}} ∕ N)^{- 1} ∣ ‖}_{2} {‖ \frac{X_{\hat{S}}^{T} (y - X β^{*})}{N} ‖}_{2} \leq (N^{'} + d) {‖ ∣ (X_{\hat{S}}^{T} X_{\hat{S}} ∕ N)^{- 1} ∣ ‖}_{2} {‖ \frac{X_{\hat{S}}^{T} (y - X β^{*})}{N} ‖}_{\infty} \leq \frac{N^{'} + d}{κ_{-} (N^{'}, S)} λ_{L} ≲ \sqrt{\frac{d^{2} q \log (p)}{M}} = o (1) .

From Lemma 6,

‖ {\hat{β}}_{S_{O}}^{init} - β_{S_{O}}^{*} ‖_{\infty} = ‖ β_{S_{O}}^{*} ‖_{\infty} ≲ \sqrt{\frac{q \log (p)}{M}},

which implies that

‖ {\hat{β}}_{S_{O}}^{init} - β_{S_{O}}^{*} ‖_{1} ≲ \sqrt{\frac{d^{2} q \log (p)}{M}} = o (1) .

By Lemma 11, the event u* ≤ λ_L(ξ − 1)/(ξ + 1) occurs with probability exceeding 1 – 1/p. Combined, we obtain the desired result.
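The passage from the sup-norm bound of Lemma 6 to the ℓ₁ bound on the omitted coordinates can be spelled out in one line. The display below is a sketch of that step, assuming (consistently with the notation above, but stated here as our reading) that S_O ⊆ S with ∣S∣ = d, and using that the initial estimator is zero outside $\hat{S}$:

```latex
\| \hat{\beta}^{init}_{S_O} - \beta^{*}_{S_O} \|_{1}
  = \| \beta^{*}_{S_O} \|_{1}
  \le |S_O| \, \| \beta^{*}_{S_O} \|_{\infty}
  \le d \, \| \beta^{*}_{S_O} \|_{\infty}
  \lesssim d \sqrt{\frac{q \log(p)}{M}}
  = \sqrt{\frac{d^{2} q \log(p)}{M}} .
```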

A.3. Empirical Evaluation of Assumption 4

To assess the stringency of Assumption 4 compared to the irrepresentability condition (Zhao and Yu, 2006), we conduct a simulation study similar to that of Zhao and Yu (2006), but customized to our linear mixed model setting.

Consider the model

y = X β^{*} + Z ν + ε,

with $X \in R^{n M \times p}$ , $Z \in R^{n M \times q M}$ , $β^{*} \in R^{p}$ , $ν \in R^{q M}$ and $ε \in R^{n M}$ . We consider q = 2 random effects, M = 25 groups, and n = 20 samples within each group. Among the p fixed-effect covariates, we set d to have nonzero coefficients and the rest to have zero coefficients. More specifically, we set $β^{*} = (\underbrace{1, \dots, 1}_{d}, \underbrace{0, \dots, 0}_{p - d})^{T}$ .

To assess the stringency of the two assumptions, in each of B = 1000 simulation replications, we randomly generate the design matrices X and Z jointly, with the rows of [X, Z_u] drawn i.i.d. from N(0, Σ), where Z_u is the un-blocked version of Z. The covariance matrix Σ is generated from a Wishart(p + q, I_{p+q}) distribution, and X and Z are scaled such that $‖ x_{j} ‖_{2}^{2} = n M$ and $‖ z_{j} ‖_{2}^{2} = n$ for each column. Following Zhao and Yu (2006), we consider p = 2^k for k ∈ {3, 4,…, 8}, and set d = tp/8 for t ∈ {1, 2,…, 7}.
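One replication of this design can be sketched as follows. This is an illustrative reading of the description, not the authors' code: the function name `simulate_design` is hypothetical, and we assume that "un-blocked" means Z_u holds the q random-effect covariates in an nM × q matrix, with Z obtained by placing each group's n × q block on the diagonal.

```python
import numpy as np

def simulate_design(n, M, p, q, rng):
    """Draw (X, Z, Sigma) for one replication (hypothetical helper)."""
    N = n * M
    # G^T G with G a (p+q) x (p+q) standard normal matrix is one draw
    # from Wishart(p + q, I_{p+q}).
    G = rng.standard_normal((p + q, p + q))
    Sigma = G.T @ G
    # Rows of W @ G are i.i.d. N(0, G^T G) = N(0, Sigma).
    rows = rng.standard_normal((N, p + q)) @ G
    X, Z_u = rows[:, :p], rows[:, p:]
    # Scale each column of X to squared norm n * M.
    X = X * np.sqrt(N) / np.linalg.norm(X, axis=0)
    # Assumed blocking: group g owns rows g*n..(g+1)*n and columns g*q..(g+1)*q.
    Z = np.zeros((N, q * M))
    for g in range(M):
        r = slice(g * n, (g + 1) * n)
        Z[r, g * q:(g + 1) * q] = Z_u[r]
    # Scale each column of Z to squared norm n.
    Z = Z * np.sqrt(n) / np.linalg.norm(Z, axis=0)
    return X, Z, Sigma

n, M, p, q, d = 20, 25, 16, 2, 4   # settings from the text, with p = 16 and d = 2p/8
X, Z, Sigma = simulate_design(n, M, p, q, np.random.default_rng(0))
beta_star = np.concatenate([np.ones(d), np.zeros(p - d)])
print(X.shape, Z.shape, beta_star)
```

Generating the rows as `W @ G` avoids a separate factorization of Σ, which can be nearly singular when the Wishart degrees of freedom equal the dimension.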

Let A* be the index set of the true active variables (hence ∣A*∣ = d). Further, let S ⊂ (A*)^c with ∣S∣ = min(p – d, p) be a random subset of variables with zero coefficients. For j ∈ A*, let $\tilde{X} = [X_{A^{*} ∖ \{j\}}, X_{S}, Z]$ be the augmented design matrix, and denote $\hat{Σ} = {\hat{X}}^{T} \hat{X} ∕ N$ .

With the above notations, the irrepresentability condition is satisfied if T_IR = max_j∈A* T_IR,j < 1, where

T_{IR, j} = ‖ {\hat{Σ}}_{(A^{*} ∖ \{j\})^{c}, A^{*} ∖ \{j\}} {({\hat{Σ}}_{A^{*} ∖ \{j\}, A^{*} ∖ \{j\}})}^{- 1} sign (β_{A^{*} ∖ \{j\}}^{*}) ‖_{\infty} .
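For intuition, the standard form of the irrepresentability statistic of Zhao and Yu (2006), the sup-norm of $\hat{Σ}_{A^{c}, A} \hat{Σ}_{A, A}^{- 1} sign (β_{A})$ for an active set A, can be computed as in the sketch below. The function name `irrepresentability_stat` and the toy design are hypothetical; the per-j variant with the augmented design used in this section would call it once for each j ∈ A*.

```python
import numpy as np

def irrepresentability_stat(X, active, signs):
    """Sup-norm of Sigma_{A^c,A} Sigma_{A,A}^{-1} sign(beta_A), with Sigma = X^T X / N."""
    N, p = X.shape
    active = np.asarray(active)
    inactive = np.setdiff1d(np.arange(p), active)
    Sigma_hat = X.T @ X / N
    S_AA = Sigma_hat[np.ix_(active, active)]
    S_cA = Sigma_hat[np.ix_(inactive, active)]
    return np.max(np.abs(S_cA @ np.linalg.solve(S_AA, signs)))

# Toy design: the inactive column is the sum of the two active columns,
# so its regression coefficients on the active set are (1, 1).
rng = np.random.default_rng(3)
X_demo = rng.standard_normal((100, 3))
X_demo[:, 2] = X_demo[:, 0] + X_demo[:, 1]
t_ir = irrepresentability_stat(X_demo, active=[0, 1], signs=np.array([1.0, 1.0]))
print(t_ir)  # 2 up to rounding, so the condition T_IR < 1 fails here
```

The toy example shows why the condition is demanding: a single inactive covariate that is strongly explained by the active ones is enough to violate it.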

Assumption 4 involves a related quantity, $T_{4, j} = ‖ Γ_{\tilde{X}} x_{j} ‖_{\infty}$ , where $Γ_{\tilde{X}} D_{\tilde{X}} Γ_{\tilde{X}}^{T}$ is the eigen-decomposition of $\tilde{X} {(\tilde{X} {\tilde{X}}^{T})}^{- 1} {\tilde{X}}^{T}$ . This assumption is satisfied if T₄ ≡ max_j∈A* T_4,j = O(1). Thus, to satisfy Assumption 4, we need a constant C, not dependent on N and p, such that T₄ < C. While C can be any large but fixed constant, in this simulation we consider a moderate value of C = 5.

The proportions of simulated data sets in which the irrepresentability condition and Assumption 4 are satisfied are shown in Tables 2 and 3, respectively. As in Zhao and Yu (2006), the results in Table 2 indicate that the irrepresentability condition can be stringent, especially as the dimension p and the number of nonzero coefficients d increase. In contrast, the results in Table 3 suggest that, for C = 5, Assumption 4 is much more likely to hold. Moreover, the proportion of cases for which this assumption holds changes only mildly with p and d, remaining above 0.92 in every setting considered. While the appropriate choice of C is generally unknown, the results in this simulation suggest that even with moderate values (in this case C = 5) Assumption 4 is likely satisfied.

Table 2:

Proportions of cases satisfying the irrepresentability condition (Zhao and Yu, 2006).

T_IR < 1	p = 8	p = 16	p = 32	p = 64	p = 128	p = 256
d = p/8	1	1	0.975	0.823	0.327	0.022
d = 2p/8	1	0.735	0.283	0.013	0	0
d = 3p/8	0.736	0.231	0.014	0	0	0
d = 4p/8	0.356	0.062	0.001	0	0	0
d = 5p/8	0.244	0.024	0	0	0	0
d = 6p/8	0.208	0.015	0	0	0	0
d = 7p/8	0.199	0.012	0	0	0	0


Table 3:

Proportions of cases satisfying Assumption 4.

T₄ < 5	p = 8	p = 16	p = 32	p = 64	p = 128	p = 256
d = p/8	0.998	0.999	0.997	0.996	0.994	0.99
d = 2p/8	1	1	0.996	0.995	0.99	0.982
d = 3p/8	0.998	0.999	0.994	0.992	0.985	0.973
d = 4p/8	0.999	0.998	0.991	0.995	0.984	0.954
d = 5p/8	0.996	0.995	0.993	0.989	0.976	0.938
d = 6p/8	1	0.997	0.993	0.987	0.967	0.93
d = 7p/8	0.996	0.995	0.995	0.988	0.956	0.927


References

Bühlmann P. Statistical significance in high-dimensional linear models. Bernoulli, 19(4):1212–1242, 2013. ISSN 1350-7265. doi: 10.3150/12-BEJSP11.
Bühlmann P, Kalisch M, and Meier L. High-dimensional statistics with a view toward applications in biology. Annual Review of Statistics and Its Application, 1:255–278, 2014.
Harville DA. Maximum likelihood approaches to variance component estimation and to related problems. J. Amer. Statist. Assoc., 72(358):320–340, 1977. ISSN 0162-1459. URL http://links.jstor.org.offcampus.lib.washington.edu/sici?sici=0162-1459(197706)72:358<320:MLATVC>2.0.CO;2-9&origin=MSN. With a comment by J. N. K. Rao and a reply by the author.
Henderson CR. Estimation of variance and covariance components. Biometrics, 9:226–252, 1953. ISSN 0006-341X. doi: 10.2307/3001853.
Javanmard A and Montanari A. Confidence intervals and hypothesis testing for high-dimensional regression. Journal of Machine Learning Research, 15:2869–2909, 2014. URL http://jmlr.org/papers/v15/javanmard14a.html.
Jiang J. REML estimation: asymptotic behavior and related topics. Ann. Statist., 24(1):255–286, 1996. ISSN 0090-5364. doi: 10.1214/aos/1033066209.
Lee JD, Sun DL, Sun Y, and Taylor JE. Exact post-selection inference, with application to the lasso. Ann. Statist., 44(3):907–927, 2016. ISSN 0090-5364. doi: 10.1214/15-AOS1371.
Li Y, Wang S, Song PX-K, Wang N, Zhou L, and Zhu J. Doubly regularized estimation and selection in linear mixed-effects models for high-dimensional longitudinal data. Stat. Interface, 11(4):721–737, 2018. ISSN 1938-7989. doi: 10.4310/SII.2018.v11.n4.a15.
Lockhart R, Taylor J, Tibshirani RJ, and Tibshirani R. A significance test for the lasso. Ann. Statist., 42(2):413–468, 2014. ISSN 0090-5364. doi: 10.1214/13-AOS1175.
Meinshausen N and Bühlmann P. Stability selection. J. R. Stat. Soc. Ser. B Stat. Methodol., 72(4):417–473, 2010. ISSN 1369-7412. doi: 10.1111/j.1467-9868.2010.00740.x.
Meinshausen N, Meier L, and Bühlmann P. p-values for high-dimensional regression. J. Amer. Statist. Assoc., 104(488):1671–1681, 2009. ISSN 0162-1459. doi: 10.1198/jasa.2009.tm08647.
Ning Y and Liu H. A general theory of hypothesis tests and confidence regions for sparse high dimensional models. Ann. Statist., 45(1):158–195, 2017. ISSN 0090-5364. doi: 10.1214/16-AOS1448.
Schelldorfer J, Bühlmann P, and van de Geer S. Estimation for high-dimensional linear mixed-effects models using ℓ₁-penalization. Scand. J. Stat., 38(2):197–214, 2011. ISSN 0303-6898. doi: 10.1111/j.1467-9469.2011.00740.x.
Searle SR. Another look at Henderson’s methods of estimating variance components. Biometrics, 24(4):749–787, 1968. ISSN 0006341X, 15410420. URL http://www.jstor.org/stable/2528870.
Shah RD and Samworth RJ. Variable selection with error control: another look at stability selection. J. R. Stat. Soc. Ser. B Stat. Methodol., 75(1):55–80, 2013. ISSN 1369-7412. doi: 10.1111/j.1467-9868.2011.01034.x.
Shao J and Deng X. Estimation in high-dimensional linear models with deterministic design matrices. Ann. Statist., 40(2):812–831, 2012. ISSN 0090-5364. doi: 10.1214/12-AOS982.
Sun T and Zhang C-H. Scaled sparse linear regression. Biometrika, 99(4):879–898, 2012. ISSN 0006-3444. doi: 10.1093/biomet/ass043.
Tibshirani RJ, Taylor J, Lockhart R, and Tibshirani R. Exact Post-Selection Inference for Sequential Regression Procedures. ArXiv e-prints, Jan. 2014.
van de Geer S, Bühlmann P, Ritov Y, and Dezeure R. On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist., 42(3):1166–1202, 2014. ISSN 0090-5364. doi: 10.1214/14-AOS1221.
Wasserman L and Roeder K. High-dimensional variable selection. Ann. Statist., 37(5A):2178–2201, 2009. ISSN 0090-5364. doi: 10.1214/08-AOS646.
Ye F and Zhang C-H. Rate minimaxity of the lasso and Dantzig selector for the lq loss in lr balls. J. Mach. Learn. Res., 11:3519–3540, Dec. 2010. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=1756006.1953043.
Yu Y, Bradic J, and Samworth RJ. Confidence intervals for high-dimensional Cox models. ArXiv e-prints, Sept. 2018.
Zhang C-H and Zhang SS. Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Stat. Soc. Ser. B Stat. Methodol., 76(1):217–242, 2014. ISSN 1369-7412. doi: 10.1111/rssb.12026.
Zhao P and Yu B. On model selection consistency of lasso. Journal of Machine Learning Research, 7(Nov):2541–2563, 2006.

