Author manuscript; available in PMC: 2022 Apr 29.
Published in final edited form as: FODS 20 (2020). 2020 Oct 18;2020:171–181. doi: 10.1145/3412815.3416883

Statistical significance in high-dimensional linear mixed models

Lina Lin a, Mathias Drton b, Ali Shojaie c
PMCID: PMC9053448  NIHMSID: NIHMS1735582  PMID: 35497571

Abstract

This paper concerns the development of an inferential framework for high-dimensional linear mixed effect models. These are suitable models, for instance, when we have n repeated measurements for M subjects. We consider a scenario where the number of fixed effects p is large (and may be larger than M), but the number of random effects q is small. Our framework is inspired by a recent line of work that proposes de-biasing penalized estimators to perform inference for high-dimensional linear models with fixed effects only. In particular, we demonstrate how to correct a ‘naive’ ridge estimator in extension of work by Bühlmann (2013) to build asymptotically valid confidence intervals for mixed effect models. We validate our theoretical results with numerical experiments, in which we show our method outperforms those that fail to account for correlation induced by the random effects. For a practical demonstration we consider a riboflavin production dataset that exhibits group structure, and show that conclusions drawn using our method are consistent with those obtained on a similar dataset without group structure.

1. Introduction

Modern statistical problems are increasingly high-dimensional, with the number of covariates p potentially vastly exceeding the sample size N. This is due in part to technological advances that facilitate data collection. For instance, we are now able to measure the expression of many genes in a given specimen at little cost. However, it often remains expensive to have many replicates/species to experiment on, resulting in $N \ll p$.

Fortunately, significant progress has been made in developing rigorous statistical tools for tackling such problems. While earlier work largely targeted point estimation and/or variable selection, recent years have seen a number of proposals on how to also assign uncertainty, statistical significance and confidence in high-dimensional models. This is of great practical importance, particularly when interpretation of parameters and variables is of key priority.

Early attempts are highly varied in their approach. Stability selection was proposed by Meinshausen and Bühlmann (2010) as a generic method for controlling the expected number of false positive selections, with improvements given by Shah and Samworth (2013). Sample splitting, where a first subsample is used to screen and a second subsample is used to perform inference (Wasserman and Roeder, 2009; Meinshausen et al., 2009), has also been explored. Taking an alternative approach, Lockhart et al. (2014), Tibshirani et al. (2014) and Lee et al. (2016) build a framework for conditional inference for high-dimensional linear models, i.e., they conduct inference given that some covariates have been selected.

In this paper, we propose an unconditional inferential framework for high-dimensional linear mixed effect models, with the goal of testing null hypotheses of the form

$$H_{0,G}: \beta^*_j = 0 \ \text{ for all } j \in G, \tag{1.1}$$

where $\beta^* \in \mathbb{R}^p$ is the vector of fixed effect regression coefficients, and G may be any subset of {1,…,p}. Of particular interest is the case G = {j}, i.e., testing if a single fixed effect coefficient $\beta^*_j$ is zero. A related goal is to construct confidence intervals for $\beta^*_j$, j = 1,…,p. This problem arises naturally in many settings, as observations are rarely independent. A prime example is the analysis of longitudinal data, which is highly prevalent in clinical studies. In such settings mixed effect models are a natural extension of linear models for modeling data exhibiting group-structured dependence.

Our framework is inspired by a recent line of work that proposes de-biasing penalized estimators as an approach to inference for high-dimensional linear models with fixed effects only. There, the limiting distribution of the modified estimator is tractable and, thus, can be used to construct approximate p-values and confidence intervals. For example, in high-dimensional linear regression, Zhang and Zhang (2014), van de Geer et al. (2014), and Javanmard and Montanari (2014) suggest de-sparsifying the lasso: starting with the biased lasso estimator, the authors ‘invert’ the corresponding Karush-Kuhn-Tucker (KKT) optimality conditions to form an estimator that is approximately unbiased for β* and normally distributed. By construction, the de-biased estimator can then be used to derive confidence intervals and p-values. Ning and Liu (2017) extended this strategy by developing a score test for inference in penalized M-estimators.

Our proposed method bears strongest resemblance to Bühlmann (2013). Developed for high-dimensional linear models, the framework of Bühlmann (2013) is similar to those put forth by Zhang and Zhang (2014), van de Geer et al. (2014), and Javanmard and Montanari (2014), except it uses ridge estimation as a starting point. While the overall framework we consider is similar, there are important differences in the specifics of how to correct (or rather, approximately correct) for the bias in the ridge estimator, and how to compute an approximation of the limiting distribution of the de-biased estimator, to construct p-values and confidence intervals for elements in β*. As will be evident later, these differences are direct results of having to cope with dependencies induced by the random effects in the linear mixed effect model. The naive treatment of ignoring the dependencies, as we demonstrate in numerical examples, leads to poor practical performance (in particular, when the estimator is inverted to obtain confidence intervals, the intervals have insufficient coverage). We address this issue by introducing a two-stage procedure that yields consistent estimates of the parameters that determine these dependencies. While we describe a ridge-based framework, the methodology could be extended to make use of other high-dimensional estimators as the starting point for constructing a de-biased estimator.

Our decision to use a ridge estimator is based on simulation findings for standard linear models showing that while asymptotically optimal, confidence intervals from $\ell_1$-based de-biasing (Zhang and Zhang, 2014; van de Geer et al., 2014; Javanmard and Montanari, 2014) tend to have coverage problems in finite samples. Yu et al. (2018) similarly noticed that confidence intervals based on a de-biased $\ell_1$-estimator for the high-dimensional Cox model had poorer than theoretical coverage in practice. Although its theoretical justification is similar, the ridge-based method of Bühlmann (2013) yields better finite-sample error control.

Our paper is organized as follows. The remainder of this section provides a brief overview of the subsequent notation. Section 2 makes explicit the form of the high-dimensional linear mixed effect model we are working with. In Section 3, we describe the details of our method: specifically, how it builds upon Bühlmann (2013) to accommodate dependence within groups induced by the random effects. We also present theory, along with the required assumptions, to justify it. Numerical experiments can be found in Section 4, followed by a practical application of the method in Section 5. We conclude with a discussion and elaborate on potential extensions in Section 6. Proofs are collected in the Appendix.

Notation

Matrices are written in upper-case bold-face and their entries in corresponding lower-case. So $a_{jk}$ is the (j, k)th entry of matrix $A \in \mathbb{R}^{n_1 \times n_2}$. For j ∈ {1,…,n2} and J ⊆ {1,…,n2}, $a_j$ and $A_J$ denote the jth column of A and the column-wise concatenation of columns in A indexed by the set J, respectively. The ith row of A is denoted $a_{(i)}$. For r ∈ [1, ∞], the $\ell_r$ norm of a vector $u \in \mathbb{R}^n$ is $\|u\|_r = (\sum_{i=1}^{n} |u_i|^r)^{1/r}$, and the induced norm of a matrix $A \in \mathbb{R}^{n_1 \times n_2}$ is $|||A|||_r = \sup\{\|Ax\|_r : x \in \mathbb{R}^{n_2}, \|x\|_r = 1\}$. With this notation, $|||A|||_2$ is the spectral norm, $|||A|||_1$ the maximum absolute column sum of the matrix, and $|||A|||_\infty$ the maximum absolute row sum of the matrix. We use $\|A\|_r$ to denote the $\ell_r$ norm of the vectorization of A.

The projection of $\mathbb{R}^{n_2}$ onto the linear space generated by the rows of A is denoted $P_A = A(A^TA)^{-}A^T$, where the superscript $^{-}$ denotes the Moore-Penrose inverse. For square matrices $A_1$ and $A_2$ of the same dimensions, $A_1 \preceq A_2$ indicates that $A_2 - A_1$ is positive semi-definite.

For real-valued functions g1(x) and g2(x) defined on (0, ∞), we write g1(x) ≲ g2(x) if there is a constant c ∈ (0, ∞) such that g1(x) ≤ cg2(x), and g1(x) ≳ g2(x) if instead g1(x) ≥ cg2(x). We write g1(x) ≍ g2(x) if both g1(x) ≲ g2(x) and g1(x) ≳ g2(x). Then, g1(x) = o(g2(x)) if g1(x)/g2(x) → 0 as x → ∞, and g1(x) = O(g2(x)) if there is a c ∈ (0, ∞) such that ∣g1(x)∣ ≤ cg2(x) for all x large enough. The latter relations also apply when x is a vector, where x → ∞ is interpreted elementwise. Finally, if $X \in \mathbb{R}$ is a random variable and $a \in \mathbb{R}$ is some constant, we write $|X - a| = o_P(1)$ if X converges to a in probability, i.e., $X \xrightarrow{p} a$.

2. The linear mixed effect model

Consider M groups of observations of sizes n1,…,nM. Let m = 1,…, M be group indices, and let i = 1,…,nm index the observations within group m. Let N be the total number of observations, so $N = \sum_{m=1}^{M} n_m$. We may later assume, without loss of generality, that nm = n for all groups, so that N = nM. The proposed framework allows for non-uniform group sizes with minor adjustments, so long as the group sizes are of the same order.

For group m ∈ {1,…,M}, we observe the response vector $y_m \in \mathbb{R}^n$, generated as

$$y_m = X_m\beta^* + Z_m v_m + \epsilon_m, \quad m = 1,\dots,M, \tag{2.1}$$

with

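  1. $\beta^* \in \mathbb{R}^p$, an unknown vector of fixed regression coefficients;

  2. $v_m \in \mathbb{R}^q$, m = 1,…,M, vectors of group-specific random effects, with $v_m \overset{\text{i.i.d.}}{\sim} N(0, \Psi^*)$ and $\Psi^*$ an unknown q × q positive definite covariance matrix;

  3. errors $\epsilon_m \overset{\text{i.i.d.}}{\sim} N(0, \sigma^{*2} I_{n \times n})$ for unknown $\sigma^{*2}$, which are independent of $v_1,\dots,v_M$; and

  4. $X_m \in \mathbb{R}^{n \times p}$ and $Z_m \in \mathbb{R}^{n \times q}$ known design matrices.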

By construction, β* represents effects shared across groups while $v_m$, m = 1,…,M, represent group-specific deviations. It will be convenient to write the model more compactly. Define vectors $y = [y_1^T,\dots,y_M^T]^T$, $v = [v_1^T,\dots,v_M^T]^T$, $\epsilon = [\epsilon_1^T,\dots,\epsilon_M^T]^T$, a stacked matrix $X = [X_1^T,\dots,X_M^T]^T$, and Z = diag(Z1,…,ZM). Then we can write (2.1) as

$$y = X\beta^* + Zv + \epsilon. \tag{2.2}$$

Marginalizing out the random effects yields

$$y \sim N\big(X\beta^*, V(\sigma^{*2}, \Psi^*)\big) \quad \text{with} \quad V(\sigma^{*2}, \Psi^*) = \sigma^{*2} I_{N \times N} + Z\Psi^{*(B)}Z^T, \tag{2.3}$$

where $\Psi^{*(B)} = I_{M \times M} \otimes \Psi^*$. This implies that $V(\sigma^{*2}, \Psi^*)$ is block-diagonal and observations belonging to different groups are independent. Thus, the inclusion of random effects only induces dependencies between observations belonging to the same group. We will be primarily working with the marginal form (2.3) in subsequent sections.
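The block structure of the marginal covariance in (2.3) can be checked numerically. The following is a minimal sketch, assuming NumPy; the function name `marginal_covariance`, the toy dimensions, and the choice of an identity Ψ are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def marginal_covariance(Z_blocks, sigma2, Psi):
    """Assemble Z = diag(Z_1, ..., Z_M) and V = sigma2 * I + Z (I_M kron Psi) Z^T."""
    M = len(Z_blocks)
    n, q = Z_blocks[0].shape
    N = n * M
    Z = np.zeros((N, q * M))
    for m, Zm in enumerate(Z_blocks):
        Z[m * n:(m + 1) * n, m * q:(m + 1) * q] = Zm
    Psi_B = np.kron(np.eye(M), Psi)          # block-diagonal covariance of all random effects
    V = sigma2 * np.eye(N) + Z @ Psi_B @ Z.T
    return Z, V

# Toy example (assumed sizes): M = 4 groups, n = 3 observations each, q = 2 random effects.
rng = np.random.default_rng(0)
M, n, q = 4, 3, 2
Z_blocks = [rng.standard_normal((n, q)) for _ in range(M)]
Z, V = marginal_covariance(Z_blocks, sigma2=0.25, Psi=np.eye(q))

# Observations from different groups are uncorrelated: off-diagonal blocks vanish.
off_block = V[:n, n:]
```

In this sketch every off-diagonal block of `V` is zero, so only observations within the same group are dependent.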

We study the presented model under the following assumptions:

  1. High dimensions: We allow p, the number of fixed regression coefficients, to be possibly much larger than N. On the other hand, q, the number of random effect variables, is assumed to be of constant order, or at least smaller than n.

  2. Sparsity of β*: We assume β* to be sparse in the sense that most of its elements are zero: a more precise specification on the level of sparsity required is detailed in Section 3.2.

  3. Structure of Ψ*: Our paper primarily considers the scenario of Ψ* = τ*2Iq×q. However, our method, and corresponding theoretical results, can be extended to accommodate the more general scenario of Ψ* = D* where D* is a diagonal q × q matrix.

  4. Standardization of design matrices: The design matrices X and Z are assumed fixed and standardized with $\|x_j\|_2^2 = N$ for j ∈ {1,…,p} and $\|z_j\|_2^2 = n$ for j ∈ {1,…,qM}.

3. A ridge-based inferential framework

We would like to test null hypotheses of the form (1.1), i.e., $H_{0,G}: \beta^*_j = 0$ for all $j \in G$, for subsets G ⊂ {1,…,p}, and construct confidence intervals for $\beta^*_j$. This section formally introduces our inferential framework. We first describe its foundation, the de-biased ridge estimator, and show how it can be used to accomplish these tasks. We then detail how to assemble the components needed to construct this de-biased ridge estimator and approximate its limiting distribution. Theoretical justification of our approach is provided along the way.

3.1. A de-biased ridge estimator

As in Bühlmann (2013), our starting point is the ridge estimator given by

$$\hat\beta = \arg\min_{\beta \in \mathbb{R}^p} \frac{\|y - X\beta\|_2^2}{N} + \lambda\|\beta\|_2^2. \tag{3.1}$$

This estimator is natural in models with homoscedastic and uncorrelated errors, but in the linear mixed effect model the random effects induce correlation. We thus refer to $\hat\beta$ from (3.1) as the ‘naive’ ridge estimator. The estimator has a simple closed form expression,

$$\hat\beta = N^{-1}(\hat\Sigma + \lambda I_{p \times p})^{-1}X^T y, \tag{3.2}$$

where $\hat\Sigma = X^TX/N$. It is straightforward to show that the ridge estimator is normally distributed with covariance matrix, multiplied by a factor of N,

$$\Omega^* = (\hat\Sigma + \lambda I_{p \times p})^{-1}X^T V(\sigma^{*2}, \tau^{*2}) X(\hat\Sigma + \lambda I_{p \times p})^{-1}/N. \tag{3.3}$$
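The closed form (3.2) and the scaled covariance (3.3) translate directly into a few lines of linear algebra. The snippet below is a minimal sketch, assuming NumPy, a random-intercept design (q = 1) and known variance parameters; the function names and toy sizes are illustrative assumptions.

```python
import numpy as np

def naive_ridge(X, y, lam):
    """Closed-form ridge estimator (3.2): N^{-1} (Sigma_hat + lam I)^{-1} X^T y."""
    N, p = X.shape
    A = X.T @ X / N + lam * np.eye(p)
    return np.linalg.solve(A, X.T @ y) / N

def ridge_covariance(X, V, lam):
    """N times the covariance of the ridge estimator under Cov(y) = V, as in (3.3)."""
    N, p = X.shape
    A_inv = np.linalg.inv(X.T @ X / N + lam * np.eye(p))
    return A_inv @ X.T @ V @ X @ A_inv / N

# Toy data (assumed sizes): M = 5 groups of n = 4 observations, p = 30 > N = 20.
rng = np.random.default_rng(1)
M, n, p = 5, 4, 30
N = M * n
X = rng.standard_normal((N, p))
Z = np.kron(np.eye(M), np.ones((n, 1)))      # random intercept per group
sigma2, tau2 = 0.25, 1.0
V = sigma2 * np.eye(N) + tau2 * Z @ Z.T
y = X[:, :3] @ np.ones(3) + rng.multivariate_normal(np.zeros(N), V)

lam = 1.0 / N
beta_hat = naive_ridge(X, y, lam)
Omega = ridge_covariance(X, V, lam)

# Gradient of the objective in (3.1) at beta_hat; it should vanish at the minimizer.
grad = -2 * X.T @ (y - X @ beta_hat) / N + 2 * lam * beta_hat
```

The gradient check confirms that the closed form minimizes (3.1); the diagonal of `Omega` holds the quantities $\omega^*_{jj}$ used later for normalization.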

As in Bühlmann (2013), we assume that the diagonal entries of $\Omega^* = (\omega^*_{jk})$ satisfy

$$\omega^*_{\min} \equiv \min_{j \in \{1,\dots,p\}} \omega^*_{jj} > 0. \tag{3.4}$$

Likewise, we do not require (3.4) to be bounded away from 0 as a function of N or p. This condition, in fact, is fairly mild; it is only violated under special kinds of design matrices. To illustrate, define R ≡ rank(X) and let $X = QD\Gamma^T$ be the singular value decomposition with left singular vectors $Q \in \mathbb{R}^{N \times N}$ satisfying $Q^TQ = I_{N \times N}$, $D \in \mathbb{R}^{N \times N}$ a diagonal matrix with entries $s_1 \ge \dots \ge s_N$ (i.e., singular values of X), and right singular vectors $\Gamma \in \mathbb{R}^{p \times N}$ satisfying $\Gamma^T\Gamma = I_{N \times N}$. Let νmin(A) and νmax(A) be the smallest and largest eigenvalue of any square matrix A, respectively. We can then show the following.

Lemma 1. Condition (3.4) holds if and only if $X \neq 0$ and

$$\min_{j \in \{1,\dots,p\}} \ \max_{k \in \{1,\dots,N\},\, s_k \neq 0} \Gamma_{jk}^2 > 0. \tag{3.5}$$

In the high-dimensional case with $R \le N < p$, the parameter β* is not identifiable: many vectors $\theta \in \mathbb{R}^p$ satisfy Xβ* = Xθ. A natural parameter to consider, as noted in Shao and Deng (2012), is $\theta^* = P_{X^T}\beta^* = X^T(XX^T)^{-}X\beta^* = \Gamma\Gamma^T\beta^*$, the projection of β* onto the linear space generated by the rows of X. As it turns out, under condition (3.4), or equivalently (3.5), the ridge estimator $\hat\beta$ is a reasonable proxy for θ* when λ is sufficiently small.

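Proposition 2. Suppose that λ > 0 and (3.4), or equivalently, (3.5), holds. Then, under our linear mixed effect model from Section 2, the ridge estimator (3.2) satisfies

$$\max_{j \in \{1,\dots,p\}} \big|E[\hat\beta_j] - \theta^*_j\big| \le \lambda\|\theta^*\|_2\, \nu_{\min,+}(\hat\Sigma)^{-1}, \qquad \min_{j \in \{1,\dots,p\}} \mathrm{Var}[\hat\beta_j] \ge N^{-1}\omega^*_{\min},$$

where $\nu_{\min,+}(\hat\Sigma)$ refers to the smallest non-zero eigenvalue of $\hat\Sigma$.

Proposition 2, which is proven in the Appendix, implies that the bias in estimating θ* with $\hat\beta$ is small when λ > 0 is sufficiently small. We explicitly quantify how small λ needs to be for the estimation bias to be smaller than the standard error of $\hat\beta$.

Corollary 3. Suppose that the ridge penalty parameter λ > 0 is chosen such that $\lambda \le \sqrt{\omega^*_{\min}}\,\nu_{\min,+}(\hat\Sigma)\big/\big(\sqrt{N}\,\|\theta^*\|_2\big)$, and that condition (3.4), or equivalently, (3.5) holds. Then,

$$\max_{j \in \{1,\dots,p\}} \big|E[\hat\beta_j] - \theta^*_j\big| \le \min_{j \in \{1,\dots,p\}} \sqrt{\mathrm{Var}[\hat\beta_j]}.$$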

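Our interest, however, lies in β*, not θ*. Thus, for $\hat\beta$ to be useful, we need to adjust $\hat\beta$ for the projection bias $B_j = \theta^*_j - \beta^*_j$. By definition of θ*, one observes that

$$B_j = (P_{X^T}\beta^*)_j - \beta^*_j = (P_{X^T})_{jj}\beta^*_j - \beta^*_j + \sum_{k \neq j}(P_{X^T})_{jk}\beta^*_k, \tag{3.6}$$

which, under the null hypothesis $H_{0,j}: \beta^*_j = 0$, becomes

$$B_{H_0,j} = \sum_{k \neq j}(P_{X^T})_{jk}\beta^*_k. \tag{3.7}$$

The quantity can be approximated by

$$\hat B_{H_0,j} = \sum_{k \neq j}(P_{X^T})_{jk}\hat\beta^{\text{init}}_k, \tag{3.8}$$

where $\hat\beta^{\text{init}}$ is a consistent initial estimator of β* (and consistency occurs under additional assumptions). Consider then the corrected ridge estimator $\hat\beta^{\text{corr}}_j$ as a statistic for testing $H_{0,j}$:

$$\hat\beta^{\text{corr}}_j = \hat\beta_j - \hat B_{H_0,j} = \hat\beta_j - \sum_{k \neq j}(P_{X^T})_{jk}\hat\beta^{\text{init}}_k. \tag{3.9}$$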
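The bias correction in (3.8) and (3.9) amounts to one matrix-vector product with the off-diagonal part of the projection $P_{X^T}$. The following is a minimal sketch, assuming NumPy; the function names are illustrative, and the hand-set `beta_init` is only a stand-in for a consistent initial estimator.

```python
import numpy as np

def row_space_projection(X):
    """P_{X^T} = X^T (X X^T)^- X, computed via the Moore-Penrose inverse of X."""
    return np.linalg.pinv(X) @ X

def corrected_ridge(beta_ridge, beta_init, P):
    """Subtract the estimated projection bias sum_{k != j} P_jk * beta_init_k, as in (3.9)."""
    P_off = P - np.diag(np.diag(P))          # zero out the diagonal so that k != j
    return beta_ridge - P_off @ beta_init

# Toy data (assumed sizes): N = 15 observations, p = 40 covariates.
rng = np.random.default_rng(2)
N, p = 15, 40
X = rng.standard_normal((N, p))
y = 2.0 * X[:, 0] + rng.standard_normal(N)

lam = 1.0 / N
beta_ridge = np.linalg.solve(X.T @ X / N + lam * np.eye(p), X.T @ y) / N
beta_init = np.zeros(p)
beta_init[0] = 2.0                           # stand-in for a consistent initial estimator
P = row_space_projection(X)
beta_corr = corrected_ridge(beta_ridge, beta_init, P)
```

Because only the first entry of `beta_init` is nonzero here, coordinate j = 0 receives no correction, while every other coordinate j is shifted by `P[j, 0] * 2.0`.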

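Assuming that $\omega^*_{\min} > 0$, we can write

$$\hat\beta^{\text{corr}}_j = W_j + \gamma_j,$$

where $W_j = \hat\beta_j - E[\hat\beta_j]$ is the centered ridge estimator and

$$\gamma_j = (P_{X^T})_{jj}\beta^*_j - \sum_{k \neq j}(P_{X^T})_{jk}\big(\hat\beta^{\text{init}}_k - \beta^*_k\big) + \delta_j, \qquad \delta_j = \delta_j(\lambda) = E[\hat\beta_j] - \theta^*_j.$$

A rearrangement of the above set of equations yields

$$\frac{\hat\beta^{\text{corr}}_j}{(P_{X^T})_{jj}} - \beta^*_j = \frac{W_j}{(P_{X^T})_{jj}} - \sum_{k \neq j}\frac{(P_{X^T})_{jk}}{(P_{X^T})_{jj}}\big(\hat\beta^{\text{init}}_k - \beta^*_k\big) + \frac{\delta_j}{(P_{X^T})_{jj}}. \tag{3.10}$$

Then, from model (2.3), it follows that

$$(W_1,\dots,W_p) \sim N(0, \Omega^*/N). \tag{3.11}$$

The normalizing factors needed to bring the $W_j$ to N(0, 1) scale are given by $\kappa_j = \kappa_j(N, p) = \sqrt{N/\omega^*_{jj}}$. The proof of the following theorem is straightforward.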

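Theorem 4. Suppose we choose the ridge penalty parameter λ > 0 such that

$$\lambda(\omega^*_{\min})^{-1/2} = o\Big(\nu_{\min,+}(\hat\Sigma)\big/\big(N^{1/2}\|\theta^*\|_2\big)\Big), \quad (N, p \to \infty), \tag{3.12}$$

and assume that for our choice of $\hat\beta^{\text{init}}$, there exist constants Cj = Cj(N, p) such that

$$P\Big[\bigcap_{j=1}^{p}\Big\{\Big|\kappa_j(N,p)\sum_{k \neq j}(P_{X^T})_{jk}\big(\hat\beta^{\text{init}}_k - \beta^*_k\big)\Big| \le C_j(N,p)\Big\}\Big] \to 1 \quad (N, p \to \infty). \tag{3.13}$$

Then, under the null hypothesis $H_{0,j}$, for all w > 0,

$$\limsup_{N,p \to \infty}\Big(P\big[\kappa_j|\hat\beta^{\text{corr}}_j| > w\big] - P\big[|\tilde W| + C_j > w\big]\Big) \le 0, \tag{3.14}$$

where $\tilde W \sim N(0, 1)$. In addition, for any sequence of subsets Gp ⊆ {1,…,p}, if $H_{0,G_p}$ is true, then for any w > 0,

$$\limsup_{N,p \to \infty}\Big(P\big[\max_{j \in G_p}\kappa_j|\hat\beta^{\text{corr}}_j| > w\big] - P\big[\max_{j \in G_p}\big(|\tilde W| + C_j\big) > w\big]\Big) \le 0. \tag{3.15}$$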

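In subsequent sections, we identify specific scalings of N and p such that Theorem 4 becomes applicable. Based on the asymptotic distributions in Theorem 4, we can construct p-values for testing $H_{0,G}$, G ⊆ {1,…,p}. For testing the individual null hypothesis $H_{0,j}$, we define the p-value for the two-sided alternative as

$$\varrho_j = 2\Big(1 - \Phi\big((\kappa_j|\hat\beta^{\text{corr}}_j| - C_j)_+\big)\Big), \tag{3.16}$$

where Φ is the standard normal distribution function. For testing the group null hypothesis $H_{0,G}$, ∣G∣ > 1, we define the p-value as

$$\varrho_G = 1 - P\Big[\max_{j \in G}\big(\kappa_j|W_j| + C_j\big) \le \max_{j \in G}\kappa_j|\hat\beta^{\text{corr}}_j|\Big], \tag{3.17}$$

where W1,…,Wp are as in (3.11) and the right-hand maximum is treated as the fixed observed value. From Theorem 4, we can derive the following corollary.

Corollary 5. Under the conditions in Theorem 4, for any α ∈ (0, 1), the following statements hold:

$$\limsup_{N,p \to \infty} P[\varrho_j \le \alpha] - \alpha \le 0 \ \text{ if } H_{0,j} \text{ is true}, \qquad \limsup_{N,p \to \infty} P[\varrho_G \le \alpha] - \alpha \le 0 \ \text{ if } H_{0,G} \text{ is true}.$$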
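The p-values in (3.16) and (3.17) can be sketched as follows, assuming NumPy. Approximating the probability in (3.17) by Monte Carlo draws of $(W_j)_{j \in G}$ is an assumption made for this illustration, and the function names, argument names, and toy inputs are hypothetical.

```python
import math
import numpy as np

def std_normal_cdf(x):
    """Standard normal distribution function Phi."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def individual_pvalue(beta_corr_j, kappa_j, C_j):
    """Two-sided p-value (3.16): 2 * (1 - Phi((kappa_j |beta_corr_j| - C_j)_+))."""
    stat = max(kappa_j * abs(beta_corr_j) - C_j, 0.0)
    return 2.0 * (1.0 - std_normal_cdf(stat))

def group_pvalue(beta_corr, kappa, C, Omega, N, G, n_draws=20000, seed=0):
    """Monte Carlo approximation of the group p-value (3.17)."""
    rng = np.random.default_rng(seed)
    G = np.asarray(G, dtype=int)
    cov = Omega[np.ix_(G, G)] / N            # covariance of (W_j)_{j in G}, see (3.11)
    W = rng.multivariate_normal(np.zeros(G.size), cov, size=n_draws)
    simulated = np.max(kappa[G] * np.abs(W) + C[G], axis=1)
    observed = np.max(kappa[G] * np.abs(beta_corr[G]))
    return 1.0 - np.mean(simulated <= observed)

# Toy inputs (assumed): identity Omega, so kappa_j = sqrt(N) for every j.
N, p = 100, 4
Omega = np.eye(p)
kappa = np.sqrt(N / np.diag(Omega))
C = np.zeros(p)
beta_corr = np.array([0.5, 0.0, 0.0, 0.0])

p_single = individual_pvalue(beta_corr[0], kappa[0], C[0])
p_group_signal = group_pvalue(beta_corr, kappa, C, Omega, N, G=[0, 1])
p_group_null = group_pvalue(beta_corr, kappa, C, Omega, N, G=[2, 3])
```

Larger values of $C_j$ shrink the test statistic in (3.16) toward zero, so the resulting p-values are conservative by design.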

3.2. Consistent estimation of variance parameters

As presented, the de-biased ridge framework depends on the values of the unknown parameters σ*2 and τ*2. We employ a two-step approach to consistent estimation of these parameters.

  1. Let $S = \{j : \beta^*_j \neq 0\}$ be the support of β*, with cardinality d = ∣S∣. We use the Lasso estimator $\hat\beta^L = \arg\min_{\beta \in \mathbb{R}^p} \|y - X\beta\|_2^2/N + 2\lambda_L\|\beta\|_1$ with an appropriate choice of tuning parameter λL to identify an initial guess of the elements (i.e., indices) in S. We define $\hat S = \{j : \hat\beta^L_j \neq 0\}$ as our guess for the support S. By properties of the Lasso, $|\hat S| \le N$, although, in general, $\hat S$ may not be a good estimate of S.

  2. Working with the (potentially misspecified) random effects model
    $$y = X_{\hat S}\beta_{\hat S} + Zb + \epsilon, \tag{3.18}$$

    we apply Henderson’s Method III (Henderson, 1953) to form estimates $\hat\sigma^2$ and $\hat\tau^2$. Henderson’s Method III is particularly tractable theoretically and enables us to study consistency in the scenario where (3.18) is actually misspecified, i.e., $|S \setminus \hat S| > 0$. For a discussion of Henderson’s methods and the appeals of Method III, see Searle (1968).

In recent years Henderson’s methods have largely been supplanted by alternatives such as restricted maximum likelihood (REML) for variance component estimation (Harville, 1977); it is customary to refer to variances of random effects as variance components. We thus provide a brief overview of what Henderson’s Method III entails. Consider, first, the low-dimensional model (2.3) with p < N. To simplify the notation in the following explanation, we momentarily define $\tilde X = [X \ Z]$. By not distinguishing between fixed and random effects, the idea behind Henderson’s methods is to match the differences in the reductions in the sum-of-squares between sub-models of (2.3) to their expected values, not unlike a method-of-moments approach. To elaborate, in fitting (2.3) to data y, the reduction in the sum of squares is

$$R(\beta, v) = y^T P_{\tilde X}\, y. \tag{3.19}$$

Likewise, the decrease in the sum of squares due to fitting the reduced model y = Xβ + ϵ is

$$R(\beta) = y^T P_X\, y. \tag{3.20}$$

The expected difference in the reductions $R(v \mid \beta) \equiv R(\beta, v) - R(\beta)$ is

$$E[R(v \mid \beta)] = \tau^{*2}\,\mathrm{tr}\big(Z^T[I_{N \times N} - P_X]Z\big) + \sigma^{*2}\big[\mathrm{rank}(\tilde X) - \mathrm{rank}(X)\big]. \tag{3.21}$$

Moreover,

$$E[y^Ty - R(\beta, v)] = \sigma^{*2}\big[N - \mathrm{rank}(\tilde X)\big]. \tag{3.22}$$

Together, (3.21) and (3.22), when matching theoretical expectations to empirical averages, form a triangular system of linear equations, from which we derive $\hat\sigma^2$ and $\hat\tau^2$. We find

$$\hat\sigma^2 = \frac{y^T(I_{N \times N} - P_{\tilde X})y}{N - \mathrm{rank}(\tilde X)}, \tag{3.23}$$
$$\hat\tau^2 = \frac{y^T(P_{\tilde X} - P_X)y - \hat\sigma^2\big[\mathrm{rank}(\tilde X) - \mathrm{rank}(X)\big]}{\mathrm{tr}\big(Z^T(I_{N \times N} - P_X)Z\big)}. \tag{3.24}$$

It is straightforward to see that the $\hat\sigma^2$ and $\hat\tau^2$ generated from (3.23) and (3.24) are unbiased, presuming that the true model is y = Xβ + Zv + ϵ. For consistency, some additional assumptions are needed, which we will discuss later in this section.
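The estimators (3.23) and (3.24) can be computed with two projections and two rank evaluations. The following is a minimal sketch in the low-dimensional setting, assuming NumPy and a random-intercept design; the function names and simulated values are illustrative assumptions, and forming dense N × N projections is only practical for moderate N.

```python
import numpy as np

def column_projection(A):
    """Orthogonal projection onto the column space of A, via the Moore-Penrose inverse."""
    return A @ np.linalg.pinv(A)

def henderson_method_3(y, X, Z):
    """Method-of-moments estimates (3.23) and (3.24) of sigma^2 and tau^2."""
    N = y.shape[0]
    X_tilde = np.hstack([X, Z])
    P_X, P_Xt = column_projection(X), column_projection(X_tilde)
    r_X, r_Xt = np.linalg.matrix_rank(X), np.linalg.matrix_rank(X_tilde)
    sigma2_hat = y @ (y - P_Xt @ y) / (N - r_Xt)
    numerator = y @ (P_Xt - P_X) @ y - sigma2_hat * (r_Xt - r_X)
    denominator = np.trace(Z.T @ (Z - P_X @ Z))
    return sigma2_hat, numerator / denominator

# Simulated data (assumed sizes): M = 200 groups of n = 6, p = 3 fixed effects, q = 1.
rng = np.random.default_rng(3)
M, n, p = 200, 6, 3
N = M * n
X = rng.standard_normal((N, p))
Z = np.kron(np.eye(M), np.ones((n, 1)))      # each column has squared norm n
sigma2_true, tau2_true = 0.25, 1.0
v = rng.normal(0.0, np.sqrt(tau2_true), size=M)
eps = rng.normal(0.0, np.sqrt(sigma2_true), size=N)
y = X @ np.array([1.0, -1.0, 0.5]) + Z @ v + eps

sigma2_hat, tau2_hat = henderson_method_3(y, X, Z)
```

With the number of groups M growing, both estimates should concentrate around the true values, in line with the unbiasedness noted above.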

Returning to our two-step procedure and high-dimensional setup, Step 1 identifies a candidate low-dimensional sub-model, which is used in Step 2 to obtain variance component estimates. We do not require the candidate model to encompass the truth; however, λL should be such that $\hat S$, from Step 1, reliably captures the indices of the ‘strong’ signals in β*. The idea is that missing ‘weak’ signals only negligibly affect the accuracy of $\hat\sigma^2$ and $\hat\tau^2$ in Step 2. We now show that this two-step procedure yields consistent estimators $\hat\sigma^2$ and $\hat\tau^2$ in the setting where N → ∞ (specifically, n is fixed, but the number of groups M → ∞) and $d^2 \log p / M = o(1)$, provided some additional technical assumptions hold. From here on, this will also be the scaling assumed for Theorem 4, as well as Corollary 5. We first present the assumptions necessary for consistency and then formally state the theorem.

For ξ > 1, define the cone

$$\mathcal{C}(\xi, S) = \{u \in \mathbb{R}^p : \|u_{S^c}\|_1 \le \xi\|u_S\|_1\}. \tag{3.25}$$

Assumption 1. For some constant ξ > 1,

ζinf{Σ^uuA:uC(ξ,S),ASp}1 (3.26)

with $\mathcal{C}_-(\xi, S) \equiv \{u : u \in \mathcal{C}(\xi, S),\ u_j\hat\Sigma_{j,\cdot}u \le 0 \ \forall j \notin S\}$ the sign-restricted version of (3.25).

The quantity ζ in (3.26) is defined more generally in Ye and Zhang (2010), where it is termed a sign-restricted cone invertibility factor (SCIF). We have the following lemma.

Lemma 6. Suppose Assumption 1 holds, and let λL be defined by (A.2) (or 3.28) for some small ε > 0 and ξ as in Assumption 1. If u* ≤ λL(ξ − 1)/(ξ + 1), then

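$$\|\hat\beta^L - \beta^*\|_\infty \le \frac{\lambda_L + u^*}{\zeta} \le \frac{2\xi\lambda_L}{(\xi + 1)\zeta}. \tag{3.27}$$

In the proof of Lemma 6 (provided in the Appendix), SCIF naturally appears when deriving an upper bound for $\|\hat\beta^L - \beta^*\|_\infty$. Lemma 6 assumes that Assumption 1 is satisfied, and that λL in Step 1 is chosen such that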

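$$\lambda_L = \frac{\xi + 1}{\xi - 1}\sqrt{\frac{2(\sigma^{*2} + \tau^{*2}qn)\big(\log p - \log(\varepsilon/2)\big)}{N}} \asymp \sqrt{\frac{\log p}{qM}} = o(1), \tag{3.28}$$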

with ξ as in Assumption 1. It then establishes that

$$\|\hat\beta^L - \beta^*\|_\infty \le \frac{2\xi\lambda_L}{\zeta(\xi + 1)} = o(1),$$

with probability exceeding 1 − ε, where ε > 0 can be taken arbitrarily small. A direct implication is that if the lemma’s conditions are satisfied, $S \setminus \hat S$ only includes indices corresponding to ‘weak’ signals in β* of magnitude less than $4\xi\lambda_L/(\zeta(\xi + 1)) = o(1)$ with close to certainty, which is part of what Step 1 sets out to achieve.

Assumption 2. There exists an integer N′d such that for the same constant ξ > 1 as in Assumption 1,

$$\frac{d\,\xi^2}{\psi^2(\xi, S)} < \frac{N'}{\psi_+(N', S)}, \tag{3.29}$$

where

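$$\psi(\xi, S) = \min\Big\{\frac{d^{1/2}\|Xu\|_2}{N^{1/2}\|u_S\|_1} : u \in \mathcal{C}(\xi, S),\ u \neq 0\Big\} \tag{3.30}$$

and $\psi_+(N', S) = \max_{A \cap S = \emptyset,\ |A| \le N'} \nu_{\min}(X_A^TX_A/N)$ is the sparse upper eigenvalue of models disjoint with S.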

Assumption 2 is needed to control the number of false positive selections in S^ from Step 1. In particular, we have

Lemma 7. Suppose that Assumption 2 holds, and λL is defined according to (3.28). In the event that u* ≤ λL(ξ − 1)/(ξ + 1), $|\hat S \setminus S| < N'$.

Put simply, Lemma 7 claims that under Assumption 2 and our choice of λL from (3.28), the total number of false selections in Step 1 is bounded by N′, with probability exceeding 1 – ε. The proof is provided in the Appendix.

Assumption 3. Let $\check X$ be formed by joining any N′ columns in X with $\beta^*_j = 0$ to the d support columns in X. For the same N′ as in Assumption 2,

$$\mathrm{rank}\big([I_{N \times N} - P_{\check X}]Z\big) = \mathrm{rank}(Z) = qM, \tag{3.31}$$
$$Z^T[I_{N \times N} - P_{\check X}]Z \succeq I_{qM \times qM}, \tag{3.32}$$

and the qM singular values of $[I_{N \times N} - P_{\check X}]Z$, denoted $s_1,\dots,s_{qM}$, satisfy

{i:si0}(i=1qMsi2)2=o(1). (3.33)

By (3.31) in Assumption 3, the fixed data matrix Z has full column rank, and no column vector of Z can be represented as a linear combination of the column vectors of any ‘feasible’ $X_{\hat S}$ assuming that λL is chosen according to (3.28). After all, N′ + d is the upper bound on the number of selected fixed effects with probability exceeding 1 − ε (Lemma 7). Additionally, by (3.32), the sum of the squared perpendicular distances between each column vector in Z and its projection onto the linear subspace spanned by the column vectors of feasible $X_{\hat S}$ matrices is at least on the order of qM (substantial, given there are qM columns in Z). The latter half of Assumption 3 requires that all columns of $(I_{N \times N} - P_{\check X})Z$ are ‘close’ to being linearly independent from one another and ‘contribute equally’ to its rank. In particular, note that (3.33) is satisfied if

$$c_1 < \frac{s_j}{s_k} < c_2 \ \text{ for } j \neq k \text{ and some constants } c_1, c_2 > 0. \tag{3.34}$$

It is thus clear that (3.31) and (3.32) imply that random effects must not be confounded with any ‘feasible’ set of fixed effects (from Step 1) while (3.33) implies that the random effects are not confounded with one another. Analogous conditions were shown to be necessary to prove consistency of REML estimators in Jiang (1996).

Assumption 4. For any jS such that βj<4ξλLζ(ξ+1), with λL defined as in (3.28), ΓX~xj1. Here, ΓX~DX~ΓX~T is the eigen-decomposition of X~(X~TX~)X~T (defined for this Assumption) with X~=[XˇZ], where Xˇ is formed by joining any N′ columns in X with βj=0 to the d – 1 support (excluding j) columns in X. The N′ referenced here is the same as in Assumptions 2 and 3.

Assumption 4 requires that covariates corresponding to weak (but non-zero) signals in β* (for which we cannot quantify a bound on the probability that they are included in $X_{\hat S}$) are not too strongly correlated with covariates in $X_{\hat S}$ nor with covariates associated with the random effects. This somewhat resembles the irrepresentability conditions needed for model selection consistency in the Lasso; see, e.g., Zhao and Yu (2006). However, the two assumptions are very different: aside from differences in the quantities involved, a key difference is that the irrepresentability condition requires a very stringent upper bound on non-confounding between fixed effects, whereas Assumption 4 only requires boundedness. As shown in the numerical experiments in the Appendix, as the number of covariates and sparsity of the model vary, Assumption 4 is very likely to be satisfied with even small bounds, whereas the irrepresentability condition is increasingly less likely to hold.

We can now state our main result on consistency of variance component estimators, which validates our two-step procedure.

Theorem 8. Consider N, p → ∞ with n fixed, M → ∞. Furthermore, suppose p → ∞ with $d^2 q \log p / M = o(1)$. Suppose Assumptions 1-4 are satisfied and λL is chosen according to (3.28) with ε ∝ 1/p. Then, $\hat\sigma^2$ and $\hat\tau^2$ are consistent for σ*2 and τ*2, respectively, i.e.,

$$|\hat\sigma^2 - \sigma^{*2}| = o_P(1) \quad \text{and} \quad |\hat\tau^2 - \tau^{*2}| = o_P(1) \quad (N, p \to \infty). \tag{3.35}$$

Because $|\hat\sigma^2 - \sigma^{*2}|$ and $|\hat\tau^2 - \tau^{*2}|$ are both oP(1), we can use $\hat\sigma^2$ and $\hat\tau^2$ as plug-in values for σ*2 and τ*2, respectively. From there, we can form a consistent estimator of Ω* and normalizing constants κj.

For practical applications, REML can be used as a substitute for Henderson’s Method III for Step 2. Theory for REML would be a possible avenue for further explorations.

3.3. An initial estimator for β* and our choice of Cj

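To form $\hat\beta^{\text{init}}$, we consider the ordinary least-squares (OLS) fit restricted to $\hat S$, i.e.,

$$\hat\beta^{\text{init}} = \arg\min_{\beta \in \mathbb{R}^p:\ \beta_{\hat S^c} = 0} \|y - X\beta\|_2^2. \tag{3.36}$$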
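The restricted fit in (3.36) is an ordinary least-squares regression on the selected columns, with all other coefficients set to zero. A minimal sketch, assuming NumPy, is given below; the function name `restricted_ols` and the hand-picked support are illustrative, since in the procedure above the support would come from the Lasso in Step 1.

```python
import numpy as np

def restricted_ols(X, y, support):
    """OLS fit using only the columns in `support`; all other coefficients are zero."""
    beta = np.zeros(X.shape[1])
    support = np.asarray(support, dtype=int)
    if support.size > 0:
        coef, *_ = np.linalg.lstsq(X[:, support], y, rcond=None)
        beta[support] = coef
    return beta

# Noise-free toy example (assumed sizes) to make the recovery easy to verify.
rng = np.random.default_rng(4)
N, p = 30, 100
X = rng.standard_normal((N, p))
support = [0, 5, 7]                      # hypothetical selected set
beta_true = np.zeros(p)
beta_true[support] = [1.0, -2.0, 0.5]
y = X @ beta_true

beta_init = restricted_ols(X, y, support)
```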

We proceed to demonstrate that the error $\hat\beta^{\text{init}} - \beta^*$ is o(1) in $\ell_1$ norm.

Assumption 5. For the same N′ as in Assumptions 2, 3, 4, the sparse lower eigenvalue for models containing S of cardinality smaller than d + N′ is constant and greater than 0,

ψ(N,S)=minAS,ASNνmin(XATXAN)1,

Assumption 5, in conjunction with previous assumptions and the choice of λL in (3.28), can be used to control the $\ell_1$ norm of the estimation error $\hat\beta^{\text{init}} - \beta^*$.

Theorem 9. Suppose Assumptions 1-5 hold. Under the same conditions as in Theorem 8, for some universal constant C > 0,

$$\|\hat\beta^{\text{init}} - \beta^*\|_1 \le C d\sqrt{\frac{q\log p}{M}} \tag{3.37}$$

with probability converging to 1 as N, p → ∞.

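Theorem 9 implies that we have the following crude bound, based on Hölder’s inequality,

$$\kappa_j\Big|\sum_{k \neq j}(P_{X^T})_{jk}\big(\hat\beta^{\text{init}}_k - \beta^*_k\big)\Big| \le \kappa_j\max_{k \neq j}|(P_{X^T})_{jk}|\,\|\hat\beta^{\text{init}} - \beta^*\|_1 \le \kappa_j\max_{k \neq j}|(P_{X^T})_{jk}|\,C d\lambda_L. \tag{3.38}$$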

The following corollary is a direct consequence of the crude bound (3.38).

Corollary 10. Suppose the conditions in Theorem 9 are satisfied, and that d, the sparsity of β*, satisfies $d \le C^{-1}(M/(q\log p))^{\eta}$, with C as in Theorem 9 and η ∈ (0, 1/2). Then,

$$C_j = \max_{k \neq j}\kappa_j|(P_{X^T})_{jk}|\Big(\frac{q\log p}{M}\Big)^{1/2 - \eta} \tag{3.39}$$

satisfies condition (3.13) in Theorem 4.

4. Numerical experiments

4.1. A practical choice for λL

In practical applications, we run into the issue of not being able to set λL according to (3.28), as it involves knowing τ* and σ*. However, we can derive a (slightly ad-hoc) approximation of what λL should be. Upon closer examination of the proof of Lemma 11, we can substitute the term σ*2 + τ*2qn with νmax(V(σ*, τ*)) = σ*2 + τ*2νmax(ZTZ). The latter can be approximated according to the following procedure, assuming that the ratio τ*/σ* is not too small:

  1. Apply scaled lasso (Sun and Zhang, 2012) to obtain an initial ‘average’ noise estimate. The solution to the scaled lasso problem is characterized by
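    $$(\hat\beta^{\text{scaled}}, \hat\sigma^{\text{scaled}}) \in \arg\min_{\beta, \sigma} \frac{\|y - X\beta\|_2^2}{2N\sigma} + \frac{\sigma}{2} + \lambda_{\text{univ}}\|\beta\|_1 \tag{4.1}$$

    with $\lambda_{\text{univ}} = \sqrt{2\log p / N}$.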

  2. Take λL=σ^scaledλunivρZ with

ρZ=νmax(ZTZ)tr(ZTZ)N. (4.2)

We provide a heuristic justification. Ignoring the finer details involved in the theory, for the scaled lasso, $(\hat\sigma^{\text{scaled}})^2$ serves as a good approximation for $\|\epsilon^*\|_2^2/N$, where we have defined $\epsilon^* = y - X\beta^*$. In linear models, ϵ* holds i.i.d. observations drawn from a N(0, σ*2) distribution. By the law of large numbers, $\|\epsilon^*\|_2^2/N$ converges to σ*2 for large N. Under a heteroskedastic error model, with the entries of ϵ* independent and $\epsilon^*_i \sim N(0, \sigma_i^2)$, we can match $\|\epsilon^*\|_2^2/N$ to its expectation, which is given by $\sum_{i=1}^{N}\sigma_i^2/N$, so $(\hat\sigma^{\text{scaled}})^2$ can be used to approximate the ‘average’ noise level. If ϵ* ~ N (0, V(σ*, τ*)), then using a similar expectation matching argument, we can expect $(\hat\sigma^{\text{scaled}})^2$ to act as a surrogate for

$$\sigma^{*2} + \tau^{*2}\,\mathrm{tr}(ZZ^T)/N, \tag{4.3}$$

which follows from the fact that ∥Γϵ*∥2 = ∥ϵ*∥2 for any N × N orthogonal matrix Γ (overloading Γ from (3.5)). What we actually need is σ*2 + τ*2νmax(ZTZ). Then in the scenario where ratio τ*2/σ*2 is not too small, ρZ from (4.2) should give us a choice of λL that is close to the desired one from (3.28). Our choice of λL is constructed according to the above procedure for all subsequent numerical experiments.

4.2. A look into p-values

Denote the ‘unblocked’ version of Z as Zu; i.e., Zu is an N × q matrix formed by row-wise concatenating the M diagonal blocks in Z. We generate data from model (2.1) according to the following schemes, setting M = 25 and n = 6:

  • (M1) For p ∈ {300, 600}, q ∈ {1, 2}, we construct [X Zu] from N i.i.d. realizations from a $N(0, \Phi^*)$ distribution with $\Phi^* = \{\phi^*_{jk}\}$ a (p + q) × (p + q) matrix with $\phi^*_{jk} = 0.2^{|j - k|}$. X and Z (the ‘blocked’ version) are then normalized such that $\|x_j\|_2^2 = N$ and $\|z_j\|_2^2 = n$ for all j. For b ∈ {0.5, 1}, we set the p-dimensional vector of fixed regression coefficients to
    $$\beta^* = [b,\dots,b,0,\dots,0],$$

    where the first d ∈ {5, 10} entries of β* are nonzero. The variance parameters σ* and τ* are set to 0.5 and 1 respectively.

  • (M2) Same as (M1) except with Φ* = I(p+q)×(p+q).
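The two simulation schemes can be sketched as follows, assuming NumPy. The function names `simulate_design` and `simulate_response` and the seed are illustrative assumptions; setting `rho = 0.2` corresponds to (M1) and `rho = 0.0` to (M2).

```python
import numpy as np

def simulate_design(M, n, p, q, rho, rng):
    """Draw [X Zu] with covariance rho^|j-k|, then block and normalize as described above."""
    N = M * n
    idx = np.arange(p + q)
    Phi = rho ** np.abs(idx[:, None] - idx[None, :])   # rho = 0.0 gives the identity
    XZ = rng.standard_normal((N, p + q)) @ np.linalg.cholesky(Phi).T
    X, Zu = XZ[:, :p], XZ[:, p:]
    X = X * np.sqrt(N) / np.linalg.norm(X, axis=0)     # ||x_j||_2^2 = N
    Z = np.zeros((N, q * M))
    for m in range(M):
        block = Zu[m * n:(m + 1) * n]
        block = block * np.sqrt(n) / np.linalg.norm(block, axis=0)  # ||z_j||_2^2 = n
        Z[m * n:(m + 1) * n, m * q:(m + 1) * q] = block
    return X, Z

def simulate_response(X, Z, b, d, sigma, tau, rng):
    """y = X beta + Z v + eps with the first d coefficients equal to b."""
    beta = np.zeros(X.shape[1])
    beta[:d] = b
    v = rng.normal(0.0, tau, size=Z.shape[1])
    eps = rng.normal(0.0, sigma, size=X.shape[0])
    return X @ beta + Z @ v + eps, beta

rng = np.random.default_rng(5)
M, n, p, q = 25, 6, 300, 1
N = M * n
X, Z = simulate_design(M, n, p, q, rho=0.2, rng=rng)   # scheme (M1)
y, beta = simulate_response(X, Z, b=1.0, d=5, sigma=0.5, tau=1.0, rng=rng)
```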

The numerical experiments are set up similarly to those in Bühlmann (2013) and Schelldorfer et al. (2011). We set the ridge penalty parameter λ to 1/N for all experiments. Additionally, we set Cj according to Corollary 10 with η = 0.005.

We first consider null hypotheses of the form

$$H_{0,j}: \beta^*_j = 0. \tag{4.4}$$

We consider decision rules based on a significance level α = 0.05, i.e., we reject H0,j if the event Ej = {ϱj ≤ 0.05} occurs, where ϱj is as defined in (3.16). Following Bühlmann (2013), we evaluate the performance of the tests based on the type I error, averaged over the non-support indices,

$$\text{Avg. type I error} = (p - d)^{-1}\sum_{j \in S^c}\hat P(E_j), \tag{4.5}$$

and the power, averaged over the support indices,

$$\text{Avg. power} = d^{-1}\sum_{j \in S}\hat P(E_j), \tag{4.6}$$

where P^ denotes the empirical probability over 1000 simulations. The results, presented in Figure 1, suggest that type I error is well-controlled for all combinations of p, q, b and d for the two different models. Power is high in most scenarios, but appears to vary with the aforementioned quantities, noticeably decreasing with b. However, this is to be expected.
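The summaries (4.5) and (4.6) are averages of empirical rejection rates over the non-support and support indices. A minimal sketch, assuming NumPy and a matrix of p-values with one row per simulation, is shown below; the function name and the toy p-values are made up for illustration and are not results from the paper.

```python
import numpy as np

def average_error_rates(pvals, support, alpha=0.05):
    """Return (avg. type I error, avg. power) from an (n_sim, p) array of p-values."""
    reject_rate = (pvals <= alpha).mean(axis=0)      # empirical P(E_j) for each j
    in_support = np.zeros(pvals.shape[1], dtype=bool)
    in_support[np.asarray(support, dtype=int)] = True
    return reject_rate[~in_support].mean(), reject_rate[in_support].mean()

# Toy p-values: 4 simulations, 3 coefficients, with support S = {0}.
pvals = np.array([
    [0.01, 0.50, 0.03],
    [0.02, 0.70, 0.40],
    [0.20, 0.90, 0.60],
    [0.04, 0.30, 0.80],
])
type1, power = average_error_rates(pvals, support=[0])
```

In this toy input, coefficient 0 is rejected in 3 of 4 runs, and the two null coefficients are rejected in 0 and 1 of 4 runs, giving an average type I error of 0.125 and an average power of 0.75.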

Figure 1:

Average power vs. average type I error for testing groups of coefficients under the two models for different combinations of p, q, b and d.

We also consider null hypotheses of the form

$$H_{0,G}: \beta^*_j = 0 \ \text{ for all } j \in G, \tag{4.7}$$

with G taken either to be {1,…,100} (G1), or {101,…,200} (G2). By construction, G1 contains the support of β*, so the hypothesis $H_{0,G_1}$ is false and should be rejected, while $H_{0,G_2}$ is true and should not be rejected. We consider decision rules based on a significance level α = 0.05 and reject $H_{0,G}$ if the event $E_G = \{\varrho_G \le 0.05\}$ occurs, with $\varrho_G$ defined in (3.17). To evaluate the performance of these tests, we consider type I error and power, which can be represented by $\hat P(E_{G_2})$ and $\hat P(E_{G_1})$, respectively, where again, $\hat P$ denotes the empirical probability over 1000 simulations. Figure 2 visualizes the results.

Figure 2:

Average power vs. average type I error for testing groups of coefficients under the two models for different combinations of p, q, b and d.

4.3. Comparisons with existing methods

In this section, we conduct a short numerical example to examine whether one could ‘naively’ apply inferential procedures for high-dimensional linear models to obtain inference for parameters in mixed models.

Consider Model (M1) from Section 4.2 in the instance of p = 300 and q = 1. Let β* = [0.05, 2, 4, 3, 0.1, 0,…, 0]. We compare our method against

  1. ridge-based inference procedure of Bühlmann (2013), which is an analogue of our method developed for high-dimensional linear models;

  2. lasso-based inference procedure of van de Geer et al. (2014), which entails de-sparsifying a lasso estimator.

The differences are fairly evident when comparing confidence interval coverage. For any α ∈ (0, 1), define Qα[Wj] as the α-th quantile of the distribution of Wj. Under the conditions of Theorem 4, if the assumed model is correct, (3.11) suggests that confidence intervals of the form

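$$\left[\frac{\hat\beta_j^{\mathrm{corr}}}{(P_{X^T})_{jj}} - \frac{Q_{1-\alpha/2}[W_j] + C_j}{(P_{X^T})_{jj}},\ \ \frac{\hat\beta_j^{\mathrm{corr}}}{(P_{X^T})_{jj}} + \frac{Q_{1-\alpha/2}[W_j] + C_j}{(P_{X^T})_{jj}}\right]$$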

should guarantee coverage of at least 100(1 − α)%. Rather than setting Cj according to Corollary 10, we set them to be the same as the ‘Cj-analogues’ from Bühlmann (2013), to make the two methods comparable. Our choices of Cj are larger than theirs, so if anything, this ad hoc decision gives the method of Bühlmann (2013) an unfair advantage. In Figure 3, we examine 95% confidence interval coverage for the three methods, based on the above modifications.

Figure 3:

Confidence interval coverage for βj, j = 1,…,p; target coverage is 95% (with 1000 simulations, the standard deviation is ~0.69%). Here, lmm refers to our method; ridge to the method of Bühlmann (2013); and lasso to the method of van de Geer et al. (2014).

Overall, our method, which accounts for random effects, performs best at attaining the target guaranteed coverage across all βj, compared to the methods proposed in Bühlmann (2013) and van de Geer et al. (2014). While the method of Bühlmann (2013) does come close, its coverage falls short at 16 indices: the minimum coverage achieved was 92.9% (with 1000 simulations, this is a statistically significant difference from 95%). At first glance it appears that the lasso-based method of van de Geer et al. (2014) performs quite well; however, a closer examination of the results reveals otherwise. Specifically, the lasso-based method does very poorly over some of the active coefficients, as made evident in Table 1.

Table 1:

Confidence interval coverage for signals βj, j = 1,…, 5; target coverage is 95%.

| Coefficient | Our method | Bühlmann (2013) | van de Geer et al. (2014) |
|---|---|---|---|
| β1 | 0.977 | 0.974 | 0.994 |
| β2 | 0.973 | 0.963 | 0.865 |
| β3 | 0.969 | 0.971 | 0.782 |
| β4 | 0.971 | 0.972 | 0.886 |
| β5 | 0.983 | 0.993 | 1.000 |
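The Monte Carlo standard deviation of about 0.69% quoted for the coverage estimates, and the claim that a coverage of 92.9% differs significantly from 95%, can be checked with a short calculation. This sketch assumes that each simulation contributes an independent Bernoulli indicator of coverage and uses a normal approximation; the function name is illustrative.

```python
import math

def coverage_mc_sd(target, n_sim):
    """Binomial standard deviation of an empirical coverage proportion."""
    return math.sqrt(target * (1.0 - target) / n_sim)

sd = coverage_mc_sd(0.95, 1000)   # Monte Carlo sd under exact 95% coverage
z = (0.929 - 0.95) / sd           # standardized gap for the 92.9% minimum
print(f"sd = {sd:.4%}, z = {z:.2f}")
```

A standardized gap of roughly three standard deviations below the target is what supports calling the shortfall statistically significant.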

5. An application to riboflavin production data

In this section, we apply our proposed methodology to data on riboflavin (vitamin B2) production by Bacillus subtilis. The data is made publicly available by Bühlmann et al. (2014); the original data was provided by DSM (Switzerland). The dataset, referenced as riboflavinGrouped, has M = 28 specimens measured at two to six time points, resulting in N = 111 observations in total. For each specimen at each time point, we record a single real valued response variable, the log-transformed riboflavin production rate, as well as the expression levels of p = 4088 genes. We are interested in identifying which gene is significantly correlated with riboflavin production.

To account for correlations induced by repeated measurements, a natural model to consider is the random intercept model, in which we assume that

ym=Xmβ+vm+ϵm, (5.1)

with vm, m = 1,…, M i.i.d. with vm ~ N(0, τ*2), and ϵm, m = 1,…, M, independent with ϵm ~ N(0,σ*2Inm×nm), and generated independently of v1,…, vM. Note that (5.1) can be represented by (2.1) with the Zm’s taken to be column vectors of 1s of lengths nm. Most of the theoretical results assume the nm’s are equal, but it is straightforward to show the results hold so long as the nm are of the same order of magnitude, as they are here.
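To make the correlation structure implied by (5.1) concrete, the following sketch builds the random-intercept design matrix for unequal group sizes and the resulting marginal covariance of y, namely σ*2 I + τ*2 ZZᵀ. The function names are hypothetical, and the example uses made-up group sizes rather than the riboflavin data.

```python
import numpy as np

def random_intercept_design(group_sizes):
    """N x M indicator matrix: column m equals 1 on the rows of group m."""
    sizes = np.asarray(group_sizes, dtype=int)
    labels = np.repeat(np.arange(sizes.size), sizes)
    Z = np.zeros((sizes.sum(), sizes.size))
    Z[np.arange(labels.size), labels] = 1.0
    return Z

def marginal_covariance(group_sizes, sigma2, tau2):
    """Cov(y) = sigma2 * I + tau2 * Z Z^T under the random intercept model."""
    Z = random_intercept_design(group_sizes)
    return sigma2 * np.eye(Z.shape[0]) + tau2 * (Z @ Z.T)
```

Observations in the same group share covariance τ*2, while observations from different groups are uncorrelated; this is the dependence that methods for independent errors ignore.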

We apply our proposed framework and compute the marginal p-values for testing βj=0. Controlling the family-wise error rate (FWER) at 5%, via a simple Bonferroni correction, we find a single significant gene in riboflavin production: YXLD-at. This result matches previous findings by Javanmard and Montanari (2014) and Meinshausen et al. (2009) using a homogeneous dataset with N = 71 samples provided by the same source (riboflavin in Bühlmann et al. (2014)). Like us, Meinshausen et al. (2009) make a single discovery, YXLD-at, while Javanmard and Montanari (2014) also label YXLE-at as significant. The method of Bühlmann (2013), on the other hand, makes no discoveries.
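The Bonferroni step is simple to sketch: with p = 4088 tests and a 5% FWER target, each marginal p-value is compared with 0.05/4088, roughly 1.2 × 10⁻⁵. The function name and the synthetic p-values below are illustrative only and are not results from the riboflavin analysis.

```python
import numpy as np

def bonferroni_reject(pvals, alpha=0.05):
    """Return the rejection mask and per-test threshold alpha / (number of tests)."""
    pvals = np.asarray(pvals, dtype=float)
    threshold = alpha / pvals.size
    return pvals <= threshold, threshold

# Synthetic illustration: 4088 p-values, exactly one of which is very small
pvals = np.full(4088, 0.5)
pvals[10] = 1e-6
rejected, threshold = bonferroni_reject(pvals)
```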

6. Discussion

We presented a new framework for constructing asymptotically valid p-values and confidence intervals for the fixed effects in high-dimensional linear mixed effect models. It entails de-biasing a ‘naive’ ridge estimator, whose asymptotic distribution we can approximate sufficiently well if the number of independent groups of observations M scales at least with $d^2 q \log p$. Simulation studies in high-dimensional settings suggest that our method provides good control of type I error. It also provides good results for a riboflavin dataset with group structure, where we confirmed results obtained in earlier work based on a homogeneous dataset from the same source (Javanmard and Montanari, 2014; Meinshausen et al., 2009).

Several extensions to our methodology would be of interest for future work. First, our proposal for selecting the tuning parameter λL relies on the assumption that τ*2/σ*2 is not too small. Although it appears to work well in practice, one could also consider an iterative scheme that repeatedly updates λL based on the resultant estimates of σ*2 and τ*2: this can be readily implemented in practice but may be difficult to validate theoretically. Second, here we required the number of random effects q to be quite small (treated as constant in the theory). This assumption can be relaxed by, e.g., taking Ψ* to be a general diagonal matrix, i.e., $\Psi^* = \mathrm{diag}(\tau_1^2, \dots, \tau_q^2)$, and assuming that a small number of the $\tau_j^2$ are nonzero, i.e., the cardinality of $T \equiv \{j : \tau_j^2 \neq 0\}$ is small, less than n. Then, instead of screening for fixed effects in Step 1, we can screen for both fixed and random effects by incorporating a double penalization scheme as in Li et al. (2018). This way, in Step 2, both $\hat S$ and $\hat T$ are small, and we can apply Henderson’s method III as before.

A few other details should also be discussed for completeness. First, multiple testing can be handled using the Westfall-Young procedure of Bühlmann (2013). This multiple testing adjustment, which strongly controls the family-wise error rate, can directly be used in conjunction with our method for generating p-values for the individual hypothesis tests. Second, the ridge-based framework of Bühlmann (2013), which is a basis for our method, is known to not have optimal power. Bühlmann (2013) shows that the detection rate may be larger than $N^{-1/2}$, whereas, under certain conditions, the detection limit for the de-biased lasso approach of Zhang and Zhang (2014) is in the $N^{-1/2}$ range. A possible extension of our work is to build a lasso-based inferential framework for high-dimensional linear mixed effect models. In fact, as suggested in the Introduction, our methods can be adapted to other high-dimensional estimators; ridge is just one example. From van de Geer et al. (2014), we can obtain asymptotically optimal inference for linear fixed effect models, i.e., for y = Xβ* + ϵ with N observations and ϵi i.i.d. N(0, σ*2), by leveraging the fact that the Lasso estimator with non-negative penalty parameter λ, $\hat\beta(\lambda)$, can be rewritten as

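$$\hat\beta(\lambda) - \beta^* + \lambda\hat\Theta\hat\iota = \frac{\hat\Theta X^T \epsilon}{N} - \frac{\Delta}{\sqrt{N}}, \quad \text{where } \Delta \equiv \sqrt{N}\,(\hat\Theta\hat\Sigma - I_{p\times p})(\hat\beta(\lambda) - \beta^*),$$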

by inverting the KKT conditions, with $\hat\iota$ arising from the subdifferential of $\|\beta\|_1$. Taking $\hat\Theta$ to be a reasonably good approximation of an inverse of $\hat\Sigma$, the Δ term becomes asymptotically negligible, and we can use the normality of ϵ to develop asymptotically valid tests and confidence intervals for β*. (The scaled lasso furnishes a consistent estimator of σ*2.) Extending this approach to the linear mixed-effect setup (per Section 2) requires meeting the challenge that the ϵi are no longer i.i.d., which could be addressed using the methods of Section 3.2.
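For readers who want the intermediate steps, the inversion of the KKT conditions can be sketched as follows. The normalization of the Lasso objective as $\tfrac{1}{2N}\|y - X\beta\|_2^2 + \lambda\|\beta\|_1$ is an assumption made here for concreteness; with a different scaling, the constant multiplying λ changes accordingly.

```latex
\begin{align*}
0 &= -\tfrac{1}{N} X^T\bigl(y - X\hat\beta(\lambda)\bigr) + \lambda\hat\iota
  && \text{(KKT conditions)} \\
\hat\Sigma\bigl(\hat\beta(\lambda) - \beta^*\bigr) + \lambda\hat\iota
  &= \tfrac{1}{N} X^T \epsilon
  && (y = X\beta^* + \epsilon,\ \hat\Sigma = X^T X / N) \\
\hat\beta(\lambda) - \beta^* + \lambda\hat\Theta\hat\iota
  &= \tfrac{1}{N}\hat\Theta X^T \epsilon
     - (\hat\Theta\hat\Sigma - I_{p\times p})\bigl(\hat\beta(\lambda) - \beta^*\bigr)
  && \text{(multiply by } \hat\Theta \text{ and rearrange)}
\end{align*}
```

The last term is the remainder that must be shown to be asymptotically negligible; it vanishes exactly when $\hat\Theta\hat\Sigma = I_{p\times p}$.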

A. Appendix

A.1. Proof of Results in Section 3

A.1.1. Proof of Lemma 1

It is straightforward to show that Ω* can be lower bounded as

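$$\Omega^* \succeq c\,(\hat\Sigma + \lambda I_{p\times p})^{-1}\,\hat\Sigma\,(\hat\Sigma + \lambda I_{p\times p})^{-1} \equiv \tilde\Omega,$$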

for some c satisfying 0 < c < νmin (V(σ*2, τ*2)). Since σ*2 is positive, νmin (V(σ*2, τ*2)) > 0. Note that Ω~ can alternatively be written as

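$$\tilde\Omega = \Gamma\,\mathrm{diag}\!\left(\frac{s_1^2}{(s_1^2 + \lambda)^2}, \dots, \frac{s_N^2}{(s_N^2 + \lambda)^2}\right)\Gamma^T,$$

which, in turn, implies that

$$\tilde\omega_{\min} = \min_{j \in \{1,\dots,p\}} \sum_{k=1}^{N} \frac{s_k^2}{(s_k^2 + \lambda)^2}\,\Gamma_{jk}^2,$$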

and the claim follows.
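The spectral representation used in this proof can be checked numerically. The sketch below compares the ridge "sandwich" matrix computed directly with the form Γ diag(s_k²/(s_k² + λ)²) Γᵀ. It assumes that the s_k are the singular values of X/√N and that the columns of Γ are the corresponding right singular vectors; the dimensions and the random design are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p, lam = 20, 50, 0.05
X = rng.standard_normal((N, p))
Sigma_hat = X.T @ X / N

# Thin SVD: X / sqrt(N) = U diag(s) Gamma^T, so Sigma_hat = Gamma diag(s^2) Gamma^T
_, s, Gamma_t = np.linalg.svd(X / np.sqrt(N), full_matrices=False)
Gamma = Gamma_t.T

# Direct computation of (Sigma_hat + lam I)^{-1} Sigma_hat (Sigma_hat + lam I)^{-1}
A = np.linalg.inv(Sigma_hat + lam * np.eye(p))
omega_direct = A @ Sigma_hat @ A

# Spectral form and the resulting smallest diagonal entry
weights = s**2 / (s**2 + lam) ** 2
omega_svd = Gamma @ np.diag(weights) @ Gamma.T
omega_min = np.min((Gamma**2) @ weights)
```

The two matrices agree up to floating-point error, and `omega_min` equals the smallest diagonal entry of either one.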

A.1.2. Proof of Proposition 2

This was proven in Shao and Deng (2012) (see proof of their Theorem 1). Define Γ = [Γ′ (Γ)]; Γ′ is orthogonal, i.e., Γ′TΓ′ = Γ′Γ′T = Ip×p . By definition (3.2), we have

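$$E[\hat\beta] - \theta = \frac{1}{N}(\hat\Sigma + \lambda I_{p\times p})^{-1} X^T X\theta - \theta = -(\lambda^{-1}N^{-1}X^TX + I_{p\times p})^{-1}\theta = -\Gamma'(\lambda^{-1}N^{-1}\Gamma'^T X^T X \Gamma' + I_{p\times p})^{-1}\Gamma'^T\Gamma\Gamma^T\theta = -\Gamma(\lambda^{-1}N^{-1}D^2 + I_{R\times R})^{-1}\Gamma^T\theta.$$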

Observing that the diagonal entries to D are positive, one obtains

(λ1N1D2+IR×R)1¯λ1νmin,+(Σ^)1+λ1νmin,+(Σ^)IR×R, (A.1)

Combining this with the fact that ΓTΓ = IR×R, we obtain

$$\max_{j \in \{1,\dots,p\}} \bigl|E[\hat\beta_j] - \theta_j\bigr| \le \lambda\,\|\theta\|_2\,\nu_{\min,+}(\hat\Sigma)^{-1},$$

as desired. The bound on the variance follows directly from (3.3).
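As a numerical sanity check of the bias bound, the sketch below computes the exact bias of the ridge estimator relative to θ and compares it with λ‖θ‖₂/ν_min,+(Σ̂). It assumes that θ is the projection of β onto the row space of X (as in Bühlmann (2013)) and that E[y] = Xβ; the function name is hypothetical.

```python
import numpy as np

def ridge_bias_and_bound(X, beta, lam):
    """Return (max_j |E[beta_hat_j] - theta_j|, lam * ||theta||_2 / nu_min_plus)."""
    N, p = X.shape
    Sigma_hat = X.T @ X / N
    _, s, Gamma_t = np.linalg.svd(X / np.sqrt(N), full_matrices=False)
    keep = s > 1e-10 * s.max()                 # directions with nonzero singular value
    Gamma, s2 = Gamma_t[keep].T, s[keep] ** 2  # s2 holds the nonzero eigenvalues of Sigma_hat
    theta = Gamma @ (Gamma.T @ beta)           # projection of beta onto the row space of X
    # E[beta_hat] = (Sigma_hat + lam I)^{-1} Sigma_hat theta when E[y] = X beta
    expected = np.linalg.solve(Sigma_hat + lam * np.eye(p), Sigma_hat @ theta)
    max_bias = np.max(np.abs(expected - theta))
    bound = lam * np.linalg.norm(theta) / s2.min()
    return max_bias, bound
```

For example, with a 20 × 50 Gaussian design and λ = 1/N, the returned bias should not exceed the returned bound.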

A.2. Proof of Theorems 4, 8 and 9

We first establish Theorem 4, which follows directly from Proposition 2.

A.2.1. Proof of Theorem 4

It follows from Proposition 2 that

maxjκjδj=maxjκjE[β^j]θjλθ2νmin,+(Σ^)1N12ωjj12λθ2νmin,+(Σ^)1N12ωmin12,

which, due to our choice of the ridge penalty parameter λ > 0 in (3.12), is o(1) as N, p → ∞. The claim now follows from (3.10) and the assumption given by (3.13).

Because there is an overlap in the lemmas used to prove Theorems 8 and 9, we present them together. Define $u^* = \|X^T(y - X\beta^*)\|_\infty / N$.

Lemma 11. Let
$$\lambda_L = \frac{(\xi + 1)}{(\xi - 1)}\sqrt{\frac{2(\sigma^{*2} + \tau^{*2}qn)\bigl(\log(p) - \log(\varepsilon/2)\bigr)}{N}}. \tag{A.2}$$

Under the model given by (2.3), the event u* ≤ λL(ξ – 1)/(ξ + 1) occurs with probability greater than 1 – ε.

Proof. Define $u_j = x_j^T(y - X\beta^*)/N$. Then $u^* = \max_j |u_j|$. Under model (2.3), we observe that

ujN(0,xjTV(σ2,τ2)xj)

It follows from the Gaussianity of uj (in fact, sub-Gaussianity would suffice) that

P[uj>λL(ξ1)(ξ+1)]2exp{λL2(ξ1)2(ξ+1)22xjTV(σ2,τ2)xj}2exp{λL2(ξ1)2(ξ+1)22Nvmax(V(σ2,τ2))}εp. (A.3)

The second inequality follows from the fact that the columns of X are standardized such that $\|x_j\|_2^2 = N$ for all j. For the third inequality, recall that the columns of Z are standardized such that $\|z_j\|_2^2 = n$ for all j, which implies that the largest eigenvalue of V(σ*2, τ*2), the true covariance of y, satisfies νmax(V(σ*2, τ*2)) ≤ σ*2 + τ*2qn. The third inequality in (A.3) is then obtained by plugging in our choice of λL (3.28). Employing a union bound, we then have

$$P\!\left[u^* \le \frac{\lambda_L(\xi - 1)}{(\xi + 1)}\right] \ge 1 - \sum_{j=1}^{p} P\!\left[|u_j| > \frac{\lambda_L(\xi - 1)}{(\xi + 1)}\right] \ge 1 - \varepsilon.$$

This is our desired result.
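The choice of λL in Lemma 11 can be evaluated directly. The sketch below assumes the reading of (A.2) in which the square root covers 2(σ*2 + τ*2qn)(log p − log(ε/2))/N, and it takes the variance components as known inputs; the function name and argument names are illustrative.

```python
import math

def lasso_penalty(N, p, q, n, sigma2, tau2, xi, eps):
    """Penalty level lambda_L of Lemma 11 (assumed reading of (A.2)).

    sigma2, tau2 : variance components; xi > 1 : cone constant;
    eps in (0, 1) : allowed failure probability of the event on u*.
    """
    if xi <= 1:
        raise ValueError("xi must exceed 1")
    if not 0 < eps < 1:
        raise ValueError("eps must lie in (0, 1)")
    nu_max_bound = sigma2 + tau2 * q * n   # upper bound on the largest eigenvalue of V
    scale = math.sqrt(2.0 * nu_max_bound * (math.log(p) - math.log(eps / 2.0)) / N)
    return (xi + 1.0) / (xi - 1.0) * scale
```

With this choice, a Gaussian tail bound with variance proxy (σ*2 + τ*2qn)/N evaluated at the threshold λL(ξ − 1)/(ξ + 1) equals ε/p, which is what makes the union bound over p coordinates sum to ε.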

A.2.2. Proof of Lemma 6

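We use arguments similar to those employed in the proof of Theorem 3 in Ye and Zhang (2010). Suppose that u* ≤ λL. Define $h = \hat\beta^L - \beta^*$. The Karush-Kuhn-Tucker (KKT) optimality conditions for the Lasso are given by

$$\begin{cases} \dfrac{x_j^T(y - X\hat\beta^L)}{N} = \lambda_L\,\mathrm{sign}(\hat\beta_j^L), & \hat\beta_j^L \neq 0, \\[2ex] \dfrac{x_j^T(y - X\hat\beta^L)}{N} \in \lambda_L[-1, +1], & \hat\beta_j^L = 0. \end{cases}$$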

With some rearrangement, the KKT conditions can be rewritten as

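$$\frac{X^T(y - X\beta^*)}{N} - \hat\Sigma h = \lambda_L\hat\iota \tag{A.4}$$

with $\hat\iota \in \mathbb{R}^p$ and $\hat\iota_j = \mathrm{sign}(\hat\beta_j^L)$ if $j \in \hat S$ and $\hat\iota_j \in [-1, +1]$ otherwise; that is, $\hat\iota$ is an element of the subdifferential of $\|\beta\|_1$. Rearranging (A.4) and observing that $\mathrm{sign}(\hat\beta_j^L) = \mathrm{sign}(h_j)$ for $j \notin S$ yields

$$h'^T\hat\Sigma h \le (u^* + \lambda_L)\|h'_S\|_1 + (u^* - \lambda_L)\|h'_{S^c}\|_1$$

for all vectors h′ with $\mathrm{sign}(h'_{S^c}) = \mathrm{sign}(h_{S^c})$. If we take h′ = h, one can see that $h \in C(\xi, S)$:

$$0 \le h^T\hat\Sigma h \le (u^* + \lambda_L)\|h_S\|_1 + (u^* - \lambda_L)\|h_{S^c}\|_1 \ \Longrightarrow\ \|h_{S^c}\|_1 \le \frac{(u^* + \lambda_L)}{(\lambda_L - u^*)}\|h_S\|_1 \le \xi\|h_S\|_1.$$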

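On the other hand, setting h′ to be any vector such that, for some $j \in S^c$, $h'_j = h_j$ and 0 elsewhere gives

$$h_j\hat\Sigma_{j,\cdot}\,h \le (u^* - \lambda_L)|h_j| \le 0,$$

which implies that $h \in C(\xi, S)$. The KKT conditions (A.4) also tell us that

$$\|\hat\Sigma h\|_\infty \le u^* + \lambda_L,$$

which, when combined with the definition of ζ (3.26), yields

$$\|h\|_\infty \le \frac{u^* + \lambda_L}{\zeta},$$

which is the desired result. In the event that u* ≤ λL(ξ – 1)/(ξ + 1), we have $u^* + \lambda_L \le 2\xi\lambda_L/(\xi + 1)$, and hence

$$\|h\|_\infty \le \frac{2\xi\lambda_L}{\zeta(\xi + 1)}.$$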

A.2.3. Proof of Lemma 7

The proof is adapted from that of Theorem 3 in Sun and Zhang (2012). By construction, β^L satisfies the KKT conditions from (A.4) which implies that

xjTX(β^Lβ)N=xjT(yXβ^Lϵ)NxjT(yXβ^Lϵ)NxjTϵNλLu.

For AS^S, such that AN, the previous inequality implies

(λLu)2AjAxjTX(β^Lβ)2N2=jA(Xh)TxjxjT(Xh)N2κ+(N,S)Xh22N. (A.5)

Going back to the KKT conditions in (A.4), we have, for arbitrary hRp,

(Xβ^LXh)TXhNλL(h1β^L1)+uhβ^L1,

which, when combined with the fact that

2(Xβ^LXh)TXh=Xβ^LXh22+Xh22XβXh22

gives the inequality

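$$\frac{\|Xh\|_2^2}{N} \le \lambda_L\bigl(\|\beta^*\|_1 - \|\hat\beta^L\|_1\bigr) + u^*\|h\|_1 \le (\lambda_L + u^*)\|h_S\|_1. \tag{A.6}$$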

Thus, h lies in the cone in (3.25) in the event that u* ≤ λL(ξ – 1)/(ξ + 1) (by noting that the left-hand side is lower bounded by 0). By definition of κ(ξ, S) from (3.30),

$$\frac{\|Xh\|_2^2}{N} \le \frac{(\lambda_L + u^*)^2\,d}{\kappa^2(\xi, S)},$$

which, when combined with (A.5) implies

Aκ+(N,S)ξ2dκ2(ξ,S)<N,

by Assumption 2.

A.2.4. Proof of Theorem 8

Suppose that u* ≤ λL(ξ − 1)/(ξ + 1). Then by Lemmas 6 and 7 and the referenced assumptions within, we have

β^Lβ2ξλL(ξ+1)ζβj4ξλL(ξ+1)ζfor alljSS^, (A.7)
S^SNS^N+dd. (A.8)

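Denote $\hat X = [X_{\hat S}\ \ Z]$. Under candidate model (3.18), our variance component estimators (via Henderson’s Method III) are given by

$$\hat\sigma^2 = \frac{y^T(I_{N\times N} - P_{\hat X})\,y}{N - \mathrm{rank}(\hat X)}, \qquad \hat\tau^2 = \frac{y^T(P_{\hat X} - P_{X_{\hat S}})\,y - \hat\sigma^2\bigl[\mathrm{rank}(\hat X) - \mathrm{rank}(X_{\hat S})\bigr]}{\mathrm{tr}\bigl[Z^T(I_{N\times N} - P_{X_{\hat S}})Z\bigr]}.$$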

See (3.23) and (3.24).
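The two Henderson Method III estimators are straightforward to compute once the selected fixed-effect columns are available. The sketch below is one possible implementation under the assumption that projections are formed with a pseudoinverse (which handles rank deficiency) and that Ψ* = τ*2 I; the function names are hypothetical, and forming N × N projectors is only practical for moderate N.

```python
import numpy as np

def projector(A):
    """Orthogonal projector onto the column space of A."""
    return A @ np.linalg.pinv(A)

def henderson_method3(y, X_sel, Z):
    """Variance component estimates (sigma2_hat, tau2_hat).

    y     : (N,) response
    X_sel : (N, |S_hat|) selected fixed-effect columns
    Z     : (N, qM) random-effect design
    """
    N = y.shape[0]
    X_hat = np.hstack([X_sel, Z])
    P_full, P_x = projector(X_hat), projector(X_sel)
    r_full = np.linalg.matrix_rank(X_hat)
    r_x = np.linalg.matrix_rank(X_sel)
    I = np.eye(N)
    # Residual mean square after fitting both fixed and random-effect columns
    sigma2_hat = y @ (I - P_full) @ y / (N - r_full)
    # Extra sum of squares due to Z, corrected for its sigma^2 contribution
    numerator = y @ (P_full - P_x) @ y - sigma2_hat * (r_full - r_x)
    tau2_hat = numerator / np.trace(Z.T @ (I - P_x) @ Z)
    return sigma2_hat, tau2_hat
```

When the selected columns contain the true support, both estimators are unbiased, which is the easy case noted in the proof; the remainder of the argument deals with omitted signal variables.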

Consider the more interesting scenario where $|S \setminus \hat S| > 0$. If S is contained within $\hat S$, then it is straightforward to show that the variance component estimators are consistent (the true model is a submodel of the proposed one). We first prove, under the given assumptions, that $\hat\sigma^2 - \sigma^{*2} = o_P(1)$. Write $S_O = S \setminus \hat S$, ‘O’ for omitted. Then,

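$$\hat\sigma^2 = \frac{(\hat X\beta^*_{\hat S} + X_{S_O}\beta^*_{S_O} + \epsilon)^T(I_{N\times N} - P_{\hat X})(\hat X\beta^*_{\hat S} + X_{S_O}\beta^*_{S_O} + \epsilon)}{N - \mathrm{rank}(\hat X)} = \frac{\beta^{*T}_{S_O}X_{S_O}^T(I_{N\times N} - P_{\hat X})X_{S_O}\beta^*_{S_O}}{N - \mathrm{rank}(\hat X)} + \frac{2\beta^{*T}_{S_O}X_{S_O}^T(I_{N\times N} - P_{\hat X})\epsilon}{N - \mathrm{rank}(\hat X)} + \frac{\epsilon^T(I_{N\times N} - P_{\hat X})\epsilon}{N - \mathrm{rank}(\hat X)}. \tag{A.9}$$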

We proceed to show that the three parts to (A.9) satisfy

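$$\frac{\epsilon^T(I_{N\times N} - P_{\hat X})\epsilon}{N - \mathrm{rank}(\hat X)} - \sigma^{*2} = o_P(1), \tag{A.10}$$
$$\frac{2\beta^{*T}_{S_O}X_{S_O}^T(I_{N\times N} - P_{\hat X})\epsilon}{N - \mathrm{rank}(\hat X)} = o_P(1), \tag{A.11}$$
$$\text{and}\quad \frac{\beta^{*T}_{S_O}X_{S_O}^T(I_{N\times N} - P_{\hat X})X_{S_O}\beta^*_{S_O}}{N - \mathrm{rank}(\hat X)} = o(1), \tag{A.12}$$

which would imply that $\hat\sigma^2$ is indeed consistent for σ*2. Note that the term in (A.12) is $\mathrm{Bias}(\hat\sigma^2) = E[\hat\sigma^2] - \sigma^{*2}$.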

  1. Proving (A.11): Let $\Gamma_{\hat X}D_{\hat X}\Gamma_{\hat X}^T$ represent the eigendecomposition of $I_{N\times N} - P_{\hat X}$. The latter is idempotent, so the diagonal matrix $D_{\hat X}$, which is of rank $N - \mathrm{rank}(\hat X)$, has only 0s and 1s as its diagonal entries. It is straightforward to show that
    E[2βSOTXSOT(IN×NPX^)ϵNrank(X^)]=0andVar[2βSOTXSOT(IN×NPX^)ϵNrank(X^)]=4σ2βSOTXSOTΓX^DX^ΓX^TXSOβSO[Nrank(X^)]24σ2ΓX^TXSOβSO2Nrank(X^)4σ2ΓX^TXSO2Nrank(X^)×d2qlogpM=o(1).

    Statement (A.11) then follows from Chebyshev’s inequality.

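  2. Proving (A.10): Let $\chi_i^2(1)$, $i = 1, \dots, N - \mathrm{rank}(\hat X)$, be i.i.d. random variables following a χ2 distribution with 1 degree of freedom. Observe that
    $$\frac{\epsilon^T[I_{N\times N} - P_{\hat X}]\epsilon}{N - \mathrm{rank}(\hat X)} \overset{d}{=} \frac{\epsilon^T D_{\hat X}\epsilon}{N - \mathrm{rank}(\hat X)} \overset{d}{=} \frac{\sigma^{*2}\sum_{i=1}^{N - \mathrm{rank}(\hat X)}\chi_i^2(1)}{N - \mathrm{rank}(\hat X)} \tag{A.13}$$
    and
    $$\frac{\sigma^{*2}\sum_{i=1}^{N - \mathrm{rank}(\hat X)}\chi_i^2(1)}{N - \mathrm{rank}(\hat X)} - \sigma^{*2} = o_P(1). \tag{A.14}$$

    Here, (A.14) follows from Lemma 7 and the fact that NN′ + d + qM, which implies that $N - \mathrm{rank}(\hat X) \to \infty$ as M → ∞. Applying the Strong Law of Large Numbers (SLLN) for i.i.d. random variables, we arrive at (A.10).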

  3. Proving (A.12): We observe that
    Bias(σ^2)ΓX^TXSO2×βSO2×d2d2qlog(p)M=o(1),

    and we have completed our proof that σ^2 is consistent under the stated assumptions.

We now demonstrate that the same claim holds for τ^2. Expanding out y, we obtain, after some algebraic manipulation,

τ^2=βSOTXSOT(PX^PXS^)XSOβSOtr(ZTZ)+2βSOTXSOT(PX^PXS^)ϵtr(ZTZ)+ϵT(PX^PXS^)ϵtr(ZTZ)+2βSOTXSOT(IN×NPXS^)Zvtr(ZTZ)+vTZTZvtr(ZTZ)+2vTZT(IN×NPXS^)ϵtr(ZTZ)σ2[rank(X^)rank(XS^)]tr(ZTZ)Bias(σ^2)[rank(X^)rank(XS^)]tr(ZTZ), (A.15)

where we have defined $Z' = (I_{N\times N} - P_{X_{\hat S}})Z$. We set out to prove that the terms in (A.15) satisfy

2βSOTXSOT(PX^PXS^)ϵtr(ZTZ)=oP(1), (A.16)
ϵT(PX^PXS^)ϵtr(ZTZ)σ2[rank(X^)rank(XS^)]tr[ZTZ]=oP(1), (A.17)
2βSOTXSOT(IN×NPXS^)Zvtr(ZTZ)=oP(1), (A.18)
vTZT(IN×NPXS^)Zvtr(ZTZ)τ2=oP(1), (A.19)
2vTZT(IN×NPXS^)ϵtr(ZTZ)=oP(1), (A.20)

and

βSOTXSOT(PX^PXS^)XSOβSOtr(ZTZ)Bias(σ^2)[rank(X^)rank(XS^)]tr(ZTZ)=o(1). (A.21)

Let QZDZΓZT represent the singular value decomposition of Z′, with QZ′ and ΓZ′ of dimensions N × qM and qM × qM, respectively. Additionally, write the eigendecompositions of PX^PXS^ and IN×NPXS^ as ΓX^XS^DX^XS^ΓX^XS^T and ΓXS^DXS^ΓXS^T, respectively. Note that (A.19) and (A.21) make up Bias[τ^2]=E[τ^2]τ2.

To avoid repetition, some of the proofs are presented in abbreviated form.

  1. Proving (A.16): Clearly,
    E[2βSOTXSOT(PX^PXS^)ϵtr(ZTZ)]=0Var[2βSOTXSOT(PX^PXS^)ϵtr(ZTZ)]=4βSOTXSOT(PX^PXS^)XSOβSOtr[ZTZ]2=4βSOTXSOTΓX^XS^DX^XS^ΓX^XS^TXSOβSOtr[ZTZ]24×d2×qM×ΓX^XS^XSO×βSO2tr[ZTZ]2=o(1),

    following from (3.32) in Assumption 3 and Assumption 4.

  2. Proving (A.17): Orthogonality of ΓX^XS^ implies that ϵT(PX^PXS^)ϵ=dϵTDX^XS^ϵ, so
    E[ϵT(PX^PXS^)ϵtr(ZTZ)]=E[ϵTDX^XS^ϵtr(ZTZ)]=σ2[rank(X^)rank(XS^)]tr[ZTZ]
    and, using properties of quadratic forms, we have
    Var[ϵT(PX^PXS^)ϵtr(ZTZ)]=Var[ϵTDX^XS^ϵtr(ZTZ)]=2σ4rank(DX^XS^)tr(ZTZ)2=2σ4rank(Z)tr(ZTZ)21M=o(1),
    the latter relation following from (3.33) in Assumption 3. This proves (A.17).
  3. Proving (A.18): Proof is similar to that of (A.16), as we note that
    E[2βSOTXSOT(IN×NPXS^)Zvtr(ZTZ)]=0,Var[2βSOTXSOT(IN×NPXS^)Zvtr(ZTZ)]=4τ2βSOTXSOTQZDZ2QZTXSOβSOtr(ZTZ)24τ2tr(ZTZ)QZTXSOβSO2tr(ZTZ)2=o(1),
    having applied Assumptions 3 and 4 here.
  4. Proving (A.19): As for (A.17), using again properties of quadratic forms, we can show that
    E[vTZT(IN×NPXS^)Zvtr(ZTZ)]=τ2,Var[vTZT(IN×NPXS^)Zvtr(ZTZ)]=2τ4tr(DZ4)tr(DZ2)2=o(1),
    the last relation the result of (3.33) from Assumption 3.
  5. Proving (A.20): We can rewrite
    2vTZTϵtr(ZTZ)=2vTΓZDZQZTϵtr(ZTZ)=dστi=1qMsiBii=1qMsi2.
    where Bi, i = 1,… rank(Z′) are random variables formed as the product of two independent N(0, 1) random variables. Then,
    Var(i=1qMsiBi)=i=1qMsi2,
    which implies that
    Var(στi=1qMsiBii=1qMsi2)=1i=1qMsi2

    and since i=1qMsi2qM by (3.32) in Assumption 3, we have proven our claim (A.20).

  6. Proving (A.21): By the definition of Bias[τ^2], it is clear that
    Bias(τ^2)βSOTXSOT(PX^PXS^)XSOβSOtr(ZTZ)+Bias(σ^2)[rank(X^)rank(XS^)]tr(ZTZ)βSOTXSOTPZXSOβSOtr(ZTZ)+qM×Bias(σ^2)tr(ZTZ)qM×d×ΓZXSO2×βSO2tr(ZTZ)+qM×Bias(σ^2)tr(ZTZ)d2qlog(p)M=o(1),
    where the second last relation follows from the proven claim that Bias(σ^2) is o(1), (3.32) in Assumption 3 and Assumption 4.

Since the event u* ≤ λL(ξ − 1)/(ξ + 1) occurs with probability greater than 1 – 1/p → 1 as p → ∞,

$$\hat\sigma^2 - \sigma^{*2} = o_P(1), \qquad \hat\tau^2 - \tau^{*2} = o_P(1)$$

as claimed.

A.2.5. Proof of Theorem 9

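Suppose that u* ≤ λL(ξ − 1)/(ξ + 1), and write $S_O = S \setminus \hat S$. The OLS fit $\hat\beta^{\mathrm{init}}$ has a simple closed-form expression:

$$\hat\beta^{\mathrm{init}}_{\hat S} = (X_{\hat S}^TX_{\hat S})^{-1}X_{\hat S}^T y = \beta^*_{\hat S} + (X_{\hat S}^TX_{\hat S})^{-1}X_{\hat S}^TX_{S_O}\beta^*_{S_O} + (X_{\hat S}^TX_{\hat S})^{-1}X_{\hat S}^T(y - X\beta^*),$$

and $\hat\beta^{\mathrm{init}}_{\hat S^c} = 0$. Thus, by the triangle inequality,

$$\|\hat\beta^{\mathrm{init}}_{\hat S} - \beta^*_{\hat S}\|_1 \le \|(X_{\hat S}^TX_{\hat S})^{-1}X_{\hat S}^TX_{S_O}\beta^*_{S_O}\|_1 + \|(X_{\hat S}^TX_{\hat S})^{-1}X_{\hat S}^T(y - X\beta^*)\|_1. \tag{A.22}$$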

We proceed by first bounding the first term on the right-hand side of (A.22). By Assumption 5,

(XS^TXS^)1XS^T2νmax[(XS^TXS^)1XS^TXS^(XS^TXS^)1]=νmax[(XS^TXS^)1]=1Nνmax[(XS^TXS^N)1]=1Nνmin(XS^TXS^N)1Nψ(N,S)1N.

This, in turn, implies that

(XS^TXS^)1XS^TXSOβSO1N+d(XS^TXS^)1XS^TXSOβSO2N+d(XS^TXS^)1XS^T2XSOβSO2N+d(XS^TXS^)1XS^T2XSOFβSO(N+S)dψ(N,S)2ξλL(ξ+1)ζd2qlog(p)M=o(1),

where the last relation follows from Assumption 2.

We proceed to bound the second component on the right-hand side of (A.22). We observe that

(XS^TXS^)1XS^Tϵ1N+d(XS^TXS^)1XS^T(yXβ)2N+d(XS^TXS^N)1XS^T(yXβ)N2N+d(XS^TXS^N)12XS^T(yXβ)N2(N+d)(XS^TXS^N)12XS^T(yXβ)NN+dκ(N,S)λLd2qlog(p)M=o(1).

From Lemma 6,

β^SOinitβSO=βSOqlog(p)M,

which implies that

β^SOinitβSO1d2qlog(p)M=o(1).

By Lemma 11, the event u* ≤ λL(ξ − 1)/(ξ + 1) occurs with probability exceeding 1 – 1/p. Combined, we obtain the desired result.

A.3. Empirical Evaluation of Assumption 4

To assess the stringency of Assumption 4 compared to the irrepresentability condition (Zhao and Yu, 2006), we conduct a simulation study similar to Zhao and Yu (2006), but customized to our mixed linear model setting.

Consider the model

y=Xβ+Zν+ε,

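with $X \in \mathbb{R}^{nM\times p}$ and $Z \in \mathbb{R}^{nM\times qM}$, $\beta \in \mathbb{R}^p$, $\nu \in \mathbb{R}^{qM}$ and $\varepsilon \in \mathbb{R}^{nM}$. We consider q = 2 random effects, M = 25 groups, and n = 20 samples within each group. Among the p fixed effect covariates, we set d to have nonzero coefficients and the rest to have zero coefficients. More specifically, we set $\beta = (\underbrace{1, \dots, 1}_{d}, \underbrace{0, \dots, 0}_{p-d})^T$.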

To assess the stringency of the two assumptions, in each of B = 1000 simulation replications, we randomly generate design matrices X and Z jointly as [X, Zu] ~iid N(0, Σ), where Zu is the un-blocked version of Z. The covariance matrix Σ is generated from a Wishart(p+q, Ip+q) distribution, and X and Z are scaled such that $\|x_j\|_2^2 = nM$ and $\|z_j\|_2^2 = n$. Following Zhao and Yu (2006), we consider $p = 2^k$ for k ∈ {3, 4,…, 8}, and set d = tp/8 for t ∈ {1, 2,…, 8}.

Let A* be the index of the true active set (hence ∣A*∣ = d). Further, let S ⊂ (A*)c with ∣S∣ = min(pd, p) be a random subset of variables with zero coefficients. For jA*, let X~=[XA{j},XS,Z] be the augmented design matrix, and denote Σ^=X^TX^N.

With the above notations, the irrepresentability condition is satisfied if TIR = maxjA* TIR,j < 1, where

TIR,j=Σ^(A{j})c,A{j}(Σ^(A{j})c,A{j})1sign(βA{j}).

Assumption 4 involves a related quantity, T4,j=ΓX~xj, where ΓX~DX~ΓX~T is the eigen-decomposition of X~(X~X~T)1X~T. However, this assumption is satisfied if T4 ≡ maxjA* T4,j = O(1). Thus, to satisfy Assumption 4, we need a constant C, not dependent on N and p, such that T4 < C. While C can be any large but fixed constant, in this simulation we consider a moderate value of C = 5.

The proportions of simulated data sets in which the irrepresentability assumption and Assumption 4 are satisfied are shown in Tables 2 and 3, respectively. As in Zhao and Yu (2006), the results in Table 2 indicate that the irrepresentability assumption can be stringent, especially as the dimension p and the number of nonzero coefficients d increase. In contrast, the results in Table 3 suggest that, for C = 5, Assumption 4 is much more likely to hold. Moreover, the proportion of cases for which this assumption holds changes only mildly with p and d, remaining above 0.92 in all settings considered. While the appropriate choice of C is generally unknown, the results in this simulation suggest that even with moderate values (in this case C = 5) Assumption 4 is likely satisfied.

Table 2:

Proportions of cases satisfying the irrepresentability condition (Zhao and Yu, 2006).

| TIR < 1 | p = 8 | p = 16 | p = 32 | p = 64 | p = 128 | p = 256 |
|---|---|---|---|---|---|---|
| d = p/8 | 1 | 1 | 0.975 | 0.823 | 0.327 | 0.022 |
| d = 2p/8 | 1 | 0.735 | 0.283 | 0.013 | 0 | 0 |
| d = 3p/8 | 0.736 | 0.231 | 0.014 | 0 | 0 | 0 |
| d = 4p/8 | 0.356 | 0.062 | 0.001 | 0 | 0 | 0 |
| d = 5p/8 | 0.244 | 0.024 | 0 | 0 | 0 | 0 |
| d = 6p/8 | 0.208 | 0.015 | 0 | 0 | 0 | 0 |
| d = 7p/8 | 0.199 | 0.012 | 0 | 0 | 0 | 0 |

Table 3:

Proportions of cases satisfying Assumption 4.

| T4 < 5 | p = 8 | p = 16 | p = 32 | p = 64 | p = 128 | p = 256 |
|---|---|---|---|---|---|---|
| d = p/8 | 0.998 | 0.999 | 0.997 | 0.996 | 0.994 | 0.99 |
| d = 2p/8 | 1 | 1 | 0.996 | 0.995 | 0.99 | 0.982 |
| d = 3p/8 | 0.998 | 0.999 | 0.994 | 0.992 | 0.985 | 0.973 |
| d = 4p/8 | 0.999 | 0.998 | 0.991 | 0.995 | 0.984 | 0.954 |
| d = 5p/8 | 0.996 | 0.995 | 0.993 | 0.989 | 0.976 | 0.938 |
| d = 6p/8 | 1 | 0.997 | 0.993 | 0.987 | 0.967 | 0.93 |
| d = 7p/8 | 0.996 | 0.995 | 0.995 | 0.988 | 0.956 | 0.927 |

References

  1. Bühlmann P. Statistical significance in high-dimensional linear models. Bernoulli, 19(4):1212–1242, 2013. doi: 10.3150/12-BEJSP11.
  2. Bühlmann P, Kalisch M, and Meier L. High-dimensional statistics with a view toward applications in biology. Annual Review of Statistics and Its Application, 1:255–278, 2014.
  3. Harville DA. Maximum likelihood approaches to variance component estimation and to related problems. J. Amer. Statist. Assoc., 72(358):320–340, 1977. URL http://links.jstor.org.offcampus.lib.washington.edu/sici?sici=0162-1459(197706)72:358<320:MLATVC>2.0.CO;2-9&origin=MSN. With a comment by J. N. K. Rao and a reply by the author.
  4. Henderson CR. Estimation of variance and covariance components. Biometrics, 9:226–252, 1953. doi: 10.2307/3001853.
  5. Javanmard A and Montanari A. Confidence intervals and hypothesis testing for high-dimensional regression. Journal of Machine Learning Research, 15:2869–2909, 2014. URL http://jmlr.org/papers/v15/javanmard14a.html.
  6. Jiang J. REML estimation: asymptotic behavior and related topics. Ann. Statist., 24(1):255–286, 1996. doi: 10.1214/aos/1033066209.
  7. Lee JD, Sun DL, Sun Y, and Taylor JE. Exact post-selection inference, with application to the lasso. Ann. Statist., 44(3):907–927, 2016. doi: 10.1214/15-AOS1371.
  8. Li Y, Wang S, Song PX-K, Wang N, Zhou L, and Zhu J. Doubly regularized estimation and selection in linear mixed-effects models for high-dimensional longitudinal data. Stat. Interface, 11(4):721–737, 2018. doi: 10.4310/SII.2018.v11.n4.a15.
  9. Lockhart R, Taylor J, Tibshirani RJ, and Tibshirani R. A significance test for the lasso. Ann. Statist., 42(2):413–468, 2014. doi: 10.1214/13-AOS1175.
  10. Meinshausen N and Bühlmann P. Stability selection. J. R. Stat. Soc. Ser. B Stat. Methodol., 72(4):417–473, 2010. doi: 10.1111/j.1467-9868.2010.00740.x.
  11. Meinshausen N, Meier L, and Bühlmann P. p-values for high-dimensional regression. J. Amer. Statist. Assoc., 104(488):1671–1681, 2009. doi: 10.1198/jasa.2009.tm08647.
  12. Ning Y and Liu H. A general theory of hypothesis tests and confidence regions for sparse high dimensional models. Ann. Statist., 45(1):158–195, 2017. doi: 10.1214/16-AOS1448.
  13. Schelldorfer J, Bühlmann P, and van de Geer S. Estimation for high-dimensional linear mixed-effects models using ℓ1-penalization. Scand. J. Stat., 38(2):197–214, 2011. doi: 10.1111/j.1467-9469.2011.00740.x.
  14. Searle SR. Another look at Henderson’s methods of estimating variance components. Biometrics, 24(4):749–787, 1968. URL http://www.jstor.org/stable/2528870.
  15. Shah RD and Samworth RJ. Variable selection with error control: another look at stability selection. J. R. Stat. Soc. Ser. B Stat. Methodol., 75(1):55–80, 2013. doi: 10.1111/j.1467-9868.2011.01034.x.
  16. Shao J and Deng X. Estimation in high-dimensional linear models with deterministic design matrices. Ann. Statist., 40(2):812–831, 2012. doi: 10.1214/12-AOS982.
  17. Sun T and Zhang C-H. Scaled sparse linear regression. Biometrika, 99(4):879–898, 2012. doi: 10.1093/biomet/ass043.
  18. Tibshirani RJ, Taylor J, Lockhart R, and Tibshirani R. Exact post-selection inference for sequential regression procedures. ArXiv e-prints, Jan. 2014.
  19. van de Geer S, Bühlmann P, Ritov Y, and Dezeure R. On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist., 42(3):1166–1202, 2014. doi: 10.1214/14-AOS1221.
  20. Wasserman L and Roeder K. High-dimensional variable selection. Ann. Statist., 37(5A):2178–2201, 2009. doi: 10.1214/08-AOS646.
  21. Ye F and Zhang C-H. Rate minimaxity of the lasso and Dantzig selector for the ℓq loss in ℓr balls. J. Mach. Learn. Res., 11:3519–3540, Dec. 2010. URL http://dl.acm.org/citation.cfm?id=1756006.1953043.
  22. Yu Y, Bradic J, and Samworth RJ. Confidence intervals for high-dimensional Cox models. ArXiv e-prints, Sept. 2018.
  23. Zhang C-H and Zhang SS. Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Stat. Soc. Ser. B Stat. Methodol., 76(1):217–242, 2014. doi: 10.1111/rssb.12026.
  24. Zhao P and Yu B. On model selection consistency of lasso. Journal of Machine Learning Research, 7(Nov):2541–2563, 2006.
