Abstract
This paper concerns the development of an inferential framework for high-dimensional linear mixed effect models. These are suitable models, for instance, when we have n repeated measurements for M subjects. We consider a scenario where the number of fixed effects p is large (and may be larger than M), but the number of random effects q is small. Our framework is inspired by a recent line of work that proposes de-biasing penalized estimators to perform inference for high-dimensional linear models with fixed effects only. In particular, we demonstrate how to correct a ‘naive’ ridge estimator in extension of work by Bühlmann (2013) to build asymptotically valid confidence intervals for mixed effect models. We validate our theoretical results with numerical experiments, in which we show our method outperforms those that fail to account for correlation induced by the random effects. For a practical demonstration we consider a riboflavin production dataset that exhibits group structure, and show that conclusions drawn using our method are consistent with those obtained on a similar dataset without group structure.
1. Introduction
Modern statistical problems are increasingly high-dimensional, with the number of covariates p potentially vastly exceeding the sample size N. This is due in part to technological advances that facilitate data collection. For instance, we are now able to measure the expression of many genes in a given specimen at little cost. However, it often remains expensive to have many replicates/species to experiment on, resulting in N ≪ p.
Fortunately, significant progress has been made in developing rigorous statistical tools for tackling such problems. While earlier work largely targeted point estimation and/or variable selection, recent years have seen a number of proposals on how to also assign uncertainty, statistical significance and confidence in high-dimensional models. This is of great practical importance, particularly when interpretation of parameters and variables is of key priority.
Early attempts are highly varied in their approach. Stability selection was proposed by Meinshausen and Bühlmann (2010) as a generic method for controlling the expected number of false positive selections, with improvements given by Shah and Samworth (2013). Sample splitting, where a first subsample is used to screen and a second subsample is used to perform inference (Wasserman and Roeder, 2009; Meinshausen et al., 2009), has also been explored. Taking an alternative approach, Lockhart et al. (2014), Tibshirani et al. (2014) and Lee et al. (2016) build a framework for conditional inference for high-dimensional linear models, i.e., they conduct inference given that some covariates have been selected.
In this paper, we propose an unconditional inferential framework for high-dimensional linear mixed effect models, with the goal of testing null hypotheses of the form
$$H_{0,G}: \beta_j^* = 0 \ \text{ for all } j \in G, \tag{1.1}$$
where $\beta^* \in \mathbb{R}^p$ is the vector of fixed effect regression coefficients, and G may be any subset of {1,…,p}. Of particular interest is the case G = {j}, i.e., testing if a single fixed effect coefficient is zero. A related goal is to construct confidence intervals for $\beta_j^*$, j = 1,…,p. This problem arises naturally in many settings, as observations are rarely independent. A prime example is the analysis of longitudinal data, which is highly prevalent in clinical studies. In such settings mixed effect models are a natural extension of linear models for modeling data exhibiting group-structured dependence.
Our framework is inspired by a recent line of work that proposes de-biasing penalized estimators as an approach to inference for high-dimensional linear models with fixed effects only. There, the limiting distribution of the modified estimator is tractable and, thus, can be used to construct approximate p-values and confidence intervals. For example, in high-dimensional linear regression, Zhang and Zhang (2014), van de Geer et al. (2014), and Javanmard and Montanari (2014) suggest de-sparsifying the lasso: starting with the biased lasso estimator, the authors ‘invert’ the corresponding Karush-Kuhn-Tucker (KKT) optimality conditions to form an estimator that is approximately unbiased for β* and normally distributed. By construction, the de-biased estimator can then be used to derive confidence intervals and p-values. Ning and Liu (2017) extended this strategy by developing a score test for inference in penalized M-estimators.
Our proposed method bears strongest resemblance to Bühlmann (2013). Developed for high-dimensional linear models, the framework of Bühlmann (2013) is similar to those put forth by Zhang and Zhang (2014), van de Geer et al. (2014), and Javanmard and Montanari (2014), except that it uses ridge estimation as a starting point. While the overall framework we consider is similar, there are important differences in the specifics of how to correct (or rather, approximately correct) for the bias in the ridge estimator, and how to compute an approximation of the limiting distribution of the de-biased estimator, to construct p-values and confidence intervals for elements in β*. As will be evident later, these differences are direct results of having to cope with dependencies induced by the random effects in the linear mixed effect model. The naive treatment of ignoring the dependencies, as we demonstrate in numerical examples, leads to poor practical performance (in particular, when the estimator is inverted to obtain confidence intervals, these intervals have insufficient coverage). We address this issue by introducing a two-stage procedure that yields consistent estimates of the parameters that determine these dependencies. While we describe a ridge-based framework, the methodology could be extended to make use of other high-dimensional estimators as the starting point for constructing a de-biased estimator.
Our decision to use a ridge estimator is based on simulation findings for standard linear models showing that while asymptotically optimal, confidence intervals from ℓ1-based de-biasing (Zhang and Zhang, 2014; van de Geer et al., 2014; Javanmard and Montanari, 2014) tend to have coverage problems in finite samples. Yu et al. (2018) similarly noticed that confidence intervals based on a de-biased ℓ1-estimator for high-dimensional Cox model had poorer than theoretical coverage in practice. Although its theoretical justification is similar, the ridge-based method of Bühlmann (2013) yields better finite-sample error control.
Our paper is organized as follows. The remainder of this section provides a brief overview of the subsequent notation. Section 2 makes explicit the form of the high-dimensional linear mixed effect model we are working with. In Section 3, we describe the details of our method: specifically, how it builds upon Bühlmann (2013) to accommodate dependence within groups induced by the random effects. We also present theory, along with the required assumptions, to justify it. Numerical experiments can be found in Section 4, followed by a practical application of the method in Section 5. We conclude with a discussion and elaborate on potential extensions in Section 6. Proofs are collected in the Appendix.
Notation
Matrices are written in upper-case bold-face and their entries in corresponding lower-case. So ajk is the (j, k)th entry of matrix $A \in \mathbb{R}^{n_1 \times n_2}$. For j ∈ {1,…,n2} and J ⊆ {1,…,n2}, aj and AJ denote the jth column of A and the column-wise concatenation of columns in A indexed by the set J, respectively. The ith row of A is denoted a(i). For r ∈ [1, ∞], the ℓr norm of a vector $a$ is $\|a\|_r = (\sum_i |a_i|^r)^{1/r}$ (with $\|a\|_\infty = \max_i |a_i|$), and the induced norm of a matrix is ⫼A⫼r = $\sup_{x \neq 0} \|Ax\|_r / \|x\|_r$. With this notation, ⫼A⫼2 is the spectral norm, ⫼A⫼1 the maximum absolute column sum of the matrix, and ⫼A⫼∞ the maximum absolute row sum of the matrix. We use ∥A∥r to denote the ℓr norm of the vectorization of A.
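As a quick illustration of the matrix norms just described, the following minimal sketch evaluates them with NumPy on a small matrix. The matrix and the variable names are chosen purely for illustration and do not appear in the paper.

```python
import numpy as np

# Small illustrative matrix (hypothetical values).
A = np.array([[1.0, -2.0],
              [3.0,  4.0]])

spectral = np.linalg.norm(A, 2)           # induced l2 norm: largest singular value
max_col_sum = np.linalg.norm(A, 1)        # induced l1 norm: maximum absolute column sum
max_row_sum = np.linalg.norm(A, np.inf)   # induced l-infinity norm: maximum absolute row sum
vec_l1 = np.linalg.norm(A.ravel(), 1)     # l1 norm of the vectorization of A
```

Note that the induced ℓ1 norm and the ℓ1 norm of the vectorization generally differ, which is why the two notations are kept separate.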
The projection onto the linear space generated by the columns of A is denoted PA = A(ATA)−AT, where A− is the Moore-Penrose inverse of A. For square matrices A1 and A2 of the same dimensions, A1 ≤ A2 indicates that A2 – A1 is positive semi-definite.
For real-valued functions g1(x) and g2(x) defined on (0, ∞), we write g1(x) ≲ g2(x) if there is a constant c ∈ (0, ∞) such that g1(x) ≤ cg2(x), and g1(x) ≳ g2(x) if instead g1(x) ≥ cg2(x). We write g1(x) ≍ g2(x) if both g1(x) ≲ g2(x) and g1(x) ≳ g2(x). Then, g1(x) = o(g2(x)) if g1(x)/g2(x) → 0 as x → ∞, and g1(x) = O(g2(x)) if there is a c ∈ (0, ∞) such that ∣g1(x)∣ ≤ cg2(x) for all x large enough. The latter relations also apply when x is a vector, where x → ∞ is interpreted elementwise. Finally, if X is a random variable and a is some constant, we write ∣X – a∣ = oP(1) if X converges to a in probability, i.e., X →p a.
2. The linear mixed effect model
Consider M groups of observations of sizes n1,…,nM. Let m = 1,…, M be group indices, and let i = 1,…,nm index the observations within group m. Let N be the total number of observations, so $N = \sum_{m=1}^{M} n_m$. We may later assume, without loss of generality, that nm = n for all groups, so that N = nM. The proposed framework allows for non-uniform group sizes with minor adjustments, so long as the group sizes are of the same order.
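For group m ∈ {1,…,M}, we observe the response vector $y_m \in \mathbb{R}^{n_m}$, generated as
$$y_m = X_m \beta^* + Z_m v_m + \epsilon_m, \tag{2.1}$$
with
$\beta^* \in \mathbb{R}^p$, an unknown vector of fixed regression coefficients;
$v_m \in \mathbb{R}^q$, m = 1,…,M, vectors of group-specific random effects, with $v_m \sim N(0, \Psi^*)$ i.i.d., Ψ* an unknown q × q positive definite covariance matrix;
errors $\epsilon_m \sim N(0, \sigma^{*2} I_{n_m \times n_m})$ for unknown σ*2, which are independent of v1,…,vM; and
$X_m \in \mathbb{R}^{n_m \times p}$ and $Z_m \in \mathbb{R}^{n_m \times q}$ known design matrices.
By construction, β* represents effects shared across groups while vm, m = 1,…,M, represent group-specific deviations. It will be convenient to write the model more compactly. Define vectors $y = (y_1^T,\ldots,y_M^T)^T$, $v = (v_1^T,\ldots,v_M^T)^T$, $\epsilon = (\epsilon_1^T,\ldots,\epsilon_M^T)^T$, a stacked matrix $X = (X_1^T,\ldots,X_M^T)^T$, and Z = diag(Z1,…,ZM). Then we can write (2.1) as
$$y = X\beta^* + Zv + \epsilon. \tag{2.2}$$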
Marginalizing out the random effects yields
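$$y \sim N\big(X\beta^*,\, V(\sigma^{*2}, \Psi^*)\big), \qquad V(\sigma^{*2}, \Psi^*) = \sigma^{*2} I_{N \times N} + Z\, \Psi^{*(B)} Z^T, \tag{2.3}$$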
where Ψ*(B) = IM×M ⊗ Ψ*. This implies that V(σ*2, Ψ*) is block-diagonal and observations belonging to different groups are independent. Thus, the inclusion of random effects only induces dependencies between observations belonging to the same group. We will be primarily working with the marginal form (2.3) in subsequent sections.
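To make the block structure of the marginal covariance concrete, the sketch below simulates responses group by group and assembles V block by block. It assumes the standard mixed model form in which each group's response is X_m β* + Z_m v_m + ϵ_m; the function names `simulate_lmm` and `marginal_covariance`, and all dimensions and parameter values, are hypothetical and used only for illustration.

```python
import numpy as np

def simulate_lmm(X_blocks, Z_blocks, beta, sigma2, Psi, rng):
    """Draw y_m = X_m beta + Z_m v_m + eps_m for each group and stack the results."""
    q = Psi.shape[0]
    ys = []
    for X_m, Z_m in zip(X_blocks, Z_blocks):
        v_m = rng.multivariate_normal(np.zeros(q), Psi)               # group random effect
        eps_m = rng.normal(scale=np.sqrt(sigma2), size=X_m.shape[0])  # i.i.d. errors
        ys.append(X_m @ beta + Z_m @ v_m + eps_m)
    return np.concatenate(ys)

def marginal_covariance(Z_blocks, sigma2, Psi):
    """Assemble V = sigma2 * I + Z (I_M kron Psi) Z^T one diagonal block at a time."""
    N = sum(Z_m.shape[0] for Z_m in Z_blocks)
    V = np.zeros((N, N))
    start = 0
    for Z_m in Z_blocks:
        n_m = Z_m.shape[0]
        V[start:start + n_m, start:start + n_m] = sigma2 * np.eye(n_m) + Z_m @ Psi @ Z_m.T
        start += n_m
    return V

# Illustrative dimensions (assumptions, not taken from the paper).
rng = np.random.default_rng(0)
M, n, p, q = 3, 4, 5, 2
X_blocks = [rng.normal(size=(n, p)) for _ in range(M)]
Z_blocks = [rng.normal(size=(n, q)) for _ in range(M)]
beta = np.array([1.0, -1.0, 0.0, 0.0, 0.0])
sigma2, Psi = 0.25, np.eye(q)

y = simulate_lmm(X_blocks, Z_blocks, beta, sigma2, Psi, rng)
V = marginal_covariance(Z_blocks, sigma2, Psi)
```

The off-diagonal blocks of `V` are exactly zero, reflecting that observations from different groups are independent.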
We study the presented model under the following assumptions:
High dimensions: We allow p, the number of fixed regression coefficients, to be possibly much larger than N. On the other hand, q, the number of random effect variables, is assumed to be of constant order, or at least smaller than n.
Sparsity of β*: We assume β* to be sparse in the sense that most of its elements are zero: a more precise specification on the level of sparsity required is detailed in Section 3.2.
Structure of Ψ*: Our paper primarily considers the scenario of Ψ* = τ*2Iq×q. However, our method, and corresponding theoretical results, can be extended to accommodate the more general scenario of Ψ* = D* where D* is a diagonal q × q matrix.
Standardization of design matrices: The design matrices X and Z are assumed fixed and standardized with for j ∈ {1,…,p} and for j ∈ {1,…,qM}.
3. A ridge-based inferential framework
We would like to test null hypotheses of the form (1.1), i.e., $\beta_j^* = 0$ for all j ∈ G, for subsets G ⊂ {1,…,p}, and construct confidence intervals for $\beta_j^*$. This section formally introduces our inferential framework. We first describe its foundation, the de-biased ridge estimator, and show how it can be used to accomplish these tasks. We then detail how to assemble the components needed to construct this de-biased ridge estimator and approximate its limiting distribution. Theoretical justification of our approach is provided along the way.
3.1. A de-biased ridge estimator
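As in Bühlmann (2013), our starting point is the ridge estimator given by
$$\hat\beta = \arg\min_{\beta \in \mathbb{R}^p} \left\{ \|y - X\beta\|_2^2 / N + \lambda \|\beta\|_2^2 \right\}. \tag{3.1}$$
This estimator is natural in models with homoscedastic and uncorrelated errors, but in the linear mixed effect model, the random effects result in correlation. We thus refer to $\hat\beta$ from (3.1) as the ‘naive’ ridge estimator. The estimator has a simple closed form expression,
$$\hat\beta = (\hat\Sigma + \lambda I_{p \times p})^{-1} X^T y / N, \tag{3.2}$$
where $\hat\Sigma = X^T X / N$. It is straightforward to show that the ridge estimator is normally distributed with covariance matrix, multiplied by a factor of N,
$$\Omega^* = (\hat\Sigma + \lambda I_{p \times p})^{-1}\, \frac{X^T V(\sigma^{*2}, \Psi^*)\, X}{N}\, (\hat\Sigma + \lambda I_{p \times p})^{-1}. \tag{3.3}$$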
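The closed form of the naive ridge estimator and its scaled covariance under correlated errors can be sketched as follows. This is a minimal illustration that assumes the ridge objective is normalized as in Bühlmann (2013), i.e., the residual sum of squares is divided by N before the penalty λ∥β∥² is added; the function names and the random-intercept design are assumptions made for the example.

```python
import numpy as np

def naive_ridge(X, y, lam):
    """Minimizer of ||y - X b||^2 / N + lam * ||b||^2 (assumed normalization)."""
    N, p = X.shape
    Sigma_hat = X.T @ X / N
    return np.linalg.solve(Sigma_hat + lam * np.eye(p), X.T @ y / N)

def scaled_ridge_covariance(X, V, lam):
    """N * Cov(beta_hat) when Cov(y) = V, obtained by propagating V through the linear map."""
    N, p = X.shape
    Sigma_hat = X.T @ X / N
    A = np.linalg.inv(Sigma_hat + lam * np.eye(p))
    return A @ (X.T @ V @ X / N) @ A

# Illustrative random-intercept design with p > N (hypothetical values).
rng = np.random.default_rng(1)
M, n, p = 5, 4, 30
N = M * n
X = rng.normal(size=(N, p))
Z = np.kron(np.eye(M), np.ones((n, 1)))
sigma2, tau2 = 0.25, 1.0
V = sigma2 * np.eye(N) + tau2 * (Z @ Z.T)
y = rng.multivariate_normal(np.zeros(N), V)   # beta* = 0 for simplicity

lam = 1.0 / N
beta_hat = naive_ridge(X, y, lam)
Omega = scaled_ridge_covariance(X, V, lam)
```

Replacing `V` by σ²I recovers the covariance used in the independent-error setting, which is what a procedure that ignores the random effects would implicitly use.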
As in Bühlmann (2013), we assume that the diagonal entries of satisfy
| (3.4) |
Likewise, we do not require (3.4) to be bounded away from 0 as a function of N or p. This condition, in fact, is fairly mild; it is only violated under special kinds of design matrices. To illustrate, define R ≡ rank(X) and let X = QDΓT be the singular value decomposition with left singular vectors satisfying QTQ = IN×N, a diagonal matrix with entries s1 ≥ … ≥ sN (i.e., singular values of X), and right singular vectors satisfying ΓTΓ = IN×N. Let νmin(A) and νmax(A) be the smallest and largest eigenvalue of any square matrix A, respectively. We can then show the following.
Lemma 1. Condition (3.4) holds if and only if X ≠ 0 and
| (3.5) |
In the high-dimensional case with R ≤ N < p, the parameter β* is not identifiable: many vectors satisfy Xβ* = Xθ. A natural parameter to consider, as noted in Shao and Deng (2012), is θ* = PXTβ* = XT(XXT)−Xβ* = ΓΓTβ*, the projection of β* onto the linear space generated by the rows of X. As it turns out, under condition (3.4), or equivalently (3.5), the ridge estimator is a reasonable proxy for θ* when λ is sufficiently small.
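The non-identifiability of β* and the role of the projected parameter θ* can be checked numerically. The sketch below uses the identity that the Moore-Penrose inverse of X equals XT(XXT)−, so that `pinv(X) @ X` is the projection onto the row space of X; the dimensions and coefficient values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 8, 20                      # high-dimensional: N < p
X = rng.normal(size=(N, p))

beta_star = np.zeros(p)
beta_star[:3] = [1.0, -2.0, 0.5]  # hypothetical sparse coefficient vector

# Projection onto the row space of X: P = X^T (X X^T)^- X.
P = np.linalg.pinv(X) @ X
theta_star = P @ beta_star

# theta_star and beta_star give the same fitted values but are different vectors.
same_fit = np.allclose(X @ theta_star, X @ beta_star)
differ = not np.allclose(theta_star, beta_star)
```

Because θ* is generally dense even when β* is sparse, the difference θ* − β* is the projection bias that has to be corrected.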
Proposition 2. Suppose that λ > 0 and (3.4), or equivalently, (3.5), holds. Then, under our linear mixed effect model from Section 2, the ridge estimator (3.2) satisfies
where refers to the smallest non-zero eigenvalue of .
Proposition 2, which is proven in the Appendix, implies that the bias in estimating θ* with is small when λ > 0 is sufficiently small. We explicitly quantify how small λ needs to be for the estimation bias to be smaller than the standard error of .
Corollary 3. Suppose that the ridge penalty parameter λ > 0 is chosen such that , and that condition (3.4), or equivalently, (3.5) holds. Then,
Our interest, however, lies in β*, not θ*. Thus, for to be useful, we need to adjust for the projection bias . By definition of θ*, one observes that
| (3.6) |
which, under the null hypothesis H0,j : , becomes,
| (3.7) |
The quantity can be approximated by
| (3.8) |
where is a consistent initial estimator of β* (and consistency occurs under additional assumptions). Consider then the corrected ridge estimator as a statistic for testing H0,j:
| (3.9) |
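The correction step can be sketched in code. Writing P for the projection XT(XXT)−X, the definition of θ* gives θ*j − β*j = (Pjj − 1)β*j + Σk≠j Pjk β*k, so under H0,j the bias reduces to the sum over k ≠ j, which is then estimated by plugging in an initial estimator. The function name `corrected_ridge`, the normalization of the ridge objective, and the example values are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def corrected_ridge(X, y, lam, beta_init):
    """Ridge estimate minus the plug-in estimate of the projection bias under H_{0,j}."""
    N, p = X.shape
    beta_ridge = np.linalg.solve(X.T @ X / N + lam * np.eye(p), X.T @ y / N)
    P = np.linalg.pinv(X) @ X                             # projection onto the row space of X
    # For each j: sum over k != j of P_jk * beta_init_k.
    bias_hat = P @ beta_init - np.diag(P) * beta_init
    return beta_ridge - bias_hat

# Noise-free sanity check with an oracle initial estimator (illustrative only).
rng = np.random.default_rng(0)
N, p = 8, 20
X = rng.normal(size=(N, p))
beta_star = np.zeros(p)
beta_star[:3] = [1.0, -2.0, 0.5]
b = corrected_ridge(X, X @ beta_star, lam=1e-6, beta_init=beta_star)
P_diag = np.diag(np.linalg.pinv(X) @ X)
```

In this idealized setting the corrected estimate is approximately Pjj β*j, so it vanishes (up to the small ridge bias) for every j with β*j = 0, which is the behavior a test statistic for H0,j needs.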
Assuming that , we can write
where
A rearrangement of the above set of equations yields
| (3.10) |
Then, from model (2.3), it follows that
| (3.11) |
The normalizing factors needed to bring the Wj to N(0, 1) scale are given by . The proof is straightforward.
Theorem 4. Suppose we choose the ridge penalty parameter λ > 0 such that
| (3.12) |
and assume that for our choice of , there exist constants Cj = Cj(N, p) such that
| (3.13) |
Then, under the null hypothesis, H0,j, for all w > 0,
| (3.14) |
where . In addition, for any sequence of subsets Gp ⊆ {1,…,p}, if H0,Gp is true, then for any w > 0,
| (3.15) |
In subsequent sections, we identify specific scalings of N and p such that Theorem 4 becomes applicable. Based on the asymptotic distributions in Theorem 4, we can construct p-values for testing H0,G, G ⊆ {1,…,p}. For testing the individual null hypothesis H0,j, we define the p-value for the two-sided alternative as
| (3.16) |
where Φ is the standard normal distribution function. For testing the group null hypothesis H0,G, ∣G∣ > 1, we define the p-value as
| (3.17) |
where W1,…,Wp are as in (3.11). From Theorem 4, we can derive the following corollary.
Corollary 5. Under the conditions in Theorem 4, for any α ∈ (0, 1), the following statements hold:
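For intuition on how a two-sided p-value for a single coefficient might be computed from the corrected estimator, the following sketch assumes that the p-value takes the same form as in Bühlmann (2013): the normalized statistic is shrunk towards zero by the bound Cj before being compared with the standard normal distribution. The exact expression used in this paper is given in (3.16); the function name and arguments below are hypothetical.

```python
import math

def two_sided_pvalue(b_hat_j, kappa_j, C_j):
    """Assumed form: 2 * (1 - Phi((kappa_j * |b_hat_j| - C_j)_+))."""
    z = max(kappa_j * abs(b_hat_j) - C_j, 0.0)        # positive part
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF
    return 2.0 * (1.0 - Phi)

# Illustrative calls (values are arbitrary).
p_small_effect = two_sided_pvalue(b_hat_j=0.01, kappa_j=10.0, C_j=0.5)
p_large_effect = two_sided_pvalue(b_hat_j=0.50, kappa_j=10.0, C_j=0.5)
```

Subtracting Cj makes the test conservative: any statistic that does not exceed the bound on the remaining bias yields a p-value of one.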
3.2. Consistent estimation of variance parameters
As presented, the de-biased ridge framework depends on the values of the unknown parameters σ*2 and τ*2. We employ a two-step approach to consistent estimation of these parameters.
Step 1. Let $S = \{j : \beta_j^* \neq 0\}$ be the support of β*, with cardinality d = ∣S∣. We use the Lasso estimator with an appropriate choice of tuning parameter λL to identify an initial guess of the elements (i.e., indices) in S. We define $\hat{S}$, the set of indices with non-zero Lasso coefficients, as our guess for the support S. By properties of the Lasso, $|\hat{S}| \le \min(N, p)$, although, in general, $\hat{S}$ may not be a good estimate of S.
Step 2. Working with the (potentially misspecified) random effects model
$$y = X_{\hat{S}}\, \beta_{\hat{S}} + Zv + \epsilon, \tag{3.18}$$
we apply Henderson’s Method III (Henderson, 1953) to form estimates $\hat\sigma^2$ and $\hat\tau^2$. Henderson’s Method III is particularly tractable theoretically and enables us to study consistency in the scenario where (3.18) is actually misspecified, i.e., $S \not\subseteq \hat{S}$. For a discussion of Henderson’s methods and the appeals of Method III, see Searle (1968).
In recent years Henderson’s methods have largely been supplanted by alternatives such as restricted maximum likelihood (REML) for variance component estimation (Harville, 1977); it is customary to refer to variances of random effects as variance components. We thus provide a brief overview of what Henderson’s Method III entails. Consider, first, the low-dimensional model (2.3) with p < N. To simplify the notation in the following explanation, we momentarily define . By not distinguishing between fixed and random effects, the idea behind Henderson’s methods is to match the differences in the reductions in the sum-of-squares between sub-models of (2.3) to its expected value, not unlike a method-of-moments approach. To elaborate, in fitting (2.3) to data y, the reduction in the sum of squares is
| (3.19) |
Likewise, the decrease in the sum of squares due to fitting the reduced model y = Xβ + ϵ is
| (3.20) |
The expected difference in the reductions is
| (3.21) |
Moreover,
| (3.22) |
Together, (3.21) and (3.22), when matching theoretical expectations to empirical averages, form a triangular system of linear equations, from which we derive and . We find
| (3.23) |
| (3.24) |
It is straightforward to see that the and generated from (3.23) and (3.24) are unbiased, presuming that the true model is y = Xβ + Zv + ϵ. For consistency, some additional assumptions are needed, which we will discuss later in this section.
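The moment-matching idea behind Henderson’s Method III can be sketched for the low-dimensional model with Ψ* = τ*2Iq×q. Under that model, the expected residual sum of squares from the full fit is σ*2(N − rank[X Z]), and the expected difference in reductions is τ*2 tr(ZT(I − PX)Z) + σ*2(rank[X Z] − rank X); solving these two equations gives the estimators below. The function names, the use of least squares to compute projections, and the simulated design are assumptions made for this illustration.

```python
import numpy as np

def _fitted(A, y):
    """Projection of y (vector or matrix) onto the column space of A via least squares."""
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return A @ coef

def henderson3(X, Z, y):
    """Method-of-moments estimates of (sigma^2, tau^2) for y = X b + Z v + e, Cov(v) = tau^2 I."""
    N = len(y)
    XZ = np.hstack([X, Z])
    r_x = np.linalg.matrix_rank(X)
    r_xz = np.linalg.matrix_rank(XZ)
    R_full = y @ _fitted(XZ, y)               # reduction in sum of squares, full model
    R_red = y @ _fitted(X, y)                 # reduction in sum of squares, fixed effects only
    sigma2_hat = (y @ y - R_full) / (N - r_xz)
    Z_resid = Z - _fitted(X, Z)               # (I - P_X) Z
    trace_term = np.sum(Z_resid ** 2)         # tr(Z^T (I - P_X) Z)
    tau2_hat = (R_full - R_red - sigma2_hat * (r_xz - r_x)) / trace_term
    return sigma2_hat, tau2_hat

# Random-intercept example with many groups (illustrative values).
rng = np.random.default_rng(0)
M, n, p = 200, 6, 3
N = M * n
X = rng.normal(size=(N, p))
Z = np.kron(np.eye(M), np.ones((n, 1)))
sigma2, tau2 = 0.25, 1.0
y = (X @ np.array([1.0, -1.0, 0.5])
     + Z @ rng.normal(scale=np.sqrt(tau2), size=M)
     + rng.normal(scale=np.sqrt(sigma2), size=N))
sigma2_hat, tau2_hat = henderson3(X, Z, y)
```

With n fixed, the precision of the τ2 estimate is driven by the number of groups M rather than by N, which is consistent with the asymptotic regime considered in this section.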
Returning to our two-step procedure and high-dimensional setup, Step 1 identifies a candidate low-dimensional sub-model, which is used in Step 2 to obtain variance component estimates. We do not require the candidate model to encompass the truth; however, λL should be such that $\hat{S}$, from Step 1, reliably captures the indices of the ‘strong’ signals in β*. The idea is that missing ‘weak’ signals only negligibly affect the accuracy of $\hat\sigma^2$ and $\hat\tau^2$ in Step 2. We now show that this two-step procedure yields consistent estimators $\hat\sigma^2$ and $\hat\tau^2$ in the setting where N → ∞ (specifically, n is fixed, but the number of groups M → ∞) and d2 log p/M = o(1), provided some additional technical assumptions hold. From here on, this will also be the scaling assumed for Theorem 4, as well as Corollary 5. We first present the assumptions necessary for consistency and then formally state the theorem.
For ξ > 1, define the cone
| (3.25) |
Assumption 1. For some constant ξ > 1,
| (3.26) |
with the sign-restricted version of (3.25).
The quantity ζ in (3.26) is defined more generally in Ye and Zhang (2010), where it is termed a sign-restricted cone invertibility factor (SCIF). We have the following lemma.
Lemma 6. Suppose Assumption 1 holds, and let λL be defined by (A.2) (or 3.28) for some small ε > 0 and ξ as in Assumption 1. If u* ≤ λL(ξ − 1)/(ξ + 1), then
| (3.27) |
In the proof of Lemma 6 (provided in the Appendix), SCIF naturally appears when deriving an upper bound for . Lemma 6 assumes that Assumption 1 is satisfied, and that λL in Step 1 is chosen such that
| (3.28) |
with ξ as in Assumption 1. It then establishes that
with probability exceeding 1 – ε, where ε > 0 can be taken arbitrarily small. A direct implication is that if the lemma’s conditions are satisfied, only includes indices corresponding to ‘weak’ signals in β* of magnitude less than 4ξλL/ζ(ξ + 1) = o(1) with close to certainty, which is part of what Step 1 sets out to achieve.
Assumption 2. There exists an integer N′ ≲ d such that for the same constant ξ > 1 as in Assumption 1,
| (3.29) |
where
| (3.30) |
and is the sparse upper eigenvalue of models disjoint with S.
Assumption 2 is needed to control the number of false positive selections in from Step 1. In particular, we have
Lemma 7. Suppose that Assumption 2 holds, and λL is defined according to (3.28). In the event that u* ≤ λL(ξ − 1)/(ξ + 1), .
Put simply, Lemma 7 claims that under Assumption 2 and our choice of λL from (3.28), the total number of false selections in Step 1 is bounded by N′, with probability exceeding 1 – ε. The proof is provided in the Appendix.
Assumption 3. Let be formed by joining any N′ columns in X with to the d support columns in X. For the same N′ as in Assumption 2,
| (3.31) |
| (3.32) |
and the qM singular values of , satisfy
| (3.33) |
By (3.31) in Assumption 3, the fixed data matrix Z has full column rank, and no column vector of Z can be represented as a linear combination of the column vectors of any ‘feasible’ assuming that λL is chosen according to (3.28). After all, N′ + d is the upper bound on the number of selected fixed effects with probability exceeding 1 – ε (Lemma 7). Additionally, by (3.32), the sum of the squared perpendicular distances between each column vector in Z and its projection onto the linear subspace spanned by the column vectors of feasible matrices is at least on the order of qM (substantial, given there are qM columns in Z). The latter half of Assumption 3 requires all columns of are ‘close’ to being linearly independent from one another and ‘contribute equally’ to its rank. In particular, note that (3.33) is satisfied if
| (3.34) |
It is thus clear that (3.31) and (3.32) imply that random effects must not be confounded with any ‘feasible’ set of fixed effects (from Step 1) while (3.33) implies that the random effects are not confounded from one another. Analogous conditions were shown to be necessary to prove consistency of REML estimators in Jiang (1996).
Assumption 4. For any j ∈ S such that , with λL defined as in (3.28), . Here, is the eigen-decomposition of (defined for this Assumption) with , where is formed by joining any N′ columns in X with to the d – 1 support (excluding j) columns in X. The N′ referenced here is the same as in Assumptions 2 and 3.
Assumption 4 requires that covariates corresponding to weak (but non-zero) signals in β* (for which we cannot quantify a bound on the probability they are to be included in ) are not too strongly correlated to covariates in nor covariates associated with the random effects. This somewhat resembles the irrepresentability conditions needed for model selection consistency in Lasso–see, e.g., Zhao and Yu (2006). However, the two assumptions are very different: Aside from differences in the quantities involved, a key difference is that the irrepresentability condition requires a very stringent upper bound on non-confounding between fixed effects, whereas Assumption 4 only requires boundedness. As shown in the numerical experiments in the Appendix, as the number of covariates and sparsity of the model vary, Assumption 4 is very likely to be satisfied with even small bounds, whereas the irrepresentability condition is increasingly less likely to hold.
We can now state our main result on consistency of variance component estimators, which validates our two-step procedure.
Theorem 8. Consider N, p → ∞ with n fixed, M → ∞. Furthermore, suppose p → ∞ with d2q log p/M = o(1). Suppose Assumptions 1-4 are satisfied and λL is chosen according to (3.28) with ε ∝ 1/p. Then, $\hat\sigma^2$ and $\hat\tau^2$ are consistent for σ*2 and τ*2, respectively, i.e.,
| (3.35) |
Because $|\hat\sigma^2 - \sigma^{*2}|$ and $|\hat\tau^2 - \tau^{*2}|$ are both oP(1), we can use $\hat\sigma^2$ and $\hat\tau^2$ as plug-in values for σ*2 and τ*2, respectively. From there, we can form a consistent estimator of Ω* and normalizing constants κj.
For practical applications, REML can be used as a substitute for Henderson’s Method III for Step 2. Theory for REML would be a possible avenue for further explorations.
3.3. An initial estimator for β* and our choice of Cj
To form , we consider the ordinary least-squares (OLS) fit restricted to , i.e.,
| (3.36) |
We proceed to demonstrate that the error is o(1) in ℓ1 norm.
Assumption 5. For the same N′ as in Assumptions 2, 3, 4, the sparse lower eigenvalue for models containing S of cardinality smaller than d + N′ is constant and greater than 0,
Assumption 5, in conjunction with previous assumptions and choice of λL (3.28), can be used to control the ℓ1 norm of the estimation error .
Theorem 9. Suppose Assumptions 1-5 hold. Under the same conditions as in Theorem 8, for some universal constant C > 0,
| (3.37) |
with probability converging to 1 as N, p → ∞.
Theorem 9 implies that we have the following crude bound, based on Hölder’s inequality,
| (3.38) |
The following corollary is a direct consequence of the crude bound (3.38).
Corollary 10. Suppose the conditions in Theorem 9 are satisfied, and that d, the sparsity of β*, satisfies d ≤ C−1 (M/(q log p))η, with C as in Theorem 9 and η ∈ (0, 1/2). Then,
| (3.39) |
4. Numerical experiments
4.1. A practical choice for λL
In practical applications, we run into the issue of not being able to set λL according to (3.28), as it involves knowing τ* and σ*. However, we can derive a (slightly ad-hoc) approximation of what λL should be. Upon closer examination of the proof of Lemma 11, we can substitute the term σ*2 + τ*2qn with νmax(V(σ*, τ*)) = σ*2 + τ*2νmax(ZTZ). The latter can be approximated according to the following procedure, assuming that the ratio τ*/σ* is not too small:
-
Apply scaled lasso (Sun and Zhang, 2012) to obtain an initial ‘average’ noise estimate. The solution to the scaled lasso problem is characterized by
(4.1) with .
Take with
| (4.2) |
We provide a heuristic justification. Ignoring the finer details involved in the theory, for the scaled lasso, serves as a good approximation for , where we have defined ϵ* = y – Xβ*. In linear models, ϵ* holds i.i.d. observations drawn from a N(0, σ*2) distribution. By the law of large numbers, converges to σ*2 for large N. Under a heteroskedastic error model, with ϵ* independent and , we can match to its expectation, which is given by , so can be used to approximate the ‘average’ noise level. If ϵ* ~ N (0, V(σ*, τ*)), then using a similar expectation matching argument, we can expect to act as a surrogate for
| (4.3) |
which follows from the fact that ∥Γϵ*∥2 = ∥ϵ*∥2 for any N × N orthogonal matrix Γ (overloading Γ from (3.5)). What we actually need is σ*2 + τ*2νmax(ZTZ). Then in the scenario where ratio τ*2/σ*2 is not too small, ρZ from (4.2) should give us a choice of λL that is close to the desired one from (3.28). Our choice of λL is constructed according to the above procedure for all subsequent numerical experiments.
4.2. A look into p-values
Denote the ‘unblocked’ version of Z as Zu; i.e., Zu is an N × q matrix formed by row-wise concatenating the M diagonal blocks in Z. We generate data from model (2.1) according to the following schemes, setting M = 25 and n = 6:
-
(M1) For p ∈ {300, 600}, q ∈ {1, 2}, we construct [X Zu] from N i.i.d. realizations from a distribution with Φ* = {ϖjk} a (p + q) × (p + q) matrix with . X and Z (the ‘blocked’ version) are then normalized such that and for all j. For b ∈ {0.5, 1}, we set the p-dimensional vector of fixed regression coefficients to
where, the first d ∈ {5, 10} entries of β are nonzero. The variance parameters σ* and τ* are set to 0.5 and 1 respectively.
(M2) Same as (M1) except with Φ* = I(p+q)×(p+q).
The numerical experiments are set up similarly to those in Bühlmann (2013) and Schelldorfer et al. (2011). We set the ridge penalty parameter λ to 1/N for all experiments. Additionally, we set Cj according to Corollary 10 with η = 0.005.
We first consider null hypotheses of the form
| (4.4) |
We consider decision rules based on a significance level α = 0.05, i.e., we reject H0,j if the event Ej = {ϱj ≤ 0.05} occurs, where ϱj is as defined in (3.16). Following Bühlmann (2013), we evaluate the performance of the tests based on the type I error, averaged over the non-support indices,
| (4.5) |
and the power, averaged over the support indices,
| (4.6) |
where denotes the empirical probability over 1000 simulations. The results, presented in Figure 1, suggest that type I error is well-controlled for all combinations of p, q, b and d for the two different models. Power is high in most scenarios, but appears to vary with the aforementioned quantities, noticeably decreasing with b. However, this is to be expected.
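The two summary measures can be computed from a matrix of simulated p-values as sketched below: rejection frequencies are first estimated per coefficient across simulations, and then averaged over the non-support (type I error) and the support (power). The function name and the toy p-values are hypothetical.

```python
import numpy as np

def average_error_rates(pvals, support, alpha=0.05):
    """pvals: array of shape (n_sim, p); support: indices of non-zero coefficients."""
    reject_freq = (pvals <= alpha).mean(axis=0)   # empirical rejection probability per j
    in_support = np.zeros(pvals.shape[1], dtype=bool)
    in_support[list(support)] = True
    type1 = reject_freq[~in_support].mean()       # averaged over non-support indices
    power = reject_freq[in_support].mean()        # averaged over support indices
    return type1, power

# Toy example: 2 simulations, 3 coefficients, only the first is a true signal.
pvals = np.array([[0.01, 0.20, 0.50],
                  [0.03, 0.04, 0.90]])
type1, power = average_error_rates(pvals, support=[0])
```

In the toy example the second coefficient is falsely rejected in one of two runs, so the average type I error over the two null coefficients is 0.25.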
Figure 1:
Average power vs. average type I error for testing individual coefficients under the two models for different combinations of p, q, b and d.
We also consider null hypotheses of the form
| (4.7) |
with G taken either to be {1,…,100} (G1), or {101,…,200} (G2). By construction, the hypothesis H0,G1 should be accepted while H0,G2 rejected. We consider decision rules based on a significance level α = 0.05 and reject H0,G if the event EG = {ϱG ≤ 0.05} occurs, with ϱG defined in (3.17). To evaluate the performance of these tests, we consider type I error and power, which can be represented by and , respectively, where again, denotes the empirical probability over 1000 simulations. Figure 2 visualizes the results.
Figure 2:
Average power vs. average type I error for testing groups of coefficients under the two models for different combinations of p, q, b and d.
4.3. Comparisons with existing methods
In this section, we conduct a short numerical example to examine whether one could ‘naively’ apply inferential procedures for high-dimensional linear models to obtain inference for parameters in mixed models.
Consider Model (M1) from Section 4.2 in the instance of p = 300 and q = 1. Let β* = [0.05, 2, 4, 3, 0.1, 0,…, 0]. We compare our method against
ridge-based inference procedure of Bühlmann (2013), which is an analogue of our method developed for high-dimensional linear models;
lasso-based inference procedure of van de Geer et al. (2014), which entails de-sparsifying a lasso estimator.
The differences are fairly evident when comparing confidence interval coverage. For any α ∈ (0, 1), define as the α-th quantile of the distribution of Wj. Under the conditions of Theorem 4, if the assumed model is correct, (3.11) suggests that confidence intervals of the form
should guarantee coverage of at least 100(1 – α)%. Rather than setting Cj according to Corollary 10, we set them to be the same as the ‘Cj-analogues’ from Bühlmann (2013), to make the two methods comparable. Our choices of Cj are larger than theirs, so if anything, this ad-hoc decision gives the method of Bühlmann (2013) an unfair advantage. In Figure 3, we examine 95% confidence interval coverage for the three methods, based on the above modifications.
Figure 3:
Confidence interval coverage for $\beta_j^*$, j = 1,…,p; target coverage is 95% (with 1000 simulations, the standard deviation is ~0.69%). Here, lmm refers to our method; ridge to the method of Bühlmann (2013); and lasso to the method of van de Geer et al. (2014).
Overall, our method, which accounts for random effects, performs best at attaining the target guaranteed coverage across all $\beta_j^*$, compared to the methods proposed in Bühlmann (2013) and van de Geer et al. (2014). While the method of Bühlmann (2013) does come close, coverage falls short at 16 indices: minimum coverage achieved was 92.9% (with 1000 simulations, this is a statistically significant difference from 0.95). At first glance it appears that the lasso-based method from van de Geer et al. (2014) performs quite well; however, a closer examination of the results reveals otherwise. Specifically, the lasso-based method does very poorly over some of the active coefficients, as made evident in Table 1.
Table 1:
Confidence interval coverage for the five signal coefficients, j = 1,…, 5; target coverage is 95%.
| j | Our method | Bühlmann (2013) | van de Geer et al. (2014) |
|---|---|---|---|
| 1 | 0.977 | 0.974 | 0.994 |
| 2 | 0.973 | 0.963 | 0.865 |
| 3 | 0.969 | 0.971 | 0.782 |
| 4 | 0.971 | 0.972 | 0.886 |
| 5 | 0.983 | 0.993 | 1.000 |
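The Monte Carlo standard deviation quoted in the caption of Figure 3 follows from the binomial variance of an empirical coverage proportion. The short Python check below is a sketch under that assumption (independent simulation replicates); the function name `mc_standard_error` is ours and does not come from the paper.

```python
import math


def mc_standard_error(coverage: float, n_sim: int) -> float:
    """Standard deviation of an empirical coverage proportion
    computed from n_sim independent simulation replicates."""
    return math.sqrt(coverage * (1.0 - coverage) / n_sim)


target = 0.95   # nominal coverage
n_sim = 1000    # number of simulation replicates reported in the text
se = mc_standard_error(target, n_sim)

# Standardized gap between the lowest coverage reported for the
# ridge method of Buhlmann (2013), 92.9%, and the 95% target.
z_ridge_min = (0.929 - target) / se

print(f"standard error: {se:.4%}")
print(f"z-score of 92.9% coverage: {z_ridge_min:.2f}")
```

The standard error comes out at roughly 0.69%, and a coverage of 92.9% sits about three standard errors below the target, which is why the text treats that shortfall as statistically significant.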
5. An application to riboflavin production data
In this section, we apply our proposed methodology to data on riboflavin (vitamin B2) production by Bacillus subtilis. The data are publicly available from Bühlmann et al. (2014); the original data were provided by DSM (Switzerland). The dataset, referenced as riboflavinGrouped, has M = 28 specimens measured at two to six time points, resulting in N = 111 observations in total. For each specimen at each time point, we record a single real-valued response variable, the log-transformed riboflavin production rate, as well as the expression levels of p = 4088 genes. We are interested in identifying which genes are significantly associated with riboflavin production.
To account for correlations induced by repeated measurements, a natural model to consider is the random intercept model, in which we assume that
$$y_m = X_m \beta^* + v_m \mathbf{1}_{n_m} + \epsilon_m, \qquad m = 1, \dots, M, \tag{5.1}$$
with vm, m = 1,…, M i.i.d. with vm ~ N(0, τ*2), and ϵm, m = 1,…, M, independent with ϵm ~ N(0,σ*2Inm×nm), and generated independently of v1,…, vM. Note that (5.1) can be represented by (2.1) with the Zm’s taken to be column vectors of 1s of lengths nm. Most of the theoretical results assume the nm’s are equal, but it is straightforward to show the results hold so long as the nm’s are of the same order of magnitude, as they are here.
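To make the correlation structure of the random intercept model concrete, the sketch below builds the within-group covariance σ²I + τ²11ᵀ implied by taking Zm to be a column of 1s, and checks its eigen-structure directly. The variance components and group sizes are hypothetical values chosen for illustration (the group sizes only respect the two-to-six range stated above); the helper names are ours.

```python
def random_intercept_cov(n_m, sigma2, tau2):
    """Covariance of the n_m responses in one group under a random
    intercept: sigma2 * I + tau2 * 1 1^T."""
    return [[tau2 + (sigma2 if i == j else 0.0) for j in range(n_m)]
            for i in range(n_m)]


def matvec(A, x):
    """Plain matrix-vector product for lists of lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


sigma2, tau2 = 1.0, 0.5    # hypothetical variance components
group_sizes = [2, 4, 6]    # hypothetical group sizes n_m

top_eigs = []
for n_m in group_sizes:
    V = random_intercept_cov(n_m, sigma2, tau2)

    # The all-ones vector is an eigenvector with eigenvalue sigma2 + n_m * tau2.
    ones = [1.0] * n_m
    assert all(abs(v - (sigma2 + n_m * tau2)) < 1e-12 for v in matvec(V, ones))

    # Any within-group contrast, e.g. e_1 - e_2, has eigenvalue sigma2.
    contrast = [1.0, -1.0] + [0.0] * (n_m - 2)
    assert all(abs(v - sigma2 * c) < 1e-12
               for v, c in zip(matvec(V, contrast), contrast))

    top_eigs.append(sigma2 + n_m * tau2)

print(top_eigs)
```

The largest eigenvalue of each block is σ² + nm τ², so observations within a specimen become more strongly tied together as the number of time points grows; this is the correlation that methods ignoring the random effect fail to account for.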
We apply our proposed framework and compute the marginal p-values for testing the significance of each fixed-effect coefficient. Controlling the family-wise error rate (FWER) at 5%, via a simple Bonferroni correction, we find a single significant gene in riboflavin production: YXLD-at. This result matches previous findings by Javanmard and Montanari (2014) and Meinshausen et al. (2009) using a homogeneous dataset with N = 71 samples provided by the same source (riboflavin in Bühlmann et al. (2014)). Like us, Meinshausen et al. (2009) make a single discovery, YXLD-at, while Javanmard and Montanari (2014) also label YXLE-at as significant. The method of Bühlmann (2013), on the other hand, makes no discoveries.
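For reference, the Bonferroni step used above can be sketched in a few lines of Python. The p-values below are hypothetical placeholders, not the values obtained on the riboflavin data; only the number of tests, p = 4088, and the 5% level are taken from the text.

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction controlling the family-wise error rate at alpha.

    Returns the per-test threshold alpha / m, the indices of rejected
    hypotheses, and the adjusted p-values min(1, m * p)."""
    m = len(p_values)
    threshold = alpha / m
    rejected = [j for j, pv in enumerate(p_values) if pv <= threshold]
    adjusted = [min(1.0, pv * m) for pv in p_values]
    return threshold, rejected, adjusted


n_genes = 4088  # number of genes, as in the riboflavin data

# Hypothetical p-values for illustration only.
p_values = [0.5] * n_genes
p_values[10] = 1e-7   # well below the per-test threshold
p_values[20] = 1e-4   # small, but not small enough after correction

threshold, rejected, adjusted = bonferroni(p_values, alpha=0.05)

print(f"per-test threshold: {threshold:.3e}")
print(f"rejected indices: {rejected}")
```

With 4088 tests the per-test threshold is 0.05/4088, about 1.2 × 10⁻⁵, so a marginal p-value of 10⁻⁴ that would look convincing in isolation is not declared significant.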
6. Discussion
We presented a new framework for constructing asymptotically valid p-values and confidence intervals for the fixed effects in high-dimensional linear mixed effect models. It entails de-biasing a ‘naive’ ridge estimator, whose asymptotic distribution we can approximate sufficiently well if the number of independent groups of observations M scales at least with d2q log p. Simulation studies in high-dimensional settings suggest that our method provides good control of type-I error. It also provides good results for a riboflavin dataset with group structure, where we confirmed results obtained in earlier work based on a homogeneous dataset from the same source (Javanmard and Montanari, 2014; Meinshausen et al., 2009).
Several extensions to our methodology would be of interest for future work. First, our proposal for selecting the tuning parameter λL relies on the assumption that τ*2/σ*2 is not too small. Although it appears to work well in practice, one could also consider an iterative scheme that repeatedly updates λL based on the resultant estimates of σ*2 and τ*2: this can be readily implemented in practice but may be difficult to validate theoretically. Second, here we required the number of random effects q to be quite small (treated as constant in the theory). This assumption can be relaxed by, e.g., taking Ψ* to be a general diagonal matrix, i.e., , and assuming that a small number of are nonzero, i.e., cardinality of is small, less than n. Then, instead of screening for fixed effects in Step 1, we can screen for both fixed and random effects by incorporating a double penalization scheme as in Li et al. (2018). This way, in Step 2, both and are small, and we can apply Henderson’s method III as before.
A few other details should also be discussed for completeness. First, multiple testing can be handled using the Westfall-Young procedure of Bühlmann (2013). This multiple testing adjustment, which strongly controls the family-wise error rate, can directly be used in conjunction with our method for generating p-values for the individual hypothesis tests. Second, the ridge-based framework of Bühlmann (2013), which is a basis for our method, is known to not have optimal power. Bühlmann (2013) shows that the detection rate may be larger than N−1/2, whereas, under certain conditions, the detection limit for the de-biased lasso approach of Zhang and Zhang (2014) is in the N−1/2 range. A possible extension of our work is to build a lasso-based inferential framework for high-dimensional linear mixed effect models. In fact, as suggested in the Introduction, our methods can be adapted to other high-dimensional estimators; and ridge is just an example. From van de Geer et al. (2014), we can obtain asymptotically optimal inference for linear fixed effect models—i.e., for y = Xβ* + ϵ with N observations and ϵi i.i.d N(0, σ*2)—by leveraging the fact that the Lasso estimator with non-negative penalty parameter λ, , can be rewritten as
by inverting the KKT conditions, with arising from the subdifferential of ∥β∥1. Taking to be a reasonably good approximation of an inverse of , the Δ term becomes asymptotically negligible, and we can use the normality of ϵ to develop asymptotically valid tests and confidence intervals for β*. (The scaled lasso furnishes a consistent estimator of σ*2.) Extending this approach to the linear mixed-effect setup (per Section 2) requires meeting the challenge that the ϵi are no longer i.i.d., which could be addressed using the methods of Section 3.2.
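The KKT conditions referred to here can be checked numerically on a small example. The sketch below assumes the lasso objective is scaled as (1/(2N))∥y − Xβ∥² + λ∥β∥1 (the scaling is our assumption, since the objective is not restated in this section), solves it with plain coordinate descent in pure Python, and verifies that Xᵀ(y − Xβ̂)/N equals λ·sign(β̂j) on the active set and is bounded by λ in absolute value elsewhere. The solver, data, and function names are illustrative and are not the authors' implementation.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_sweeps=500):
    """Coordinate descent for (1/(2N)) * ||y - X b||^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X beta, kept up to date
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            # Correlation of column j with the partial residual (beta_j removed).
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            step = new_bj - beta[j]
            if step != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * step
                beta[j] = new_bj
    return beta


def score(X, y, beta):
    """X^T (y - X beta) / N, the quantity constrained by the KKT conditions."""
    n, p = len(X), len(X[0])
    resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
    return [sum(X[i][j] * resid[i] for i in range(n)) / n for j in range(p)]


def kkt_violation(beta, g, lam):
    """Largest departure from the lasso KKT conditions."""
    worst = 0.0
    for bj, gj in zip(beta, g):
        if bj != 0.0:
            sign = 1.0 if bj > 0 else -1.0
            worst = max(worst, abs(gj - lam * sign))   # equality on the active set
        else:
            worst = max(worst, max(0.0, abs(gj) - lam))  # bound off the active set
    return worst


# Hypothetical toy data: two active coefficients out of eight.
rng = random.Random(0)
n, p, lam = 40, 8, 0.1
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
beta_true = [2.0, -1.5] + [0.0] * (p - 2)
y = [sum(X[i][j] * beta_true[j] for j in range(p)) + rng.gauss(0.0, 0.5)
     for i in range(n)]

beta_hat = lasso_cd(X, y, lam)
violation = kkt_violation(beta_hat, score(X, y, beta_hat), lam)
print(f"largest KKT violation: {violation:.2e}")
```

The equality on the active set is what makes the lasso estimate biased by a term of order λ, which is the term that de-sparsifying procedures aim to remove.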
A. Appendix
A.1. Proof of Results in Section 3
A.1.1. Proof of Lemma 1
It is straightforward to show that Ω* can be lower bounded as
for some c satisfying 0 < c < νmin (V(σ*2, τ*2)). Since σ*2 is positive, νmin (V(σ*2, τ*2)) > 0. Note that can alternatively be written as
which, in turn, implies that
and the claim follows.
A.1.2. Proof of Proposition 2
This was proven in Shao and Deng (2012) (see proof of their Theorem 1). Define Γ = [Γ′ (Γ)⊥]; Γ′ is orthogonal, i.e., Γ′TΓ′ = Γ′Γ′T = Ip×p . By definition (3.2), we have
Observing that the diagonal entries of D are positive, one obtains
| (A.1) |
which, combined with the fact that ΓTΓ = IR×R, yields
as desired. The bound on the variance follows directly from (3.3).
A.2. Proof of Theorems 4, 8 and 9
We first establish Theorem 4, which follows directly from Proposition 2.
A.2.1. Proof of Theorem 4
It follows from Proposition 2 that
which, due to our choice of the ridge penalty parameter λ > 0 in (3.12), is o(1) as N, p → ∞. The claim now follows from (3.10) and the assumption given by (3.13).
Because there is an overlap in the lemmas used to prove Theorems 8 and 9, we present them together. Define u* = ∥XT(y – Xβ*)∥∞/N.
Lemma 11. Let
| (A.2) |
Under the model given by (2.3), the event u* ≤ λL(ξ – 1)/(ξ + 1) occurs with probability greater than 1 – ε.
Proof. Define . Then u* = maxj∣uj∣. Under model (2.3), we observe that,
It follows from the Gaussianity of uj (in fact, sub-Gaussianity would suffice) that
| (A.3) |
The second inequality follows from the fact that the columns of X are standardized such that . For the third inequality, recall that the columns of Z are standardized such that , which implies that the largest eigenvalue of V(σ*2, τ*2), the true covariance of y, satisfies νmax(V(σ*2, τ*2)) ≤ σ*2 + τ*2qn. The third inequality in (A.3) is obtained by plugging in our choice of λL (3.28). Employing a union bound, we then have
This is our desired result.
A.2.2. Proof of Lemma 6
We use arguments similar to those employed in the proof of Theorem 3 in Ye and Zhang (2010). Suppose that u* ≤ λL. Define . The Karush-Kuhn-Tucker (KKT) optimality conditions for the Lasso are given by
With some rearrangement, the KKT conditions can be rewritten as
| (A.4) |
with and if and ιj ∈ [−1, +1] otherwise: the subdifferential which arises from ∥β∥1. Rearranging (A.4) and observing that for j ∉ S yields
for all vectors h’ with . If we take h’ = h, one can see that :
On the other hand, setting h′ to be any vector so that for some j ∈ Sc, and 0 elsewhere gives
which implies that . The KKT conditions (A.4) also tell us that
which, when combined with the definition of ζ (3.26) yields
which is the desired result. In the event that u* ≤ λL(ξ – 1)/(ξ + 1), we have
A.2.3. Proof of Lemma 7
The proof is adapted from that of Theorem 3 in Sun and Zhang (2012). By construction, satisfies the KKT conditions from (A.4) which implies that
For , such that , the previous inequality implies
| (A.5) |
Going back to the KKT conditions in (A.4), we have, for arbitrary ,
which, when combined with the fact that
gives the inequality
| (A.6) |
Thus, h lies in the cone in (3.25) in the event that u* ≤ λL(ξ – 1)/(ξ + 1) (by noting that the left-hand side is lower bounded by 0). By definition of κ(ξ, S) from (3.30),
which, when combined with (A.5) implies
by Assumption 2.
A.2.4. Proof of Theorem 8
Suppose that u* ≤ λL(ξ − 1)/(ξ + 1). Then by Lemmas 6 and 7 and the referenced assumptions within, we have
| (A.7) |
| (A.8) |
Denote . Under candidate model (3.18), our variance component estimators (via Henderson’s Method III) are given by
Consider the more interesting scenario where . If S is contained within , then it is straightforward to show that variance component estimators are consistent (the true model is a submodel of the proposed one). We first prove, under the given assumptions, that . Write , ‘O’ for omitted. Then,
| (A.9) |
We proceed to show that the three parts to (A.9) satisfy
| (A.10) |
| (A.11) |
| (A.12) |
which would suggest that is indeed consistent for σ*2. Note that the term in (A.12) is .
- Proving (A.11): Let represent the eigendecomposition of . We note that the latter is idempotent, implying that the diagonal matrix , which is of rank , has only 0 and 1s as its eigenvalues. It is straightforward to show that
Statement (A.11) then follows from Chebyshev’s inequality.
- Proving (A.10): Let , , be i.i.d random variables following a χ2 distribution with 1 degree of freedom. Observe that,
| (A.13) |
and
| (A.14) |
Here, (A.14) follows from Lemma 7 and the fact that N ≿ N′ + d + qM, which implies that as M → ∞. Applying the Strong Law of Large Numbers (SLLN) for i.i.d. random variables, we arrive at (A.10).
- Proving (A.12): We observe that
and we have completed our proof that is consistent under the stated assumptions.
We now demonstrate that the same claim holds for . Expanding out y, we obtain, after some algebraic manipulation,
| (A.15) |
where we have defined . We set out to prove that the terms in (A.15) satisfy
| (A.16) |
| (A.17) |
| (A.18) |
| (A.19) |
| (A.20) |
and
| (A.21) |
Let represent the singular value decomposition of Z′, with QZ′ and ΓZ′ of dimensions N × qM and qM × qM, respectively. Additionally, write the eigendecompositions of and as and , respectively. Note that (A.19) and (A.21) make up .
To avoid repetition, some of the proofs are presented in abbreviated form.
- Proving (A.16): Clearly,
following from (3.32) in Assumption 3 and Assumption 4.
- Proving (A.17): Orthogonality of implies that , so
and, using properties of quadratic forms, we have
the latter relation following from (3.33) in Assumption 3. This proves (A.17).
- Proving (A.18): Proof is similar to that of (A.16), as we note that
having applied Assumptions 3 and 4 here.
- Proving (A.19): As for (A.17), using again properties of quadratic forms, we can show that
the last relation the result of (3.33) from Assumption 3.
- Proving (A.20): We can rewrite
where Bi, i = 1,… rank(Z′) are random variables formed as the product of two independent N(0, 1) random variables. Then,
which implies that
and since by (3.32) in Assumption 3, we have proven our claim (A.20).
- Proving (A.21): By the definition of , it is clear that
where the second last relation follows from the proven claim that is o(1), (3.32) in Assumption 3 and Assumption 4.
Since the event u* ≤ λL(ξ − 1)/(ξ + 1) occurs with probability greater than 1 – 1/p → 1 as p → ∞,
as claimed.
A.2.5. Proof of Theorem 9
Suppose that u* ≤ λL(ξ − 1)/(ξ + 1), and write . The OLS fit has a simple closed-form expression:
and . Thus, by the triangle inequality,
| (A.22) |
We proceed by first bounding the first term on the right-hand side of (A.22). By Assumption 5,
This, in turn, implies that
where the last relation follows from Assumption 2.
We proceed to bound the second component on the right-hand side of (A.22). We observe that
From Lemma 6,
which implies that
By Lemma 11, the event u* ≤ λL(ξ − 1)/(ξ + 1) occurs with probability exceeding 1 – 1/p. Combined, we obtain the desired result.
A.3. Empirical Evaluation of Assumption 4
To assess the stringency of Assumption 4 compared to the irrepresentability condition (Zhao and Yu, 2006), we conduct a simulation study similar to Zhao and Yu (2006), but customized to our mixed linear model setting.
Consider the model
with and , , and . We consider q = 2 random effects, M = 25 groups, and n = 20 samples within each group. Among the p fixed effect covariates, we set d to have nonzero coefficients and the rest to have zero coefficients. More specifically, we set .
To assess the stringency of the two assumptions, in each of B = 1000 simulation replications, we randomly generate design matrices X and Z jointly as [X, Zu] ~iid N(0, Σ), where Zu is the un-blocked version of Z. The covariance matrix Σ is generated from a Wishart(p+q, Ip+q) distribution, and X and Z are scaled such that and . Following Zhao and Yu (2006), we consider p = 2^k for k ∈ {3, 4,…, 8}, and set d = tp/8 for t ∈ {1, 2,…, 8}.
Let A* be the index of the true active set (hence ∣A*∣ = d). Further, let S ⊂ (A*)c with ∣S∣ = min(p – d, p) be a random subset of variables with zero coefficients. For j ∈ A*, let be the augmented design matrix, and denote .
With the above notations, the irrepresentability condition is satisfied if TIR = maxj∈A* TIR,j < 1, where
Assumption 4 involves a related quantity, , where is the eigen-decomposition of . However, this assumption is satisfied if T4 ≡ maxj∈A* T4,j = O(1). Thus, to satisfy Assumption 4, we need a constant C, not dependent on N and p, such that T4 < C. While C can be any large but fixed constant, in this simulation we consider a moderate value of C = 5.
The proportions of simulated data sets in which the irrepresentability condition and Assumption 4 are satisfied are shown in Tables 2 and 3, respectively. As in Zhao and Yu (2006), the results in Table 2 indicate that the irrepresentability condition can be stringent, especially as the dimension p and the number of nonzero coefficients d increase. In contrast, the results in Table 3 suggest that, for C = 5, Assumption 4 is much more likely to hold. Moreover, the proportion of cases for which this assumption holds varies only mildly with p and d, remaining above 0.92 throughout Table 3. While the appropriate choice of C is generally unknown, the results in this simulation suggest that even with moderate values (in this case C = 5) Assumption 4 is likely satisfied.
Table 2:
Proportions of cases satisfying the irrepresentability condition (Zhao and Yu, 2006).
| TIR < 1 | p = 8 | p = 16 | p = 32 | p = 64 | p = 128 | p = 256 |
|---|---|---|---|---|---|---|
| d = p/8 | 1 | 1 | 0.975 | 0.823 | 0.327 | 0.022 |
| d = 2p/8 | 1 | 0.735 | 0.283 | 0.013 | 0 | 0 |
| d = 3p/8 | 0.736 | 0.231 | 0.014 | 0 | 0 | 0 |
| d = 4p/8 | 0.356 | 0.062 | 0.001 | 0 | 0 | 0 |
| d = 5p/8 | 0.244 | 0.024 | 0 | 0 | 0 | 0 |
| d = 6p/8 | 0.208 | 0.015 | 0 | 0 | 0 | 0 |
| d = 7p/8 | 0.199 | 0.012 | 0 | 0 | 0 | 0 |
Table 3:
Proportions of cases satisfying Assumption 4.
| T4 < 5 | p = 8 | p = 16 | p = 32 | p = 64 | p = 128 | p = 256 |
|---|---|---|---|---|---|---|
| d = p/8 | 0.998 | 0.999 | 0.997 | 0.996 | 0.994 | 0.99 |
| d = 2p/8 | 1 | 1 | 0.996 | 0.995 | 0.99 | 0.982 |
| d = 3p/8 | 0.998 | 0.999 | 0.994 | 0.992 | 0.985 | 0.973 |
| d = 4p/8 | 0.999 | 0.998 | 0.991 | 0.995 | 0.984 | 0.954 |
| d = 5p/8 | 0.996 | 0.995 | 0.993 | 0.989 | 0.976 | 0.938 |
| d = 6p/8 | 1 | 0.997 | 0.993 | 0.987 | 0.967 | 0.93 |
| d = 7p/8 | 0.996 | 0.995 | 0.995 | 0.988 | 0.956 | 0.927 |
References
- Bühlmann P. Statistical significance in high-dimensional linear models. Bernoulli, 19(4):1212–1242, 2013. ISSN 1350-7265. doi: 10.3150/12-BEJSP11.
- Bühlmann P, Kalisch M, and Meier L. High-dimensional statistics with a view toward applications in biology. Annual Review of Statistics and Its Application, 1:255–278, 2014.
- Harville DA. Maximum likelihood approaches to variance component estimation and to related problems. J. Amer. Statist. Assoc, 72(358):320–340, 1977. ISSN 0162-1459. URL http://links.jstor.org.offcampus.lib.washington.edu/sici?sici=0162-1459(197706)72:358<320:MLATVC>2.0.CO;2-9&origin=MSN. With a comment by J. N. K. Rao and a reply by the author.
- Henderson CR. Estimation of variance and covariance components. Biometrics, 9:226–252, 1953. ISSN 0006-341X. doi: 10.2307/3001853.
- Javanmard A and Montanari A. Confidence intervals and hypothesis testing for high-dimensional regression. Journal of Machine Learning Research, 15:2869–2909, 2014. URL http://jmlr.org/papers/v15/javanmard14a.html.
- Jiang J. REML estimation: asymptotic behavior and related topics. Ann. Statist, 24(1):255–286, 1996. ISSN 0090-5364. doi: 10.1214/aos/1033066209.
- Lee JD, Sun DL, Sun Y, and Taylor JE. Exact post-selection inference, with application to the lasso. Ann. Statist, 44(3):907–927, 2016. ISSN 0090-5364. doi: 10.1214/15-AOS1371.
- Li Y, Wang S, Song PX-K, Wang N, Zhou L, and Zhu J. Doubly regularized estimation and selection in linear mixed-effects models for high-dimensional longitudinal data. Stat. Interface, 11(4):721–737, 2018. ISSN 1938-7989. doi: 10.4310/SII.2018.v11.n4.a15.
- Lockhart R, Taylor J, Tibshirani RJ, and Tibshirani R. A significance test for the lasso. Ann. Statist, 42(2):413–468, 2014. ISSN 0090-5364. doi: 10.1214/13-AOS1175.
- Meinshausen N and Bühlmann P. Stability selection. J. R. Stat. Soc. Ser. B Stat. Methodol, 72(4):417–473, 2010. ISSN 1369-7412. doi: 10.1111/j.1467-9868.2010.00740.x.
- Meinshausen N, Meier L, and Bühlmann P. p-values for high-dimensional regression. J. Amer. Statist. Assoc, 104(488):1671–1681, 2009. ISSN 0162-1459. doi: 10.1198/jasa.2009.tm08647.
- Ning Y and Liu H. A general theory of hypothesis tests and confidence regions for sparse high dimensional models. Ann. Statist, 45(1):158–195, 2017. ISSN 0090-5364. doi: 10.1214/16-AOS1448.
- Schelldorfer J, Bühlmann P, and van de Geer S. Estimation for high-dimensional linear mixed-effects models using ℓ1-penalization. Scand. J. Stat, 38(2):197–214, 2011. ISSN 0303-6898. doi: 10.1111/j.1467-9469.2011.00740.x.
- Searle SR. Another look at Henderson’s methods of estimating variance components. Biometrics, 24(4):749–787, 1968. ISSN 0006341X, 15410420. URL http://www.jstor.org/stable/2528870.
- Shah RD and Samworth RJ. Variable selection with error control: another look at stability selection. J. R. Stat. Soc. Ser. B. Stat. Methodol, 75(1):55–80, 2013. ISSN 1369-7412. doi: 10.1111/j.1467-9868.2011.01034.x.
- Shao J and Deng X. Estimation in high-dimensional linear models with deterministic design matrices. Ann. Statist, 40(2):812–831, 2012. ISSN 0090-5364. doi: 10.1214/12-AOS982.
- Sun T and Zhang C-H. Scaled sparse linear regression. Biometrika, 99(4):879–898, 2012. ISSN 0006-3444. doi: 10.1093/biomet/ass043.
- Tibshirani RJ, Taylor J, Lockhart R, and Tibshirani R. Exact Post-Selection Inference for Sequential Regression Procedures. ArXiv e-prints, Jan. 2014.
- van de Geer S, Bühlmann P, Ritov Y, and Dezeure R. On asymptotically optimal confidence regions and tests for high-dimensional models. Ann. Statist, 42(3):1166–1202, 2014. ISSN 0090-5364. doi: 10.1214/14-AOS1221.
- Wasserman L and Roeder K. High-dimensional variable selection. Ann. Statist, 37(5A):2178–2201, 2009. ISSN 0090-5364. doi: 10.1214/08-AOS646.
- Ye F and Zhang C-H. Rate minimaxity of the Lasso and Dantzig selector for the ℓq loss in ℓr balls. J. Mach. Learn. Res, 11:3519–3540, Dec. 2010. ISSN 1532-4435. URL http://dl.acm.org/citation.cfm?id=1756006.1953043.
- Yu Y, Bradic J, and Samworth RJ. Confidence intervals for high-dimensional Cox models. ArXiv e-prints, Sept. 2018.
- Zhang C-H and Zhang SS. Confidence intervals for low dimensional parameters in high dimensional linear models. J. R. Stat. Soc. Ser. B. Stat. Methodol, 76(1):217–242, 2014. ISSN 1369-7412. doi: 10.1111/rssb.12026.
- Zhao P and Yu B. On model selection consistency of lasso. Journal of Machine Learning Research, 7(Nov):2541–2563, 2006.



