Abstract
We provide a general theory of the expectation-maximization (EM) algorithm for inferring high dimensional latent variable models. In particular, we make two contributions: (i) For parameter estimation, we propose a novel high dimensional EM algorithm which naturally incorporates sparsity structure into parameter estimation. With an appropriate initialization, this algorithm converges at a geometric rate and attains an estimator with the (near-)optimal statistical rate of convergence. (ii) Based on the obtained estimator, we propose new inferential procedures for testing hypotheses and constructing confidence intervals for low dimensional components of high dimensional parameters. For a broad family of statistical models, our framework establishes the first computationally feasible approach for optimal estimation and asymptotic inference in high dimensions. Our theory is supported by thorough numerical results.
1 Introduction
The expectation-maximization (EM) algorithm (Dempster et al., 1977) is the most popular approach for calculating the maximum likelihood estimator of latent variable models. Nevertheless, due to the nonconcavity of the likelihood function of latent variable models, the EM algorithm generally only converges to a local maximum rather than the global one (Wu, 1983). On the other hand, existing statistical guarantees for latent variable models are only established for global optima (Bartholomew et al., 2011). Therefore, there exists a gap between computation and statistics.
Significant progress has been made toward closing the gap between the local maximum attained by the EM algorithm and the maximum likelihood estimator (Wu, 1983; Tseng, 2004; McLachlan and Krishnan, 2007; Chrétien and Hero, 2008; Balakrishnan et al., 2014). In particular, Wu (1983) first establishes general sufficient conditions for the convergence of the EM algorithm. Tseng (2004); Chrétien and Hero (2008) further improve this result by viewing the EM algorithm as a proximal point method applied to the Kullback-Leibler divergence. See McLachlan and Krishnan (2007) for a detailed survey. More recently, Balakrishnan et al. (2014) establish the first result that characterizes explicit statistical and computational rates of convergence for the EM algorithm. They prove that, given a suitable initialization, the EM algorithm converges at a geometric rate to a local maximum close to the maximum likelihood estimator. All these results are established in the low dimensional regime where the dimension d is much smaller than the sample size n.
In high dimensional regimes where the dimension d is much larger than the sample size n, there exists no theoretical guarantee for the EM algorithm. In fact, when d ≫ n, the maximum likelihood estimator is in general not well defined, unless the models are carefully regularized by sparsity-type assumptions. Furthermore, even if a regularized maximum likelihood estimator can be obtained in a computationally tractable manner, establishing the corresponding statistical properties, especially asymptotic normality, can still be challenging because of the existence of high dimensional nuisance parameters. To address such a challenge, we develop a general inferential theory of the EM algorithm for parameter estimation and uncertainty assessment of high dimensional latent variable models. In particular, we make two contributions in this paper:
For high dimensional parameter estimation, we propose a novel high dimensional EM algorithm by attaching a truncation step to the expectation step (E-step) and maximization step (M-step). Such a truncation step effectively enforces the sparsity of the attained estimator and allows us to establish a significantly improved statistical rate of convergence.
Based upon the estimator attained by the high dimensional EM algorithm, we propose a family of decorrelated score and Wald statistics for testing hypotheses for low dimensional components of the high dimensional parameter. The decorrelated Wald statistic can be further used to construct optimal valid confidence intervals for low dimensional parameters of interest.
Under a unified analytic framework, we establish simultaneous statistical and computational guarantees for the proposed high dimensional EM algorithm and the respective uncertainty assessment procedures. Let β* ∈ ℝd be the true parameter, s* be its sparsity level and {β(t)}t=0,…,T be the iterative solution sequence of the high dimensional EM algorithm, with T being the total number of iterations. In particular, we prove that:
- Given an appropriate initialization βinit with relative error upper bounded by a constant κ ∈ (0, 1), i.e., ‖βinit − β*‖2/‖β*‖2 ≤ κ, the iterative solution sequence satisfies
| ‖β(t) − β*‖2 ≤ Δ1·ρ^t + Δ2·√(s*·log d/n) | (1.1) |
with high probability. Here ρ ∈ (0, 1), and Δ1, Δ2 are quantities that possibly depend on ρ, κ and β*. As the optimization error term in (1.1) decreases to zero at a geometric rate with respect to t, the overall estimation error achieves the √(s*·log d/n) statistical rate of convergence (up to an extra factor of log n), which is (near-)minimax-optimal. See Theorem 3.4 and the corresponding discussion for details.
- The proposed decorrelated score and Wald statistics are asymptotically normal. Moreover, their limiting variances and the size of the respective confidence interval are optimal in the sense that they attain the semiparametric information bound for the low dimensional components of interest in the presence of high dimensional nuisance parameters. See Theorems 4.6 and 4.7 for details.
Our framework allows two implementations of the M-step: exact maximization versus approximate maximization. The former calculates the maximizer exactly, while the latter performs an approximate maximization through a gradient ascent step. Our framework is quite general. We illustrate its effectiveness by applying it to three high dimensional latent variable models: the Gaussian mixture model, the mixture of regression model, and regression with missing covariates.
Comparison with Related Work
A closely related work is by Balakrishnan et al. (2014), which considers the low dimensional regime where d is much smaller than n. Under certain initialization conditions, they prove that the EM algorithm converges at a geometric rate to some local optimum that attains the statistical rate of convergence. They cover both maximization and gradient ascent implementations of the M-step, and establish the consequences for the three latent variable models considered in our paper under low dimensional settings. Our framework adopts their view of treating the EM algorithm as a perturbed version of gradient methods. However, to handle the challenge of high dimensionality, the key ingredient of our framework is the truncation step that enforces the sparsity structure along the solution path. Such a truncation operation poses significant challenges for both computational and statistical analysis. In detail, for the computational analysis we need to carefully characterize the evolution of each intermediate solution’s support and its effects on the evolution of the entire iterative solution sequence. For the statistical analysis, we need to establish a fine-grained characterization of the entrywise statistical error, which is technically more challenging than establishing the ℓ2-norm error bound employed by Balakrishnan et al. (2014). In high dimensional regimes, we need to establish the √(s*·log d/n) statistical rate of convergence, which is much sharper than their √(d/n) rate when d ≫ n. In addition to point estimation, we further construct confidence intervals and hypothesis tests for latent variable models in the high dimensional regime, which have not been established before.
High dimensionality poses significant challenges for assessing the uncertainty (e.g., constructing confidence intervals and testing hypotheses) of the constructed estimators. For example, Knight and Fu (2000) show that the limiting distribution of the Lasso estimator is not Gaussian even in the low dimensional regime. A variety of approaches have been proposed to correct the Lasso estimator to attain asymptotic normality, including the debiasing method (Javanmard and Montanari, 2014), the desparsification methods (Zhang and Zhang, 2014; van de Geer et al., 2014) as well as instrumental variable-based methods (Belloni et al., 2012, 2013, 2014). Meanwhile, Lockhart et al. (2014); Taylor et al. (2014); Lee et al. (2013) propose post-selection procedures for exact inference. In addition, several authors propose methods based on data splitting (Wasserman and Roeder, 2009; Meinshausen et al., 2009), stability selection (Meinshausen and Bühlmann, 2010) and ℓ2-confidence sets (Nickl and van de Geer, 2013). However, these approaches mainly focus on generalized linear models rather than latent variable models. In addition, their results heavily rely on the fact that the estimator is a global optimum of a convex program. In comparison, our approach applies to a much broader family of statistical models with latent structures. For these latent variable models, it is computationally infeasible to obtain the global maximum of the penalized likelihood due to the nonconcavity of the likelihood function. Unlike existing approaches, our inferential theory is developed for the estimator attained by the proposed high dimensional EM algorithm, which is not necessarily a global optimum of any optimization formulation.
Another line of research for the estimation of latent variable models is the tensor method, which exploits the structures of third or higher order moments. See Anandkumar et al. (2014a,b,c) and the references therein. However, existing tensor methods primarily focus on the low dimensional regime where d≪n. In addition, since the high order sample moments generally have a slow statistical rate of convergence, the estimators obtained by the tensor methods usually have a suboptimal statistical rate even for d≪n. For example, Chaganty and Liang (2013) establish the statistical rate of convergence for mixture of regression model, which is suboptimal compared with the minimax lower bound. Similarly, in high dimensional settings, the statistical rates of convergence attained by tensor methods are significantly slower than the statistical rate obtained in this paper.
The three latent variable models considered in this paper have been well studied. Nevertheless, only a few works establish theoretical guarantees for the EM algorithm. In particular, for Gaussian mixture model, Dasgupta and Schulman (2000, 2007); Chaudhuri et al. (2009) establish parameter estimation guarantees for the EM algorithm and its extensions. For mixture of regression model, Yi et al. (2013) establish exact parameter recovery guarantees for the EM algorithm under a noiseless setting. For high dimensional mixture of regression model, Städler et al. (2010) analyze the gradient EM algorithm for the ℓ1-penalized log-likelihood. They establish support recovery guarantees for the attained local optimum but have no parameter estimation guarantees. In comparison with existing works, this paper establishes a general inferential framework for simultaneous parameter estimation and uncertainty assessment based on a novel high dimensional EM algorithm. Our analysis provides the first theoretical guarantee of parameter estimation and asymptotic inference in high dimensional regimes for the EM algorithm and its applications to a broad family of latent variable models.
Notation
Let A = [Ai,j] ∈ ℝd×d and v = (v1, …, vd)⊤ ∈ ℝd. We define the ℓq-norm (q ≥ 1) of v as ‖v‖q = (|v1|^q + ⋯ + |vd|^q)^{1/q}. In particular, ‖v‖0 denotes the number of nonzero entries of v. For q ≥ 1, we define ‖A‖q as the operator norm of A; specifically, ‖A‖2 is the spectral norm. For a set 𝒮, |𝒮| denotes its cardinality. We denote the d×d identity matrix by Id. For index sets ℐ, 𝒥 ⊆ {1, …, d}, we define Aℐ,𝒥 ∈ ℝd×d to be the matrix whose (i, j)-th entry equals Ai,j if i ∈ ℐ and j ∈ 𝒥, and zero otherwise. We define vℐ similarly. We use ⊗ and ⊙ to denote the outer product and the Hadamard product between vectors. The matrix (p, q)-norm, i.e., ‖A‖p,q, is obtained by taking the ℓp-norm of each row and then taking the ℓq-norm of the obtained row norms. Let supp(v) be the support of v, i.e., the index set of its nonzero entries. We use C, C′, … to denote generic absolute constants, whose values may vary from line to line. In addition, we use ‖·‖ψq (q ≥ 1) to denote the Orlicz norm of random variables. We will introduce more notation in §2.2.
The rest of the paper is organized as follows. In §2 we present the high dimensional EM algorithm and the corresponding procedures for inference, and then apply them to three latent variable models. In §3 and §4, we establish the main theoretical results for computation, parameter estimation and asymptotic inference, as well as their implications for specific latent variable models. In §5 we sketch the proof of the main results. In §6 we present the numerical results. In §7 we conclude the paper.
2 Methodology
We first introduce the high dimensional EM Algorithm. Then we present the respective procedures for asymptotic inference. Finally, we apply the proposed methods to three latent variable models.
2.1 High Dimensional EM Algorithm
Before we introduce the proposed high dimensional EM Algorithm (Algorithm 1), we briefly review the classical EM algorithm. Let hβ(y) be the probability density function of Y ∈ 𝒴, where β ∈ ℝd is the model parameter. For latent variable models, we assume that hβ(y) is obtained by marginalizing over an unobserved latent variable Z ∈ 𝒵, i.e.,
| hβ(y) = ∫𝒵 fβ(y, z) dz, | (2.1) |
where fβ(y, z) denotes the joint density of Y and Z.
Given the n observations y1, …, yn of Y, the EM algorithm aims at maximizing the log-likelihood
| ℓn(β) = Σi=1,…,n log hβ(yi). | (2.2) |
Due to the unobserved latent variable Z, it is difficult to directly evaluate ℓn(β). Instead, we turn to consider the difference between ℓn(β) and ℓn(β′). Let kβ(z | y) be the density of Z conditioning on the observed variable Y = y, i.e.,
| kβ(z | y) = fβ(y, z)/hβ(y). | (2.3) |
According to (2.1) and (2.2), we have
| (2.4) |
where the third equality follows from (2.3) and the inequality is obtained from Jensen’s inequality. On the right-hand side of (2.4) we have
| (2.5) |
We define the first term on the right-hand side of (2.5) to be Qn(β; β′). Correspondingly, we define its expectation to be Q(β; β′). Note that the second term on the right-hand side of (2.5) does not depend on β. Hence, given some fixed β′, we can maximize the lower bound function Qn(β; β′) over β to obtain a sufficiently large ℓn(β) − ℓn(β′). Based on this observation, at the t-th iteration of the classical EM algorithm, we evaluate Qn(β; β(t)) at the E-step and then perform maxβ Qn(β; β(t)) at the M-step. See McLachlan and Krishnan (2007) for more details.
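To make the Jensen step explicit, the per-observation version of the bound behind (2.4) can be written as follows (a sketch in our own notation; summing it over y1, …, yn and splitting the logarithm of the ratio yields the two terms described after (2.5)):

```latex
\log h_{\beta}(y) \;-\; \log h_{\beta'}(y)
  \;=\; \log \int_{\mathcal{Z}} k_{\beta'}(z \mid y)\,
        \frac{f_{\beta}(y, z)}{f_{\beta'}(y, z)}\, dz
  \;\ge\; \int_{\mathcal{Z}} k_{\beta'}(z \mid y)\,
        \log \frac{f_{\beta}(y, z)}{f_{\beta'}(y, z)}\, dz .
```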
Algorithm 1.
High Dimensional EM Algorithm
| 1: | Parameter: Sparsity Parameter ŝ, Maximum Number of Iterations T |
| 2: | Initialization: 𝒮̂init ← supp(βinit, ŝ), β(0) ← trunc(βinit, 𝒮̂init) |
| {supp(·, ·) and trunc(·, ·) are defined in (2.6) and (2.7)} | |
| 3: | For t = 0 to T − 1 |
| 4: | E-step: Evaluate Qn(β; β(t)) |
| 5: | M-step: β(t+0.5) ← Mn(β(t)) {Mn(·) is implemented as in Algorithm 2 or 3} |
| 6: | T-step: 𝒮̂(t+0.5) ← supp(β(t+0.5), ŝ), β(t+1) ← trunc(β(t+0.5), 𝒮̂(t+0.5)) |
| 7: | End For |
| 8: | Output: β̂ ← β(T) |
The proposed high dimensional EM algorithm (Algorithm 1) is built upon the E-step and M-step (lines 4 and 5) of the classical EM algorithm. In addition to the exact maximization implementation of the M-step (Algorithm 2), we allow the gradient ascent implementation of the M-step (Algorithm 3), which performs an approximate maximization via a gradient ascent step. To handle the challenge of high dimensionality, in line 6 of Algorithm 1 we perform a truncation step (T-step) to enforce the sparsity structure. In detail, we define the supp(·, ·) function in line 6 as
| supp(β, s) = {j : |βj| is among the s largest of |β1|, …, |βd|}. | (2.6) |
Also, for an index set 𝒮 ⊆ {1, …, d}, we define the trunc(·, ·) function in line 6 as
| [trunc(β, 𝒮)]j = βj·1{j ∈ 𝒮}, j = 1, …, d. | (2.7) |
Note that β(t+0.5) is the output of the M-step (line 5) at the t-th iteration of the high dimensional EM algorithm. To obtain β(t+1), the T-step (line 6) preserves the ŝ entries of β(t+0.5) with the largest magnitudes and sets the rest to zero. Here ŝ is a tuning parameter that controls the sparsity level (line 1). By iteratively performing the E-step, M-step and T-step, the high dimensional EM algorithm attains an ŝ-sparse estimator β̂ = β(T) (line 8). Here T is the total number of iterations.
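For concreteness, a minimal NumPy sketch of the supp(·, ·) and trunc(·, ·) operations in (2.6) and (2.7) is given below; the function names mirror the paper's notation, but the implementation details are our own.

```python
import numpy as np

def supp(beta, s_hat):
    """Indices of the s_hat entries of beta with largest magnitude, as in (2.6)."""
    return np.argsort(np.abs(beta))[-s_hat:]

def trunc(beta, S):
    """Keep the entries of beta indexed by S and set the rest to zero, as in (2.7)."""
    out = np.zeros_like(beta)
    out[S] = beta[S]
    return out

# T-step of Algorithm 1: keep the s_hat largest-magnitude entries of the M-step output.
# beta_next = trunc(beta_half, supp(beta_half, s_hat))
```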
It is worth noting that, the truncation strategy employed here and its variants are widely adopted in the context of sparse linear regression and sparse principal component analysis. For example, see Yuan and Zhang (2013); Yuan et al. (2013) and the references therein. Nevertheless, we incorporate this truncation strategy into the EM algorithm for the first time. Also, our analysis is significantly different from existing works.
Algorithm 2.
Maximization Implementation of the M-step
| 1: | Input: β(t), Qn(β; β(t)) |
| 2: | Output: Mn(β(t)) ← argmaxβ Qn(β; β(t)) |
Algorithm 3.
Gradient Ascent Implementation of the M-step
| 1: | Input: β(t), Qn(β; β(t)) |
| 2: | Parameter: Stepsize η > 0 |
| 3: | Output: Mn(β(t)) ← β(t) + η · ∇Qn(β(t); β(t)) |
| {The gradient is taken with respect to the first β(t) in Qn(β(t); β(t))} |
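Putting lines 3–7 of Algorithm 1 together, a schematic implementation might look as follows. Here m_step stands for a model-specific routine that folds the E-step into either Algorithm 2 or Algorithm 3 (the closed forms for the three examples are in §A); it is an assumption of this sketch rather than part of the paper's code.

```python
import numpy as np

def truncate_top(beta, s_hat):
    """T-step: keep the s_hat largest-magnitude entries of beta, zero out the rest."""
    keep = np.argsort(np.abs(beta))[-s_hat:]
    out = np.zeros_like(beta)
    out[keep] = beta[keep]
    return out

def high_dim_em(beta_init, m_step, s_hat, T):
    """Algorithm 1: alternate the (E+M)-step with the T-step for T iterations."""
    beta = truncate_top(beta_init, s_hat)
    for _ in range(T):
        beta_half = m_step(beta)               # E-step and M-step (Algorithm 2 or 3)
        beta = truncate_top(beta_half, s_hat)  # T-step
    return beta

def m_step_gradient(grad_Qn, eta):
    """Algorithm 3: M_n(beta) = beta + eta * gradient of Q_n(.; beta) evaluated at beta."""
    return lambda beta: beta + eta * grad_Qn(beta, beta)
```

For the exact maximization implementation (Algorithm 2), m_step would simply be the model-specific argmax of Qn(·; β(t)).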
2.2 Asymptotic Inference
In the sequel, we first introduce some additional notations. Then we present the proposed methods for asymptotic inference in high dimensions.
Notation
Let ∇1Q(β; β′) be the gradient of Q(β; β′) with respect to β and ∇2Q(β; β′) be the gradient with respect to β′. If there is no confusion, we simply denote ∇Q(β; β′) = ∇1Q(β; β′) as in the previous sections. We define the higher order derivatives in the same manner, e.g., ∇1,2²Q(β; β′) is calculated by first taking the derivative with respect to β and then with respect to β′. For β = (β1⊤, β2⊤)⊤ with β1 ∈ ℝd1, β2 ∈ ℝd2 and d1 + d2 = d, we use notations such as vβ1 ∈ ℝd1 and Aβ1,β2 ∈ ℝd1×d2 to denote the corresponding subvector of v ∈ ℝd and the submatrix of A ∈ ℝd×d.
We aim to conduct asymptotic inference for low dimensional components of the high dimensional parameter β*. Without loss of generality, we consider a single entry of β*. In particular, we assume β* = [α*, (γ*)⊤]⊤, where α* ∈ ℝ is the entry of interest, while γ* ∈ ℝd−1 is treated as the nuisance parameter. In the following, we construct two hypothesis tests, namely, the decorrelated score and Wald tests. Based on the decorrelated Wald test, we further construct a valid confidence interval for α*. It is worth noting that our method and theory can be easily generalized to perform statistical inference for an arbitrary low dimensional subvector of β*.
Decorrelated Score Test
For the score test, we are primarily interested in testing H0 : α* = 0, since this null hypothesis characterizes the uncertainty in variable selection. Our method easily generalizes to H0 : α* = α0 with α0 ≠ 0. For notational simplicity, we define the following key quantity
| Tn(β) = ∇1,1²Qn(β; β) + ∇1,2²Qn(β; β). | (2.8) |
Let β = (α, γ⊤)⊤. We define the decorrelated score function Sn(·, ·) ∈ ℝ as
| Sn(β, λ) = [∇1Qn(β; β)]α − w(β, λ)⊤·[∇1Qn(β; β)]γ. | (2.9) |
Here w(β, λ) ∈ ℝd−1 is obtained using the following Dantzig selector (Candès and Tao, 2007)
| w(β, λ) = argmin w∈ℝd−1 ‖w‖1 subject to ‖[Tn(β)]γ,α − [Tn(β)]γ,γ·w‖∞ ≤ λ, | (2.10) |
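Since (2.10) is a linear program, it can be solved with off-the-shelf LP solvers. The sketch below is our own illustration (not the paper's implementation): it solves a Dantzig-type problem min ‖w‖1 subject to ‖b − A·w‖∞ ≤ λ, with A = [Tn(β)]γ,γ and b = [Tn(β)]γ,α under our reading of (2.10), via the standard split w = u − v with u, v ≥ 0.

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(A, b, lam):
    """Solve min ||w||_1 s.t. ||b - A w||_inf <= lam via the split w = u - v, u, v >= 0."""
    p = A.shape[1]
    c = np.ones(2 * p)                 # objective: sum(u) + sum(v) = ||w||_1
    M = np.hstack([A, -A])             # A w written in terms of x = [u; v]
    A_ub = np.vstack([M, -M])          # encodes  b - lam <= A w <= b + lam
    b_ub = np.concatenate([b + lam, lam - b])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * (2 * p), method="highs")
    uv = res.x
    return uv[:p] - uv[p:]
```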
where λ > 0 is a tuning parameter. Let β̂ = (α̂, γ̂⊤)⊤, where β̂ is the estimator attained by the high dimensional EM algorithm (Algorithm 1). We define the decorrelated score statistic as
| (2.11) |
Here we use β̂0 = (0, γ̂⊤)⊤ instead of β̂ since we are interested in the null hypothesis H0 : α* = 0. We can also replace β̂0 with β̂ and the theoretical results will remain the same. In §4 we will prove that the proposed decorrelated score statistic in (2.11) is asymptotically N(0, 1). Consequently, the decorrelated score test with significance level δ ∈ (0, 1) takes the form
where Φ−1(·) is the inverse function of the Gaussian cumulative distribution function. If ψS(δ) = 1, we reject the null hypothesis H0 : α* = 0. Correspondingly, the associated p-value takes the form
The intuition for the decorrelated score statistic in (2.11) can be understood as follows. Since ℓn(β) is the log-likelihood, its score function is ∇ℓn(β) and the Fisher information at β* is I(β*) = −𝔼 β* [∇2ℓn(β*)]/n, where 𝔼β*(·) means the expectation is taken under the model with parameter β*. The following lemma reveals the connection of ∇1Qn(·; ·) in (2.9) and Tn(·) in (2.11) with the score function and Fisher information.
Lemma 2.1
For the true parameter β* and any β ∈ ℝd, it holds that
| ∇1Qn(β; β) = ∇ℓn(β)/n, 𝔼β*[Tn(β*)] = 𝔼β*[∇2ℓn(β*)]/n = −I(β*). | (2.12) |
Proof
See §C.1 for details.
Recall that the log-likelihood ℓn(β) defined in (2.2) is difficult to evaluate due to the unobserved latent variable. Lemma 2.1 provides a feasible way to calculate or estimate the corresponding score function and Fisher information, since Qn(·; ·) and Tn(·) have closed forms. The geometric intuition behind Lemma 2.1 can be understood as follows. By (2.4) and (2.5) we have
| (2.13) |
By (2.12), both sides of (2.13) have the same gradient with respect to β at β′ = β. Furthermore, by (2.5), (2.13) becomes an equality for β′ = β. Therefore, the lower bound function on the right-hand side of (2.13) is tangent to ℓn(β) at β′ = β. Meanwhile, according to (2.8), Tn(β) defines a modified curvature for the right-hand side of (2.13), which is obtained by taking the gradient with respect to β, setting β′ = β, and then differentiating once more with respect to β. The second equation in (2.12) shows that the obtained curvature equals the curvature of ℓn(β) at β = β* in expectation (up to a renormalization factor of n). Therefore, ∇1Qn(β; β) gives the score function and Tn(β*) gives a good estimate of the Fisher information I(β*). Since β* is unknown in practice, later we will use Tn(β̂) or Tn(β̂0) to approximate Tn(β*).
In the presence of the high dimensional nuisance parameter γ* ∈ ℝd−1, the classical score test is no longer applicable. In detail, the score test for H0 : α* = 0 relies on the following Taylor expansion of the score function ∂ℓn(·)/∂α
| (2.14) |
Here β* = [0, (γ*)⊤]⊤, R̅ denotes the remainder term and β̅0 = (0, γ̅⊤)⊤, where γ̅ is an estimator of the nuisance parameter γ*. The asymptotic normality of the left-hand side of (2.14) relies on the fact that the two leading terms on its right-hand side are jointly asymptotically normal and that R̅ is oℙ(1). In low dimensional settings, such a necessary condition holds for γ̅ being the maximum likelihood estimator. However, in high dimensional settings, the maximum likelihood estimator cannot guarantee that R̅ is oℙ(1), since ‖γ̅ − γ*‖2 can be large due to the curse of dimensionality. Meanwhile, for γ̅ being a sparsity-type estimator, in general the asymptotic normality of √n·(γ̅ − γ*) does not hold. For example, let γ̅ be γ̂, where γ̂ ∈ ℝd−1 is the subvector of β̂, i.e., the estimator attained by the proposed high dimensional EM algorithm. Note that γ̂ has many zero entries due to the truncation step. As n → ∞, some entries of √n·(γ̂ − γ*) have limiting distributions with point mass at zero. Clearly, this limiting distribution is not Gaussian with nonzero variance. In fact, for a similar setting of high dimensional linear regression, Knight and Fu (2000) illustrate that for γ♯ being a subvector of the Lasso estimator and γ* being the corresponding subvector of the true parameter, the limiting distribution of √n·(γ♯ − γ*) is not Gaussian.
The decorrelated score function defined in (2.9) successfully addresses the above issues. In detail, according to (2.12) in Lemma 2.1 we have
| (2.15) |
Intuitively, if we replace w(β̂0, λ) with w ∈ ℝd−1 that satisfies
| (2.16) |
we have the following Taylor expansion of the decorrelated score function
| (2.17) |
where term (ii) is zero by (2.16). Therefore, we no longer require the asymptotic normality of γ̂ − γ*. Also, we will prove that the new remainder term R̃ in (2.17) is oℙ(1), since γ̂ has a fast statistical rate of convergence. Now we only need to find a w that satisfies (2.16). Nevertheless, it is difficult to calculate the second order derivatives in (2.16), because it is hard to evaluate ℓn(·). According to (2.12), we use the submatrices of Tn(·) to approximate the derivatives in (2.16). Since [Tn(β)]γ,γ is not invertible in high dimensions, we use the Dantzig selector in (2.10) to approximately solve the linear system in (2.16). Based on this intuition, we can expect that √n·Sn(β̂0, λ) is asymptotically normal, since term (i) in (2.17) is a (rescaled) average of n i.i.d. random variables, to which we can apply the central limit theorem. Moreover, we will prove that −[Tn(β̂0)]α|γ in (2.11) is a consistent estimator of the asymptotic variance of √n·Sn(β̂0, λ). Hence, we can expect that the decorrelated score statistic in (2.11) is asymptotically N(0, 1).
From a high-level perspective, we can view w(β̂0, λ)⊤·∂ℓn(β̂0)/∂γ in (2.15) as the projection of ∂ℓn(β̂0)/∂α onto the span of ∂ℓn(β̂0)/∂γ, where w(β̂0, λ) is the projection coefficient. Intuitively, such a projection guarantees that in (2.15), Sn(β̂0, λ) is orthogonal to (uncorrelated with) ∂ℓn(β̂0)/∂γ, i.e., the score function with respect to the nuisance parameter γ. In this way, the projection corrects for the effects of the high dimensional nuisance parameter. Following this intuition of decorrelation, we call Sn(β̂0, λ) the decorrelated score function.
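To make the construction concrete, the following sketch assembles the decorrelated score statistic from ∇1Qn(β̂0; β̂0), Tn(β̂0) and the Dantzig-selector coefficient w(β̂0, λ). The exact scaling in (2.11) should be checked against the paper; treat this as an illustration of the structure (it is consistent with the asymptotic-variance statement above), not as a verbatim implementation.

```python
import numpy as np

def decorrelated_score_statistic(grad, T, w, n):
    """grad = grad_1 Q_n(beta0_hat; beta0_hat), T = T_n(beta0_hat), w = w(beta0_hat, lam).

    Coordinate 0 plays the role of alpha; the remaining coordinates are gamma.
    """
    S = grad[0] - w @ grad[1:]                    # decorrelated score function, cf. (2.9)
    T_alpha_given_gamma = T[0, 0] - w @ T[1:, 0]  # plug-in version of [T_n]_{alpha|gamma}
    return np.sqrt(n) * S / np.sqrt(-T_alpha_given_gamma)
```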
Decorrelated Wald Test
Based on the decorrelated score test, we propose the decorrelated Wald test. In detail, recall that Tn(·) is defined in (2.8). Let
| (2.18) |
where Sn(·, ·) is the decorrelated score function in (2.9), w(·, ·) is defined in (2.10) and β̂ = (α̂, γ̂⊤)⊤ is the estimator obtained by Algorithm 1. For testing the null hypothesis H0 : α* = 0, we define the decorrelated Wald statistic as
| (2.19) |
where [Tn(β̂)]α|γ is defined by replacing β̂0 with β̂ in (2.11). In §4 we will prove that this statistic is asymptotically N(0, 1). Consequently, the decorrelated Wald test with significance level δ ∈ (0, 1) takes the form
where Φ−1(·) is the inverse function of the Gaussian cumulative distribution function. If ψW(δ) = 1, we reject the null hypothesis H0 : α* = 0. The associated p-value takes the form
In more general settings where α* is possibly nonzero, in §4.1 we will prove that √n·[α̅(β̂, λ) − α*]·{−[Tn(β̂)]α|γ}1/2 is asymptotically N(0, 1). Hence, we construct the two-sided confidence interval for α* with confidence level 1 − δ as
| [ α̅(β̂, λ) − Φ−1(1 − δ/2)·{−n·[Tn(β̂)]α|γ}−1/2, α̅(β̂, λ) + Φ−1(1 − δ/2)·{−n·[Tn(β̂)]α|γ}−1/2 ]. | (2.20) |
The intuition for the decorrelated Wald statistic defined in (2.19) can be understood as follows. For notational simplicity, we define
| S̅n(β, w) = [∇1Qn(β; β)]α − w⊤·[∇1Qn(β; β)]γ, ŵ = w(β̂, λ). | (2.21) |
By the definitions in (2.9) and (2.21), we have S̅n(β̂, ŵ) = Sn(β̂, λ). According to the same intuition as for the asymptotic normality of √n·Sn(β̂0, λ) in the decorrelated score test, we can easily establish the asymptotic normality of √n·S̅n(β̂, ŵ). Based on the proof for the classical Wald test (van der Vaart, 2000), we can further establish the asymptotic normality of √n·(α̱ − α*), where α̱ is defined as the solution to
| (2.22) |
Nevertheless, it is difficult to calculate α̱. Instead, we consider the first order Taylor approximation
| (2.23) |
Here ŵ is defined in (2.21) and the gradient is taken with respect to β in (2.21). According to (2.8) and (2.21), we have that in (2.23),
| ∂S̅n(β̂, ŵ)/∂α = [Tn(β̂)]α,α − ŵ⊤·[Tn(β̂)]γ,α. |
Hence, α̅(β̂, λ) defined in (2.18) is the solution to (2.23). Alternatively, we can view α̅(β̂, λ) as the output of one Newton-Raphson iteration applied to α̂. Since (2.23) approximates (2.22), intuitively α̅(β̂, λ) has similar statistical properties to α̱, i.e., the solution to (2.22). Therefore, we can expect that √n·[α̅(β̂, λ) − α*] is asymptotically normal. Moreover, we will prove that {−[Tn(β̂)]α|γ}−1 is a consistent estimator of the asymptotic variance of √n·[α̅(β̂, λ) − α*]. Thus, the decorrelated Wald statistic in (2.19) is asymptotically N(0, 1). It is worth noting that, although the intuition of the decorrelated Wald statistic is the same as that of the one-step estimator proposed by Bickel (1975), here we do not require the √n-consistency of the initial estimator α̂ to achieve the asymptotic normality of α̅(β̂, λ).
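Under the same reading, the one-step update and the interval in (2.20) can be sketched as follows; the precise form of α̅(β̂, λ) is given in (2.18), so this is only a schematic rendering of the Newton-Raphson interpretation described above.

```python
import numpy as np
from scipy.stats import norm

def one_step_alpha(beta_hat, grad, T, w):
    """One Newton-Raphson step applied to alpha_hat, using the decorrelated score."""
    S = grad[0] - w @ grad[1:]                    # S_n(beta_hat, lam) = S_bar_n(beta_hat, w_hat)
    T_alpha_given_gamma = T[0, 0] - w @ T[1:, 0]  # plug-in [T_n(beta_hat)]_{alpha|gamma}
    return beta_hat[0] - S / T_alpha_given_gamma, T_alpha_given_gamma

def wald_confidence_interval(alpha_bar, T_alpha_given_gamma, n, delta):
    """Two-sided interval of level 1 - delta around the one-step estimator, cf. (2.20)."""
    half_width = norm.ppf(1 - delta / 2) / np.sqrt(-n * T_alpha_given_gamma)
    return alpha_bar - half_width, alpha_bar + half_width
```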
2.3 Applications to Latent Variable Models
In the sequel, we introduce three latent variable models as examples. To apply the high dimensional EM algorithm in §2.1 and the methods for asymptotic inference in §2.2, we only need to specify the forms of Qn(·; ·) defined in (2.5), Mn(·) in Algorithms 2 and 3, and Tn(·) in (2.8) for each model.
Gaussian Mixture Model
Let y1, …, yn be the n i.i.d. realizations of Y ∈ ℝd and
| Y = Z·β* + V. | (2.24) |
Here Z is a Rademacher random variable, i.e., ℙ(Z = +1) = ℙ(Z = −1) = 1/2, and V ~ N(0, σ2 ·Id) is independent of Z, where σ is the standard deviation. We suppose σ is known. In high dimensional settings, we assume that β* ∈ ℝd is sparse. To avoid the degenerate case in which the two Gaussians in the mixture are identical, here we suppose that β* ≠ 0. See §A.1 for the detailed forms of Qn(·; ·), Mn(·) and Tn(·).
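The paper's exact expressions for this model are deferred to §A.1; for orientation, the following is a minimal NumPy sketch of the standard symmetric-mixture update, stated here only as an assumption-laden illustration.

```python
import numpy as np

def m_step_gmm(beta, Y, sigma):
    """Exact M-step for the symmetric mixture Y = Z * beta + V in (2.24).

    omega_i = P(Z_i = +1 | y_i) under the current beta (E-step); the maximizer of
    Q_n(.; beta) is then the average of (2*omega_i - 1) * y_i (M-step).
    """
    omega = 1.0 / (1.0 + np.exp(-2.0 * (Y @ beta) / sigma ** 2))  # E-step responsibilities
    return ((2.0 * omega - 1.0)[:, None] * Y).mean(axis=0)        # M-step update
```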
Mixture of Regression Model
We assume that Y ∈ ℝ and X ∈ ℝd satisfy
| Y = Z·X⊤β* + V, | (2.25) |
where X ~ N(0, Id), V ~ N(0, σ2) and Z is a Rademacher random variable. Here X, V and Z are independent. In the high dimensional regime, we assume β* ∈ ℝd is sparse. To avoid the degenerate case, we suppose β* ≠ 0. In addition, we assume that σ is known. See §A.2 for the detailed forms of Qn(·; ·), Mn(·) and Tn(·).
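Analogously to the Gaussian mixture case, the E-step responsibilities and the two M-step implementations admit closed forms (the paper's versions are in §A.2). The sketch below uses the standard updates, up to the overall normalization of Qn (e.g., a 1/σ2 factor), which only rescales the effective stepsize of the gradient variant.

```python
import numpy as np

def m_step_mor(beta, X, y, sigma, eta=None):
    """M-step for the mixture of regression model in (2.25).

    With eta=None, return the exact maximizer (a weighted least-squares solve);
    otherwise return one gradient ascent step with stepsize eta (Algorithm 3).
    """
    omega = 1.0 / (1.0 + np.exp(-2.0 * y * (X @ beta) / sigma ** 2))  # E-step
    target = (2.0 * omega - 1.0) * y                                  # "signed" responses
    if eta is None:
        return np.linalg.solve(X.T @ X / len(y), X.T @ target / len(y))
    return beta + eta * (X.T @ (target - X @ beta)) / len(y)
```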
Regression with Missing Covariates
We assume that Y ∈ ℝ and X ∈ ℝd satisfy
| Y = X⊤β* + V, | (2.26) |
where X ~ N(0, Id) and V ~ N(0, σ2) are independent. We assume β* is sparse. Let x1, …, xn be the n realizations of X. We assume that each coordinate of xi is missing (unobserved) independently with probability pm ∈ [0, 1). We treat the missing covariates as the latent variable and suppose that σ is known. See §A.3 for the detailed forms of Qn(·; ·), Mn(·) and Tn(·).
3 Theory of Computation and Estimation
Before we lay out the main results, we introduce three technical conditions, which will significantly simplify our presentation. These conditions will be verified for the two implementations of the high dimensional EM algorithm and three latent variable models.
The first two conditions, proposed by Balakrishnan et al. (2014), characterize the properties of the population version lower bound function Q(·; ·), i.e., the expectation of Qn(·; ·) defined in (2.5). We define the respective population version M-step as follows. For the maximization implementation of the M-step (Algorithm 2), we define
| M(β′) = argmaxβ Q(β; β′). | (3.1) |
For the gradient ascent implementation of the M-step (Algorithm 3), we define
| M(β′) = β′ + η·∇1Q(β′; β′), | (3.2) |
where η > 0 is the stepsize in Algorithm 3. Hereafter, we employ ℬ to denote the basin of attraction, i.e., the local region in which the proposed high dimensional EM algorithm enjoys desired statistical and computational guarantees.
Condition 3.1
We define two versions of this condition.
- Lipschitz-Gradient-1(γ1, ℬ). For the true parameter β* and any β ∈ ℬ, we have
| ‖∇1Q(M(β); β*) − ∇1Q(M(β); β)‖2 ≤ γ1·‖β − β*‖2, | (3.3) |
where M(·) is the population version M-step (maximization implementation) defined in (3.1).
- Lipschitz-Gradient-2(γ2, ℬ). For the true parameter β* and any β ∈ ℬ, we have
| ‖∇1Q(β; β*) − ∇1Q(β; β)‖2 ≤ γ2·‖β − β*‖2. | (3.4) |
Condition 3.1 defines a variant of Lipschitz continuity for ∇1Q(·; ·). Note that in (3.3) and (3.4), the gradient is taken with respect to the first variable of Q(·; ·). Meanwhile, the Lipschitz continuity is with respect to the second variable of Q(·; ·). Also, the Lipschitz property is defined only between the true parameter β* and an arbitrary β ∈ ℬ, rather than between two arbitrary β’s. In the sequel, we will use (3.3) and (3.4) in the analysis of the two implementations of the M-step respectively.
Condition 3.2
Concavity-Smoothness(μ, ν, ℬ). For any β1, β2 ∈ ℬ, Q(·; β*) is μ-smooth, i.e.,
| Q(β1; β*) − Q(β2; β*) − ⟨∇1Q(β2; β*), β1 − β2⟩ ≥ −μ/2·‖β1 − β2‖2², | (3.5) |
and ν-strongly concave, i.e.,
| Q(β1; β*) − Q(β2; β*) − ⟨∇1Q(β2; β*), β1 − β2⟩ ≤ −ν/2·‖β1 − β2‖2². | (3.6) |
This condition indicates that, when the second variable of Q(·; ·) is fixed to be β*, the function Q(·; β*) is ‘sandwiched’ between two quadratic functions. Conditions 3.1 and 3.2 are essential to establishing the desired optimization results. In detail, Condition 3.2 ensures that standard convex optimization results for strongly convex and smooth objective functions, e.g., in Nesterov (2004), can be applied to −Q(·; β*). Since our analysis involves not only Q(·; β*) but also Q(·; β) for all β ∈ ℬ, we also need to quantify the difference between Q(·; β*) and Q(·; β) by Condition 3.1. It suggests that this difference can be controlled in the sense that ∇1Q(·; β) is Lipschitz with respect to β. Consequently, for any β ∈ ℬ, the behavior of Q(·; β) mimics that of Q(·; β*). Hence, standard convex optimization results can be adapted to analyze −Q(·; β) for any β ∈ ℬ.
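To see how Conditions 3.1 and 3.2 interact, the following sketch records the standard population-level contraction argument for the exact M-step (cf. Balakrishnan et al., 2014). It uses the self-consistency property M(β*) = β*, equivalently ∇1Q(β*; β*) = 0, the first-order optimality ∇1Q(M(β); β) = 0, and the monotone-gradient consequence of (3.6); the exact constants in Theorem 3.4 are of course derived with additional care.

```latex
\begin{align*}
\nu \,\| M(\beta) - \beta^* \|_2^2
 &\le \big\langle \nabla_1 Q(M(\beta); \beta^*) - \nabla_1 Q(\beta^*; \beta^*),\,
       \beta^* - M(\beta) \big\rangle      && \text{(strong concavity)} \\
 &=   \big\langle \nabla_1 Q(M(\beta); \beta^*) - \nabla_1 Q(M(\beta); \beta),\,
       \beta^* - M(\beta) \big\rangle      && \text{(optimality and self-consistency)} \\
 &\le \gamma_1 \,\| \beta - \beta^* \|_2 \,\| M(\beta) - \beta^* \|_2 .
       && \text{(Cauchy--Schwarz and (3.3))}
\end{align*}
```

Dividing through gives ‖M(β) − β*‖2 ≤ (γ1/ν)·‖β − β*‖2, which is exactly the contraction factor γ1/ν that reappears in the discussion of Theorem 3.4.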
The third condition characterizes the statistical error between the sample version and population version M-steps, i.e., Mn(·) defined in Algorithms 2 and 3, and M(·) in (3.1) and (3.2). Recall that ‖ · ‖0 denotes the total number of nonzero entries in a vector.
Condition 3.3
Statistical-Error(ε, δ, s, n, ℬ). For any fixed β ∈ ℬ with ‖β‖0 ≤ s, we have that
| ‖Mn(β) − M(β)‖∞ ≤ ε | (3.7) |
holds with probability at least 1 − δ. Here ε > 0 possibly depends on δ, sparsity level s, sample size n, dimension d, as well as the basin of attraction ℬ.
In (3.7), the statistical error ε quantifies the ℓ∞-norm of the difference between the population version and sample version M-steps. In particular, we constrain the input β of M(·) and Mn(·) to be s-sparse. Such a condition is different from the one used by Balakrishnan et al. (2014). In detail, they quantify the statistical error with the ℓ2-norm and do not constrain the input of M(·) and Mn(·) to be sparse. Consequently, our subsequent statistical analysis differs from theirs. The reason we use the ℓ∞-norm is that it characterizes the more refined entrywise statistical error, which converges at a fast rate of √(log d/n) (possibly with extra factors depending on specific models). In comparison, the ℓ2-norm statistical error converges at a slow rate of √(d/n), which does not decrease to zero as n increases when d ≫ n. Moreover, the fine-grained entrywise statistical error is crucial to quantifying the effects of the truncation step (line 6 of Algorithm 1) on the iterative solution sequence.
3.1 Main Results
Equipped with Conditions 3.1–3.3, we now lay out the computational and statistical results for the high dimensional EM algorithm. To simplify the technical analysis of the algorithm, we focus on its resampling version, which is illustrated in Algorithm 4.
Algorithm 4.
High Dimensional EM Algorithm with Resampling.
| 1: | Parameter: Sparsity Parameter ŝ, Maximum Number of Iterations T |
| 2: | Initialization: 𝒮̂init ← supp(βinit, ŝ), β(0) ← trunc(βinit, 𝒮̂init), |
| {supp(·, ·) and trunc(·, ·) are defined in (2.6) and (2.7)} | |
| Split the Dataset into T Subsets of Size n/T | |
| {Without loss of generality, we assume n/T is an integer} | |
| 3: | For t = 0 to T − 1 |
| 4: | E-step: Evaluate Qn/T (β; β(t)) with the t-th Data Subset |
| 5: | M-step: β(t+0.5) ← Mn/T(β(t)) |
| {Mn/T(·) is implemented as in Algorithm 2 or 3 with Qn(·; ·) replaced by Qn/T(·; ·)} | |
| 6: | T-step: 𝒮̂(t+0.5) ← supp(β(t+0.5), ŝ), β(t+1) ← trunc(β(t+0.5), 𝒮̂(t+0.5)) |
| 7: | End For |
| 8: | Output: β̂ ← β(T) |
Theorem 3.4
For Algorithm 4, we define ℬ = {β : ‖β − β*‖2 ≤ R}, where R = κ·‖β*‖2 for some κ∈(0, 1). We assume that Condition Concavity-Smoothness(μ, ν, ℬ) holds and ‖βinit − β*‖2 ≤ R/2.
- For the maximization implementation of the M-step (Algorithm 2), we suppose that Condition Lipschitz-Gradient-1(γ1, ℬ) holds with γ1/ν ∈ (0, 1) and
| (3.8) |
| (3.9) |
where C ≥ 1 and C′ > 0 are absolute constants. Under Condition Statistical-Error(ε, δ/T, ŝ, n/T, ℬ) we have that, for t = 1, …, T,
| (3.10) |
holds with probability at least 1 − δ, where C′ is the same constant as in (3.9).
- For the gradient ascent implementation of the M-step (Algorithm 3), we suppose that Condition Lipschitz-Gradient-2(γ2, ℬ) holds with 1 − 2·(ν − γ2)/(ν + μ) ∈ (0, 1) and that the stepsize in Algorithm 3 is set to η = 2/(ν + μ). Meanwhile, we assume that
| (3.11) |
| (3.12) |
where C ≥ 1 and C′ > 0 are absolute constants. Under Condition Statistical-Error(ε, δ/T, ŝ, n/T, ℬ) we have that, for t = 1, …, T,
| (3.13) |
holds with probability at least 1 − δ, where C′ is the same constant as in (3.12).
Proof
See §5.1 for a detailed proof.
The assumptions in (3.8) and (3.11) state that the sparsity parameter ŝ in Algorithm 4 is chosen to be sufficiently large and also of the same order as the true sparsity level s*. These assumptions, which will be used by Lemma 5.1 in the proof of Theorem 3.4, ensure that the error incurred by the truncation step can be upper bounded. In addition, as will be shown for specific models in §3.2, the error term ε in Condition Statistical-Error(ε, δ/T, ŝ, n/T, ℬ) decreases as the sample size n increases. By the assumptions in (3.8) and (3.11), √(ŝ) is of the same order as √(s*). Therefore, the assumptions in (3.9) and (3.12) suggest that the sample size n is sufficiently large such that √(ŝ)·ε is sufficiently small. These assumptions guarantee that the entire iterative solution sequence remains within the basin of attraction ℬ in the presence of statistical error. The assumptions in (3.8), (3.9), (3.11) and (3.12) will become more explicit as we specify the values of γ1, γ2, ν, μ and κ for specific models.
Theorem 3.4 illustrates that the upper bound of the overall estimation error can be decomposed into two terms. The first term is the upper bound of the optimization error, which decreases to zero at a geometric rate of convergence, because we have γ1/ν < 1 in (3.10) and 1 − 2·(ν − γ2)/(ν + μ) < 1 in (3.13). Meanwhile, the second term is the upper bound of the statistical error, which does not depend on t. Since √(ŝ) is of the same order as √(s*), this term is proportional to √(s*)·ε, where ε is the entrywise statistical error between M(·) and Mn(·). We will prove that, for a variety of models and the two implementations of the M-step, ε is roughly of the order √(log d/n). (There may be extra factors attached to ε depending on each specific model.) Hence, the statistical error term is roughly of the order √(s*·log d/n). Consequently, for a sufficiently large t = T such that the optimization and statistical error terms in (3.10) or (3.13) are of the same order, the final estimator β̂ = β(T) attains a (near-)optimal √(s*·log d/n) (possibly with extra factors) rate of convergence for a broad variety of high dimensional latent variable models.
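As a back-of-the-envelope illustration of the last point, balancing a bound of the form ρ^T·Δ + √(s*)·ε with ε ≍ √(log d/n) (the orders suggested above; exact constants are model specific) indicates that a number of iterations T that is only logarithmic in n suffices:

```latex
\rho^{T}\cdot \Delta \;\asymp\; \sqrt{s^*}\cdot\varepsilon
\quad\Longrightarrow\quad
T \;\asymp\; \frac{\log\big(\Delta / (\sqrt{s^*}\,\varepsilon)\big)}{\log(1/\rho)}
  \;=\; O\!\left(\log\frac{n}{s^*\log d}\right).
```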
3.2 Implications for Specific Models
To establish the corresponding results for specific models under the unified framework, we only need to establish Conditions 3.1–3.3 and determine the key quantities R, γ1, γ2, ν, μ, κ and ε. Recall that Conditions 3.1 and 3.2 and the models analyzed in our paper are identical to those in Balakrishnan et al. (2014). Meanwhile, note that Conditions 3.1 and 3.2 only involve the population version lower bound function Q(·; ·) and M-step M(·). Since Balakrishnan et al. (2014) prove that the quantities in Conditions 3.1 and 3.2 are independent of the dimension d and sample size n, their corresponding results can be directly adapted. To establish the final results, it still remains to verify Condition 3.3 for each high dimensional latent variable model.
Gaussian Mixture Model
The following lemma, which is proved by Balakrishnan et al. (2014), verifies Conditions 3.1 and 3.2 for Gaussian mixture model. Recall that σ is the standard deviation of each individual Gaussian distribution within the mixture.
Lemma 3.5
Suppose we have ‖β*‖2/σ ≥ r, where r > 0 is a sufficiently large constant that denotes the minimum signal-to-noise ratio. There exists some absolute constant C > 0 such that Conditions Lipschitz-Gradient-1(γ1, ℬ) and Concavity-Smoothness(μ, ν, ℬ) hold with
| (3.14) |
Proof
See the proof of Corollary 1 in Balakrishnan et al. (2014) for details.
Now we verify Condition 3.3 for the maximization implementation of the M-step (Algorithm 2).
Lemma 3.6
For the maximization implementation of the M-step (Algorithm 2), we have that for a sufficiently large n and ℬ specified in (3.14), Condition Statistical-Error(ε, δ, s, n, ℬ) holds with
| (3.15) |
Proof
See §B.4 for a detailed proof.
The next theorem establishes the implication of Theorem 3.4 for Gaussian mixture model.
Theorem 3.7
We consider the maximization implementation of M-step (Algorithm 2). We assume ‖β*‖2/σ ≥ r holds with a sufficiently large r > 0, and ℬ and R are as defined in (3.14). We suppose the initialization βinit of Algorithm 4 satisfies ‖βinit − β*‖2 ≤ R/2. Let the sparsity parameter ŝ be
| (3.16) |
with C specified in (3.14) and C′ ≥ 1. Let the total number of iterations T in Algorithm 4 be
| (3.17) |
Meanwhile, we suppose the dimension d is sufficiently large such that T in (3.17) is upper bounded by √d, and the sample size n is sufficiently large such that
| (3.18) |
We have that, with probability at least 1 − 2 · d−1/2, the final estimator β̂ = β(T) satisfies
| (3.19) |
Proof
First we plug the quantities in (3.14) and (3.15) into Theorem 3.4. Recall that Theorem 3.4 requires Condition Statistical-Error(ε, δ/T, ŝ, n/T, ℬ). Thus we need to replace δ and n with δ/T and n/T in the definition of ε in (3.15). Then we set δ = 2 · d−1/2. Since T is specified in (3.17) and the dimension d is sufficiently large such that T ≤ √d, we have log [2/(δ/T)] ≤ log d in the definition of ε. By (3.16) and (3.18), we can then verify the assumptions in (3.8) and (3.9). Finally, by plugging T in (3.17) into (3.10) and taking t = T, we can verify that in (3.10) the optimization error term is smaller than the statistical error term up to a constant factor. Therefore, we obtain (3.19).
To see the statistical rate of convergence with respect to s*, d and n, for the moment we assume that R, r, ‖β*‖∞, ‖β*‖2 and σ are absolute constants. From (3.16) and (3.17), we obtain ŝ = C·s* and therefore ΔGMM(s*) = C′·√(s*), which implies T = C″·log[n/(s*·log d)]. Hence, according to (3.19) we have that, with high probability,
Because the minimax lower bound for estimating an s*-sparse d-dimensional vector is √(s*·log d/n), the rate of convergence in (3.19) is optimal up to a factor of log n. Such a logarithmic factor results from the resampling scheme in Algorithm 4, since we only utilize n/T samples within each iteration. We expect that by directly analyzing Algorithm 1 we can eliminate such a logarithmic factor, which, however, incurs extra technical complexity in the analysis.
Mixture of Regression Model
The next lemma, proved by Balakrishnan et al. (2014), verifies Conditions 3.1 and 3.2 for mixture of regression model. Recall that β* is the regression coefficient and σ is the standard deviation of the Gaussian noise.
Lemma 3.8
Suppose ‖β*‖2/σ ≥ r, where r > 0 is a sufficiently large constant that denotes the required minimum signal-to-noise ratio. Conditions Lipschitz-Gradient-1(γ1, ℬ), Lipschitz-Gradient-2(γ2, ℬ) and Concavity-Smoothness(μ, ν, ℬ) hold with
| (3.20) |
Proof
See the proof of Corollary 3 in Balakrishnan et al. (2014) for details.
The following lemma establishes Condition 3.3 for the two implementations of the M-step.
Lemma 3.9
For ℬ specified in (3.20), we have the following results.
- For the maximization implementation of the M-step (line 5 of Algorithm 4), we have that Condition Statistical-Error(ε, δ, s, n, ℬ) holds with
for sufficiently large sample size n and absolute constant C > 0.(3.21) - For the gradient ascent implementation, Condition Statistical-Error(ε, δ, s, n, ℬ) holds with
for sufficiently large sample size n and C > 0, where η denotes the stepsize in Algorithm 3.(3.22)
Proof
See §B.5 for a detailed proof.
The next theorem establishes the implication of Theorem 3.4 for mixture of regression model.
Theorem 3.10
Let ‖β*‖2/σ ≥ r with r > 0 sufficiently large. Assuming that ℬ and R are specified in (3.20) and the initialization βinit satisfies ‖βinit − β*‖2 ≤ R/2, we have the following results.
- For the maximization implementation of the M-step (Algorithm 2), let ŝ and T be
We suppose d and n are sufficiently large such that T ≤ √d and
Then with probability at least 1 − 4 · d−1/2, the final estimator β̂ = β(T) satisfies(3.23) - For the gradient ascent implementation of the M-step (Algorithm 3) with stepsize set to η = 1, let ŝ and T be
We suppose d and n are sufficiently large such that T ≤ √d and
Then with probability at least 1 − 4 · d−1/2, the final estimator β̂ = β(T) satisfies(3.24)
Proof
The proof is similar to Theorem 3.7.
To understand the intuition of Theorem 3.10, we suppose that ‖β*‖2, σ, R and r are absolute constants, which implies ŝ = C·s*. Therefore, for the maximization and gradient ascent implementations of the M-step, we have T = C′·log [n/(s*·log d)] and T = C″·log {n/[(s*)2·log d]} respectively. Hence, by (3.23) in Theorem 3.10 we have that, for the maximization implementation of the M-step,
| (3.25) |
holds with high probability. Meanwhile, from (3.24) in Theorem 3.10 we have that, for the gradient ascent implementation of the M-step,
| (3.26) |
holds with high probability. The statistical rates in (3.25) and (3.26) attain the minimax lower bound up to factors of √(log n) and √(s*·log n) respectively and are therefore near-optimal. Note that the statistical rate of convergence attained by the gradient ascent implementation of the M-step is slower by a factor of √(s*) than the rate of the maximization implementation. However, our discussion in §A.2 illustrates that, for mixture of regression model, the gradient ascent implementation does not involve estimating the inverse covariance of X in (2.25). Hence, the gradient ascent implementation is more computationally efficient, and is also applicable to settings in which X has more general covariance structures.
Regression with Missing Covariates
Recall that we consider the linear regression model with covariates missing completely at random, which is defined in §2.3. The next lemma, which is proved by Balakrishnan et al. (2014), verifies Conditions 3.1 and 3.2 for this model. Recall that pm denotes the probability that each covariate is missing and σ is the standard deviation of the Gaussian noise.
Lemma 3.11
Suppose ‖β*‖2/σ ≤ r, where r > 0 is a constant that denotes the required maximum signal-to-noise ratio. Assuming that we have
for some constant κ ∈ (0, 1), then we have that Conditions Lipschitz-Gradient-2(γ2, ℬ) and Concavity-Smoothness(μ, ν, ℬ) hold with
| (3.27) |
Proof
See the proof of Corollary 6 in Balakrishnan et al. (2014) for details.
The next lemma proves Condition 3.3 for the gradient ascent implementation of the M-step.
Lemma 3.12
Suppose ℬ is defined in (3.27) and ‖β*‖2/σ ≤ r. For the gradient ascent implementation of the M-step (Algorithm 3), Condition Statistical-Error(ε, δ, s, n, ℬ) holds with
| (3.28) |
for sufficiently large sample size n and C > 0. Here η denotes the stepsize in Algorithm 3.
Proof
See §B.6 for a detailed proof.
Next we establish the implication of Theorem 3.4 for regression with missing covariates.
Theorem 3.13
We consider the gradient ascent implementation of M-step (Algorithm 3) in which η = 1. Under the assumptions of Lemma 3.11, we suppose the sparsity parameter ŝ takes the form of (3.11) and the initialization βinit satisfies ‖βinit − β*‖2 ≤ R/2 with ν, μ, γ2, κ and R specified in (3.27). For r > 0 specified in Lemma 3.11, let the total number of iterations T in Algorithm 4 be
| (3.29) |
We assume d and n are sufficiently large such that T ≤ √d and
We have that, with probability at least 1 − 12 · d−1/2, the final estimator β̂ = β(T) satisfies
| (3.30) |
Proof
The proof is similar to Theorem 3.7.
Assuming r, pm and κ are absolute constants, we have ΔRMC(s*) = C·s* in (3.29) since ŝ = C′·s*. Hence, we obtain T = C″·log{n/[(s*)2·log d]}. By (3.30) we have that, with high probability,
which is near-optimal with respect to the minimax lower bound. It is worth noting that the assumption ‖β*‖2/σ ≤ r in Lemma 3.12 requires the signal-to-noise ratio to be upper bounded rather than lower bounded, which differs from the assumptions for the previous two models. Such a counter-intuitive phenomenon is explained by Loh and Wainwright (2012).
4 Theory of Inference
To simplify the presentation of the unified framework, we first lay out several technical conditions, which will later be verified for each model. Some of the notations used in this section are introduced in §2.2. The first condition characterizes the statistical rate of convergence of the estimator attained by the high dimensional EM algorithm (Algorithm 4).
Condition 4.1 Parameter-Estimation (ζEM)
It holds that
where ζEM scales with s*, d and n.
Since both β̂ and β* are sparse, we can verify this condition for each model based on the ℓ2-norm recovery results in Theorems 3.7, 3.10 and 3.13. The second condition quantifies the statistical error between the gradients of Qn(β*; β*) and Q(β*; β*).
Condition 4.2 Gradient-Statistical-Error (ζG)
For the true parameter β*, it holds that
where ζG scales with s*, d and n.
Note that for the gradient ascent implementation of the M-step (Algorithm 3) and its population version defined in (3.2), we have
| Mn(β*) − M(β*) = η·[∇1Qn(β*; β*) − ∇1Q(β*; β*)]. |
Thus, we can verify Condition 4.2 for each model following the proof of Lemmas 3.6, 3.9 and 3.12. Recall that Tn(·) is defined in (2.8). The following condition quantifies the difference between Tn(β*) and its expectation. Recall that ‖·‖∞,∞ denotes the maximum magnitude of all entries in a matrix.
Condition 4.3 Tn(·)-Concentration (ζT)
For the true parameter β*, it holds that
where ζT scales with d and n.
By Lemma 2.1 we have 𝔼β* [Tn(β*)] = −I(β*). Hence, Condition 4.3 characterizes the statistical rate of convergence of Tn(β*) for estimating the Fisher information. Since β* is unknown in practice, we use Tn(β̂) or Tn(β̂0) to approximate Tn(β*), where β̂0 is defined in (2.11). The next condition quantifies the accuracy of this approximation.
Condition 4.4 Tn(·)-Lipschitz(ζL)
For the true parameter β* and any β, we have
| ‖Tn(β) − Tn(β*)‖∞,∞ ≤ ζL·‖β − β*‖1, |
where ζL scales with d and n.
Condition 4.4 characterizes the Lipschitz continuity of Tn(·). We consider β = β̂. Since Condition 4.1 ensures ‖β̂−β*‖1 converges to zero at the rate of ζEM, Condition 4.4 implies Tn(β̂) converges to Tn(β*) entrywise at the rate of ζEM·ζL. In other words, Tn(β̂) is a good approximation of Tn(β*).
In the following, we lay out an assumption on several population quantities and the sample size n. Recall that β* = [α*, (γ*)⊤]⊤, where α* ∈ ℝ is the entry of interest, while γ* ∈ ℝd−1 is the nuisance parameter. By the notations introduced in §2.2, [I(β*)]γ,γ ∈ ℝ(d−1)×(d−1) and [I(β*)]γ,α ∈ ℝ(d−1)×1 denote the submatrices of the Fisher information matrix I(β*) ∈ ℝd×d. We define w* and the related population quantities as
| (4.1) |
We define λ1 [I(β*)] and λd [I(β*)] as the largest and smallest eigenvalues of I(β*), and
| (4.2) |
According to (4.1) and (4.2), we can easily verify that
| (4.3) |
The following assumption ensures that λd [I(β*)] >0. Hence, [I(β*)]γ,γ in (4.1) is invertible. Also, according to (4.3) and the fact that λd[I(β*)] >0, we have [I(β*)]α|γ >0.
Assumption 4.5
We impose the following assumptions.
- For positive absolute constants ρmax and ρmin, we assume
| (4.4) |
- The tuning parameter λ of the Dantzig selector in (2.10) is set to
| (4.5) |
where C ≥ 1 is a sufficiently large absolute constant. We suppose the sample size n is sufficiently large such that
| (4.6) |
The assumption on λd[I(β*)] guarantees that the Fisher information matrix is positive definite. The other assumptions in (4.4) guarantee the existence of the asymptotic variance of √n·Sn(β̂0, λ) in the score statistic defined in (2.11) and that of √n·[α̅(β̂, λ) − α*] in the Wald statistic defined in (2.19). Similar assumptions are standard in existing asymptotic inference results. For example, for mixture of regression model, Khalili and Chen (2007) impose variants of these assumptions.
For specific models, we will show that ζEM, ζG, ζT and λ all decrease with n, while ζL increases with n at a slow rate. Therefore, the assumptions in (4.6) ensure that the sample size n is sufficiently large. We will make these assumptions more explicit after we specify ζEM, ζG, ζT and ζL for each model. Note that the assumptions in (4.6) also implicitly require w* to be sparse. For instance, for λ specified in (4.5), the condition in (4.6) imposes an upper bound that involves the sparsity level of w*. In the following, we will prove that ζT is of the order √(log d/n). Hence, we require that w* ∈ ℝd−1 is sparse. We can understand this sparsity assumption as follows.
According to the definition of w* in (4.1), we have [I(β*)]γ,γ·w* = [I(β*)]γ,α. Therefore, such a sparsity assumption suggests that [I(β*)]γ,α lies within the span of a few columns of [I(β*)]γ,γ.
By the block matrix inversion formula we have {[I(β*)]−1}γ,α = δ·w*, where δ ∈ ℝ (the value of δ is made explicit in the display after this list). Hence, it can also be understood as a sparsity assumption on the (d − 1) × 1 submatrix of [I(β*)]−1.
Such a sparsity assumption on w* is necessary, because otherwise it is difficult to accurately estimate w* in high dimensional regimes. In the context of high dimensional generalized linear models, Zhang and Zhang (2014); van de Geer et al. (2014) impose similar sparsity assumptions.
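For completeness, the constant δ in the second item above can be made explicit under our reading of (4.1)–(4.2), with [I(β*)]α|γ the scalar Schur complement of [I(β*)]γ,γ in I(β*). By the block matrix inversion formula,

```latex
\big\{ [I(\beta^*)]^{-1} \big\}_{\gamma,\alpha}
  \;=\; -\,[I(\beta^*)]_{\gamma,\gamma}^{-1}\,[I(\beta^*)]_{\gamma,\alpha}
        \cdot \big( [I(\beta^*)]_{\alpha|\gamma} \big)^{-1}
  \;=\; -\,\frac{w^*}{[I(\beta^*)]_{\alpha|\gamma}} ,
```

so δ = −1/[I(β*)]α|γ, and sparsity of w* is indeed equivalent to sparsity of this submatrix of [I(β*)]−1.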
4.1 Main Results
Equipped with Conditions 4.1–4.4 and Assumption 4.5, we are now ready to establish our theoretical results to justify the inferential methods proposed in §2.2. We first cover the decorrelated score test and then the decorrelated Wald test. Finally, we establish the optimality of our proposed methods.
Decorrelated Score Test
The next theorem establishes the asymptotic normality of the decorrelated score statistic defined in (2.11).
Theorem 4.6
We consider β* = [α*, (γ*)⊤]⊤ with α* = 0. Under Assumption 4.5 and Conditions 4.1–4.4, we have that for n → ∞,
| (4.7) |
where β̂0 and [Tn(β̂0)]α|γ ∈ ℝ are defined in (2.11).
Proof
See §5.2 for a detailed proof.
Decorrelated Wald Test
The next theorem provides the asymptotic normality of √n·[α̅(β̂, λ) − α*], which implies that the confidence interval for α* in (2.20) is valid. In particular, for testing the null hypothesis H0: α* = 0, the next theorem implies the asymptotic normality of the decorrelated Wald statistic defined in (2.19).
Theorem 4.7
Under Assumption 4.5 and Conditions 4.1–4.4, we have that for n → ∞,
| (4.8) |
where [Tn(β̂)]α|γ ∈ ℝ is defined by replacing β̂0 with β̂ in (2.11).
Proof
See §5.2 for a detailed proof.
Optimality
In Lemma 5.3 in the proof of Theorem 4.6 we show that the limiting variance of the (rescaled) decorrelated score function is [I(β*)]α|γ, which is defined in (4.2). In fact, van der Vaart (2000) proves that for inferring α* in the presence of the nuisance parameter γ*, [I(β*)]α|γ is the semiparametric efficient information, i.e., the minimum limiting variance of the (rescaled) score function. Our proposed decorrelated score function achieves this information lower bound and is hence optimal. Because the decorrelated Wald test and the respective confidence interval are built on the decorrelated score function, the limiting variance of √n·[α̅(β̂, λ) − α*] and the size of the confidence interval are also optimal in terms of the semiparametric information lower bound.
4.2 Implications for Specific Models
To establish the high dimensional inference results for each model, we only need to verify Conditions 4.1–4.4 and determine the key quantities ζEM, ζG, ζT and ζL. In the following, we focus on Gaussian mixture and mixture of regression models.
Gaussian Mixture Model
The following lemma verifies Conditions 4.1 and 4.2.
Lemma 4.8
We have that Conditions 4.1 and 4.2 hold with
where ŝ, ΔGMM(s*), r and T are as defined in Theorem 3.7.
Proof
See §C.6 for a detailed proof.
By our discussion that follows Theorem 3.7, we have that ŝ and ΔGMM(s*) are of the same order as s* and respectively, and T is roughly of the order . Therefore, ζEM is roughly of the order . The following lemma verifies Condition 4.3 for Gaussian mixture model.
Lemma 4.9
We have that Condition 4.3 holds with
Proof
See §C.7 for a detailed proof.
The following lemma establishes Condition 4.4 for Gaussian mixture model.
Lemma 4.10
We have that Condition 4.4 holds with
Proof
See §C.8 for a detailed proof.
Equipped with Lemmas 4.8–4.10, we establish the inference results for Gaussian mixture model.
Theorem 4.11
Under Assumption 4.5, we have that for n→∞, (4.7) and (4.8) hold for Gaussian mixture model. Also, the proposed confidence interval for α* in (2.20) is valid.
In fact, for Gaussian mixture model we can make (4.6) in Assumption 4.5 more transparent by plugging in ζEM, ζG, ζT and ζL specified above. Particularly, for simplicity we assume all quantities except , s*, d and n are absolute constants. Then we can verify that (4.6) holds if
| (4.9) |
According to the discussion following Theorem 3.7, we require s*·log d = o(n/log n) for the estimator β̂ to be consistent. In comparison, (4.9) illustrates that high dimensional inference requires a higher sample complexity than parameter estimation. In the context of high dimensional generalized linear models, Zhang and Zhang (2014); van de Geer et al. (2014) also observe the same phenomenon.
Mixture of Regression Model
The following lemma verifies Conditions 4.1 and 4.2. Recall that ŝ, T and the related quantities are defined in Theorem 3.10, while σ denotes the standard deviation of the Gaussian noise in the mixture of regression model.
Lemma 4.12
We have that Conditions 4.1 and 4.2 hold with
where we have for the maximization implementation of the M-step (Algorithm 2), and for the gradient ascent implementation of the M-step (Algorithm 3).
Proof
See §C.9 for a detailed proof.
By our discussion that follows Theorem 3.10, we have that ŝ is of the same order as s*. For the maximization implementation of the M-step (Algorithm 2), we have that is of the same order as . Meanwhile, for the gradient ascent implementation in Algorithm 3, we have that is of the same order as s*. Hence, ζEM is of the order or correspondingly, since T is roughly of the order . The next lemma establishes Condition 4.3 for mixture of regression model.
Lemma 4.13
We have that Condition 4.3 holds with
Proof
See §C.10 for a detailed proof.
The following lemma establishes Condition 4.4 for mixture of regression model.
Lemma 4.14
We have that Condition 4.4 holds with
Proof
See §C.11 for a detailed proof.
Equipped with Lemmas 4.12–4.14, we are now ready to establish the high dimensional inference results for mixture of regression model.
Theorem 4.15
For mixture of regression model, under Assumption 4.5, both (4.7) and (4.8) hold as n → ∞. Also, the proposed confidence interval for α* in (2.20) is valid.
Similar to the discussion that follows Theorem 4.11, we can make (4.6) in Assumption 4.5 more explicit by plugging in ζEM, ζG, ζT and ζL specified in Lemmas 4.12–4.14. Assuming all quantities except , s*, d and n are absolute constants, we have that (4.6) holds if
In contrast, for high dimensional estimation, we only require (s*)2 · log d = o(n/log n) to ensure the consistency of β̂ by our discussion following Theorem 3.10.
5 Proof of Main Results
We lay out a proof sketch of the main theory. First we prove the results in Theorem 3.4 for parameter estimation and computation. Then we establish the results in Theorems 4.6 and 4.7 for inference.
5.1 Proof of Results for Computation and Estimation
Proof of Theorem 3.4
First we introduce some notation. Recall that the trunc(·, ·) function is defined in (2.7). We define β̅(t+0.5), β̅(t+1) ∈ ℝd as
| (5.1) |
As defined in (3.1) or (3.2), M(·) is the population version M-step with the maximization or gradient ascent implementation. Here 𝒮̂(t+0.5) denotes the set of indices j corresponding to the ŝ entries of β(t+0.5) with the largest magnitudes. It is worth noting that 𝒮̂(t+0.5) is computed from β(t+0.5) in the truncation step (line 6 of Algorithm 4), rather than from β̅(t+0.5) defined in (5.1).
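For concreteness, a minimal sketch of the two operators used throughout this proof, with hypothetical names supp_top and trunc mirroring supp(·, ·) in (2.6) and trunc(·, ·) in (2.7): supp(β, ŝ) collects the indices of the ŝ largest-magnitude entries of β, and trunc(β, 𝒮) zeroes out every entry outside 𝒮.

```python
import numpy as np

def supp_top(beta, s_hat):
    """Indices of the s_hat entries of beta with largest |beta_j| (cf. supp in (2.6))."""
    return np.argsort(np.abs(beta))[::-1][:s_hat]

def trunc(beta, S):
    """Keep the entries of beta indexed by S and set all others to zero (cf. trunc in (2.7))."""
    out = np.zeros_like(beta)
    out[S] = beta[S]
    return out

# toy example: the truncation step (line 6 of Algorithm 4) applied to a dense iterate
beta_half = np.array([0.1, -2.0, 0.3, 1.5, -0.2])
S_hat = supp_top(beta_half, s_hat=2)       # here the indices {1, 3}
beta_next = trunc(beta_half, S_hat)        # [0, -2.0, 0, 1.5, 0]
print(S_hat, beta_next)
```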
Our goal is to characterize the relationship between ‖β(t+1) − β*‖2 and ‖β(t) − β*‖2. According to the definition of the truncation step (line 6 of Algorithm 4) and triangle inequality, we have
| (5.2) |
where the last equality is obtained from (5.1). According to the definition of the trunc(·, ·) function in (2.7), the two terms within term (i) both have support 𝒮̂(t+0.5) with cardinality ŝ. Thus, we have
| (5.3) |
Since we have β(t+0.5) = Mn(β(t)) and β̅(t+0.5) = M(β(t)), we can further establish an upper bound for the right-hand side by invoking Condition 3.3.
Our subsequent proof will establish an upper bound for term (ii) in (5.2) in two steps. We first characterize the relationship between ‖β̅(t+1) − β*‖2 and ‖β̅(t+0.5) − β*‖2 and then the relationship between ‖β̅(t+0.5) − β*‖2 and ‖β(t) − β*‖2. The next lemma accomplishes the first step. Recall that ŝ is the sparsity parameter in Algorithm 4, while s* is the sparsity level of the true parameter β*.
Lemma 5.1
Suppose that we have
| (5.4) |
for some κ ∈ (0, 1). Assuming that we have
| (5.5) |
then it holds that
| (5.6) |
Proof
The proof is based on a fine-grained analysis of the relationship between 𝒮̂(t+0.5) and the true support 𝒮*. In particular, we focus on three index sets, namely, ℐ1 = 𝒮*\𝒮̂(t+0.5), ℐ2 = 𝒮* ∩ 𝒮̂(t+0.5) and ℐ3 = 𝒮̂(t+0.5)\𝒮*. Among them, ℐ2 characterizes the similarity between 𝒮̂(t+0.5) and 𝒮*, while ℐ1 and ℐ3 characterize their difference. The key proof strategy is to represent the three distances in (5.6) with the ℓ2-norms of the restrictions of β̅(t+0.5) and β* on these three index sets, and we focus on these restricted ℓ2-norms throughout. In order to quantify them, we establish a fine-grained characterization of the absolute values of the entries of β̅(t+0.5) that are selected and eliminated by the truncation operation β̅(t+1) ← trunc(β̅(t+0.5), 𝒮̂(t+0.5)). See §B.1 for a detailed proof.
Lemma 5.1 is central to the proof of Theorem 3.4. In detail, the assumption in (5.4) guarantees that β̅(t+0.5) is within the basin of attraction. In (5.5), the first assumption ensures that the sparsity parameter ŝ in Algorithm 4 is set to be sufficiently large, while the second ensures that the statistical error is sufficiently small. These assumptions will be verified in the proof of Theorem 3.4. The intuition behind (5.6) is as follows. Let 𝒮̅(t+0.5) = supp(β̅(t+0.5), ŝ), where supp(·, ·) is defined in (2.6). By triangle inequality, the left-hand side of (5.6) satisfies
| (5.7) |
Intuitively, the two terms on the right-hand side of (5.6) reflect terms (i) and (ii) in (5.7), respectively. In detail, for term (i) in (5.7), recall that according to (5.1) and line 6 of Algorithm 4 we have
When the sample size n is sufficiently large, β̅(t+0.5) and β(t+0.5) are close, since they are attained by the population version and the sample version of the M-step, respectively. Hence, 𝒮̅(t+0.5) = supp(β̅(t+0.5), ŝ) and 𝒮̂(t+0.5) = supp(β(t+0.5), ŝ) should be similar. Thus, in term (i), β̅(t+1) = trunc(β̅(t+0.5), 𝒮̂(t+0.5)) should be close to trunc(β̅(t+0.5), 𝒮̅(t+0.5)) up to some statistical error, which is reflected by the first term on the right-hand side of (5.6).
We now quantify the relationship between ‖β̅(t+0.5) − β*‖2 in (5.6) and term (ii) in (5.7). The truncation in term (ii) preserves the top ŝ coordinates of β̅(t+0.5) with the largest magnitudes while setting the others to zero. Intuitively speaking, the truncation adds to β̅(t+0.5)’s distance to β*. Meanwhile, note that when β̅(t+0.5) is close to β*, 𝒮̅(t+0.5) is similar to 𝒮*. Therefore, the incurred error can be controlled, because the truncation doesn’t eliminate many relevant entries. In particular, as shown in the second term on the right-hand side of (5.6), this incurred error decays as ŝ increases, since in that case 𝒮̂(t+0.5) includes more entries. According to the discussion for term (i) in (5.7), 𝒮̅(t+0.5) is similar to 𝒮̂(t+0.5), which implies that 𝒮̅(t+0.5) should also cover more entries. Thus, fewer relevant entries are wrongly eliminated by the truncation and hence the incurred error is smaller. In the extreme case where ŝ → ∞, term (ii) in (5.7) becomes ‖β̅(t+0.5) − β*‖2, which indicates that no additional error is incurred by the truncation. Correspondingly, the second term on the right-hand side of (5.6) also becomes ‖β̅(t+0.5) − β*‖2.
Next, we turn to characterize the relationship between ‖β̅(t+0.5) − β*‖2 and ‖β(t) − β*‖2. Recall β̅(t+0.5) = M(β(t)) is defined in (5.1). The next lemma, which is adapted from Theorems 1 and 3 in Balakrishnan et al. (2014), characterizes the contraction property of the population version M-step defined in (3.1) or (3.2).
Lemma 5.2
Under the assumptions of Theorem 3.4, the following results hold for β(t) ∈ ℬ.
- For the maximization implementation of the M-step (Algorithm 2), we have
| (5.8) |
- For the gradient ascent implementation of the M-step (Algorithm 3), we have
| (5.9) |
Here γ1, γ2, μ and ν are defined in Conditions 3.1 and 3.2.
Proof
The proof strategy is to first characterize the M-step using Q(·; β*). According to Condition Concavity-Smoothness(μ, ν, ℬ), −Q(·; β*) is ν-strongly convex and μ-smooth, and thus enjoys desired optimization guarantees. Then Condition Lipschitz-Gradient-1(γ1, ℬ) or Lipschitz-Gradient-2(γ2, ℬ) is invoked to characterize the difference between Q(·; β*) and Q(·; β(t)). We provide the proof in §B.2 for the sake of completeness.
Equipped with Lemmas 5.1 and 5.2, we are now ready to prove Theorem 3.4.
Proof
To unify the subsequent proof for the maximization and gradient ascent implementations of the M-step, we employ ρ ∈ (0, 1) to denote γ1/ν in (5.8) or 1 − 2·(ν − γ2)/(ν + μ) in (5.9). By the definitions of β̅(t+0.5) and β(t+0.5) in (5.1) and Algorithm 4, Condition Statistical-Error(ε, δ/T, ŝ, n/T, ℬ) implies
holds with probability at least 1 − δ/T. Then by taking a union bound we have that the event
| (5.10) |
occurs with probability at least 1 − δ. Conditioning on ℰ, in the following we prove that
| (5.11) |
by mathematical induction.
Before we lay out the proof, we first prove β(0) ∈ ℬ. Recall βinit is the initialization of Algorithm 4 and R is the radius of the basin of attraction ℬ. By the assumption of Theorem 3.4, we have
| (5.12) |
Therefore, (5.12) implies ‖βinit − β*‖2 < κ·‖β*‖2 since R = κ·‖β*‖2. Invoking the auxiliary result in Lemma B.1, we obtain
| (5.13) |
Here the second inequality is from (5.12) as well as the assumption in (3.8) or (3.11), which implies s*/ŝ ≤ (1 − κ)2/[4·(1+κ)2] ≤ 1/4. Thus, (5.13) implies β(0) ∈ ℬ. In the sequel, we prove that (5.11) holds for t = 1. By invoking Lemma 5.2 and setting t = 0 in (5.8) or (5.9), we obtain
where the second inequality is from (5.13). Hence, the assumption in (5.4) of Lemma 5.1 holds for β̅(0.5). Furthermore, by the assumptions in (3.8), (3.9), (3.11) and (3.12) of Theorem 3.4, we can also verify that the assumptions in (5.5) of Lemma 5.1 hold conditioning on the event ℰ defined in (5.10). By invoking Lemma 5.1 we have that (5.6) holds for t = 0. Further plugging ‖β(t+0.5) − β̅(t+0.5)‖∞ ≤ ε in (5.10) into (5.6) with t = 0, we obtain
| (5.14) |
Setting t = 0 in (5.8) or (5.9) of Lemma 5.2 and then plugging (5.8) or (5.9) into (5.14), we obtain
| (5.15) |
For t = 0, plugging (5.3) into term (i) in (5.2), and (5.15) into term (ii) in (5.2), and then applying ‖β(t+0.5) − β̅(t+0.5)‖∞ ≤ ε with t = 0 in (5.10), we obtain
| (5.16) |
By our assumption that ŝ ≥ 16 ·(1/ρ − 1)−2 ·s* in (3.8) or (3.11), we have in (5.16). Hence, from (5.16) we obtain
| (5.17) |
which implies that (5.11) holds for t = 1, since we have in (5.11).
Suppose we have that (5.11) holds for some t ≥ 1. By (5.11) we have
| (5.18) |
where the second inequality is from (5.13) and our assumption in (3.9) or (3.12). Therefore, by (5.18) we have β(t) ∈ ℬ. Then (5.8) or (5.9) in Lemma 5.2 implies
where the third inequality is from ρ ∈ (0, 1). Following the same proof for (5.17), we obtain
Here the second inequality is obtained by plugging in (5.11) for t. Hence we have that (5.11) holds for t + 1. By induction, we conclude that (5.11) holds conditioning on the event ℰ defined in (5.10), which occurs with probability at least 1 − δ. By plugging the specific definitions of ρ into (5.11), and applying ‖β(0) − β*‖2 ≤ R in (5.13) to the right-hand side of (5.11), we obtain (3.10) and (3.13).
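For intuition, the induction above amounts to unrolling a contraction-plus-error recursion. A generic version, written with a placeholder contraction factor ρ̃ and statistical error level ε̃ rather than the exact constants of (5.11), reads:

```latex
% Generic unrolling of a contraction-plus-error recursion; the constants here
% are placeholders and do not reproduce the exact form of (5.11).
\[
  \|\beta^{(t+1)} - \beta^*\|_2 \;\le\; \widetilde{\rho}\,\|\beta^{(t)} - \beta^*\|_2 + \widetilde{\varepsilon}
  \quad\Longrightarrow\quad
  \|\beta^{(t)} - \beta^*\|_2 \;\le\; \widetilde{\rho}^{\,t}\,\|\beta^{(0)} - \beta^*\|_2
    + \frac{\widetilde{\varepsilon}}{1-\widetilde{\rho}},
  \qquad \widetilde{\rho}\in(0,1).
\]
```

The first term decays geometrically in t, while the second term is the statistical error floor, which matches the qualitative behavior observed in Figure 1.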
5.2 Proof of Results for Inference
Proof of Theorem 4.6
We establish the asymptotic normality of the decorrelated score statistic defined in (2.11) in two steps. We first prove the asymptotic normality of , where β̂0 is defined in (2.11) and Sn(·, ·) is defined in (2.9). Then we prove that −[Tn(β̂0)]α|γ defined in (2.11) is a consistent estimator of the asymptotic variance. The next lemma accomplishes the first step. Recall that I(β*) = −𝔼β*[∇2ℓn(β*)]/n is the Fisher information for ℓn(β*) defined in (2.2).
Lemma 5.3
Under the assumptions of Theorem 4.6, we have that for n → ∞,
where [I(β*)]α|γ is defined in (4.2).
Proof
Our proof consists of two steps. Note that by the definition in (2.9) we have
| (5.19) |
Recall that w* is defined in (4.1). At the first step, we prove
| (5.20) |
In other words, replacing β̂0 and w(β̂0, λ) in (5.19) with the corresponding population quantities β* and w* only introduces an oℙ(1) error term. Meanwhile, by Lemma 2.1 we have ∇1Qn(β*; β*) = ∇ℓn(β*)/n. Recall that ℓn(·) is the log-likelihood defined in (2.2), which implies that in (5.20)
is a (rescaled) average of n i.i.d. random variables. At the second step, we calculate the mean and variance of each term within this average and invoke the central limit theorem. Finally we combine these two steps by invoking Slutsky’s theorem. See §C.3 for a detailed proof.
The next lemma establishes the consistency of −[Tn(β̂0)]α|γ for estimating [I(β*)]α|γ. Recall that [Tn(β̂0)]α|γ ∈ ℝ and [I(β*)]α|γ ∈ ℝ are defined in (2.11) and (4.2) respectively.
Lemma 5.4
Under the assumptions of Theorem 4.6, we have
| (5.21) |
Proof
For notational simplicity, we abbreviate w(β̂0, λ) in the definition of [Tn(β̂0)]α|γ as ŵ0. By (2.11) and (4.3), we have
First, we establish the relationship between ŵ0 and w* by analyzing the Dantzig selector in (2.10). Meanwhile, by Lemma 2.1 we have 𝔼β* [Tn(β*)] = −I(β*). Then by triangle inequality we have
We prove term (i) is oℙ(1) by quantifying the Lipschitz continuity of Tn(·) using Condition 4.4. We then prove term (ii) is oℙ(1) by concentration analysis. Together with the result on the relationship between ŵ0 and w* we establish (5.21). See §C.4 for a detailed proof.
Combining Lemmas 5.3 and 5.4 using Slutsky’s theorem, we obtain Theorem 4.6.
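To make the construction concrete, the sketch below assembles a studentized decorrelated score statistic from the two ingredients above: Lemma 5.3 for the numerator and Lemma 5.4 for the variance estimate. It assumes that the decorrelated score in (2.9) takes the form [∇1Qn(β; β)]α − w⊤[∇1Qn(β; β)]γ and that the decorrelated information in (2.11) takes the form [Tn(β)]α,α − w⊤[Tn(β)]γ,α; these forms are inferred from the proofs in §C.3 and §C.4 and may differ from the paper's exact definitions in normalization or sign, so the code is only schematic.

```python
import numpy as np

def decorrelated_score_statistic(grad_Q, T_n, w_hat, n, alpha_idx=0):
    """Assemble a studentized decorrelated score statistic (schematic, not the paper's exact definition).

    grad_Q : the gradient grad_1 Q_n(beta_hat_0; beta_hat_0), shape (d,)
    T_n    : the matrix T_n(beta_hat_0) from (2.8), shape (d, d)
    w_hat  : the Dantzig-type estimate w(beta_hat_0, lambda), shape (d - 1,)
    """
    d = grad_Q.shape[0]
    g = [j for j in range(d) if j != alpha_idx]
    S_n = grad_Q[alpha_idx] - w_hat @ grad_Q[g]               # assumed form of (2.9)
    T_partial = T_n[alpha_idx, alpha_idx] - w_hat @ T_n[g, alpha_idx]  # assumed form of (2.11)
    # Lemma 5.3: sqrt(n) * S_n is asymptotically N(0, [I]_{alpha|gamma});
    # Lemma 5.4: -T_partial is a consistent estimator of [I]_{alpha|gamma}.
    return np.sqrt(n) * S_n / np.sqrt(-T_partial)
```

Under the null hypothesis, the resulting statistic would be compared against standard normal quantiles, as in the decorrelated score test proposed in §2.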
Proof of Theorem 4.7
Similar to the proof of Theorem 4.6, we first prove that is asymptotically normal, where α̅(β̂, λ) is defined in (2.18), while β̂ is the estimator attained by the high dimensional EM algorithm (Algorithm 4). By Lemma 5.4, we then show that −{[Tn(β̂)]α|γ}−1 is a consistent estimator of the asymptotic variance. Here recall that [Tn(β̂)]α|γ is defined by replacing β̂0 with β̂ in (2.11). The following lemma accomplishes the first step.
Lemma 5.5
Under the assumptions of Theorem 4.7, we have that for n → ∞,
| (5.22) |
where [I(β*)]α|γ > 0 is defined in (4.2).
Proof
Our proof consists of two steps. Similar to the proof of Lemma 5.3, we first prove
Then we prove that, as n → ∞,
Combining the two steps by Slutsky’s theorem, we obtain (5.22). See §C.5 for a detailed proof.
Now we accomplish the second step for proving Theorem 4.7. Note that by replacing β̂0 with β̂ in the proof of Lemma 5.4, we have
| (5.23) |
Meanwhile, by the definitions of w* and [I(β*)]α|γ in (4.1) and (4.2), we have
| (5.24) |
Note that (4.4) in Assumption 4.5 implies that I(β*) is positive definite, which yields [I(β*)]α|γ > 0 by (5.24). Also, (4.4) gives [I(β*)]α|γ = O(1) and . Hence, by (5.23) we have
| (5.25) |
Combining (5.22) in Lemma 5.5 and (5.25) by invoking Slutsky’s theorem, we obtain Theorem 4.7.
6 Numerical Results
In this section, we lay out numerical results illustrating the computational and statistical efficiency of the methods proposed in §2. In §6.1 and §6.2, we present the results for parameter estimation and asymptotic inference respectively. Throughout §6, we focus on the high dimensional EM algorithm without resampling, which is illustrated in Algorithm 1.
6.1 Computation and Estimation
We empirically study two latent variable models: Gaussian mixture model and mixture of regression model. The first synthetic dataset is generated according to the Gaussian mixture model defined in (2.24). The second one is generated according to the mixture of regression model defined in (2.25). We set d = 256 and s* = 5. In particular, the first 5 entries of β* are set to {4, 4, 4, 6, 6}, while the remaining 251 entries are set to zero. For Gaussian mixture model, the standard deviation of V in (2.24) is set to σ = 1. For mixture of regression model, the standard deviation of the Gaussian noise V in (2.25) is σ = 0.1. For both datasets, we generate n = 100 data points.
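The two synthetic datasets can be generated directly from the model descriptions used in the proofs (yi = zi·β* + vi for the Gaussian mixture model in (2.24), and yi = zi·⟨β*, xi⟩ + vi with xi ~ N(0, Id) for the mixture of regression model in (2.25), where zi is a Rademacher variable). The sketch below uses hypothetical helper names and omits the call to the EM algorithm itself.

```python
import numpy as np

def make_beta_star(d=256):
    """beta* = [4, 4, 4, 6, 6, 0, ..., 0] in R^d, so that s* = 5."""
    beta = np.zeros(d)
    beta[:5] = [4, 4, 4, 6, 6]
    return beta

def sample_gmm(n, beta_star, sigma=1.0, rng=None):
    """Gaussian mixture model (2.24): y_i = z_i * beta* + v_i, z_i Rademacher, v_i ~ N(0, sigma^2 I)."""
    if rng is None:
        rng = np.random.default_rng(0)
    z = rng.choice([-1.0, 1.0], size=n)
    v = rng.normal(scale=sigma, size=(n, beta_star.size))
    return z[:, None] * beta_star + v

def sample_mor(n, beta_star, sigma=0.1, rng=None):
    """Mixture of regression model (2.25): y_i = z_i * <beta*, x_i> + v_i, x_i ~ N(0, I_d)."""
    if rng is None:
        rng = np.random.default_rng(0)
    x = rng.normal(size=(n, beta_star.size))
    z = rng.choice([-1.0, 1.0], size=n)
    y = z * (x @ beta_star) + rng.normal(scale=sigma, size=n)
    return x, y

beta_star = make_beta_star()
Y_gmm = sample_gmm(n=100, beta_star=beta_star)          # dataset for the Gaussian mixture experiment
X_mor, y_mor = sample_mor(n=100, beta_star=beta_star)   # dataset for the mixture of regression experiment
```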
We apply the proposed high dimensional EM algorithm to both datasets. In particular, we apply the maximization implementation of the M-step (Algorithm 2) to Gaussian mixture model, and the gradient ascent implementation of the M-step (Algorithm 3) to mixture of regression model. We set the stepsize in Algorithm 3 to be η = 1. In Figure 1, we illustrate the evolution of the optimization error ‖β(t) − β̂‖2 and the overall estimation error ‖β(t) − β*‖2 with respect to the iteration index t. Here β̂ is the final estimator and β* is the true parameter.
Figure 1.
Evolution of the optimization error ‖β(t) − β̂‖2 and overall estimation error ‖β(t) − β*‖2 (in logarithmic scale). (a) The high dimensional EM algorithm for Gaussian mixture model (with the maximization implementation of the M-step). (b) The high dimensional EM algorithm for mixture of regression model (with the gradient ascent implementation of the M-step). Here note that in (a) the optimization error for t = 5, …, 10 is truncated due to arithmetic underflow.
Figure 1 illustrates the geometric decay of the optimization error, as predicted by Theorem 3.4. As the optimization error decreases to zero, β(t) converges to the final estimator β̂ and the overall estimation error ‖β(t) − β*‖2 converges to the statistical error ‖β̂ − β*‖2. In Figure 2, we illustrate the statistical rate of convergence of ‖β̂ − β*‖2. In particular, we plot ‖β̂ − β*‖2 against with varying s* and n. Figure 2 illustrates that the statistical error exhibits a linear dependency on . Hence, the proposed high dimensional EM algorithm empirically achieves an estimator with the optimal statistical rate of convergence.
Figure 2.
Statistical error ‖β̂ − β*‖2 plotted against with fixed d=128 and varying s* and n. (a) The high dimensional EM algorithm for Gaussian mixture model (with the maximization implementation of the M-step). (b) The high dimensional EM algorithm for mixture of regression model (with the gradient ascent implementation of the M-step).
6.2 Asymptotic Inference
To examine the validity of the proposed hypothesis testing procedures, we consider the same setting as in §6.1. Recall that we have β* = [4, 4, 4, 6, 6, 0, …, 0]⊤ ∈ ℝ256. We consider the null hypothesis . We construct the decorrelated score and Wald statistics based on the estimator obtained in the previous section, and fix the significance level at δ = 0.05. We repeat the experiment 500 times and take the rejection rate as an estimate of the type I error. In Table 1, we report the type I errors of the decorrelated score and Wald tests. Table 1 illustrates that the type I errors achieved by the two tests are close to the significance level, which validates the proposed hypothesis testing procedures.
Table 1.
Type I errors of the decorrelated score and Wald tests
| | Gaussian Mixture Model | Mixture of Regression Model |
|---|---|---|
| Decorrelated Score Test | 0.052 | 0.050 |
| Decorrelated Wald Test | 0.049 | 0.049 |
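The rejection rates in Table 1 are Monte Carlo estimates: each cell is the fraction of 500 null replications on which a two-sided level-0.05 test rejects. The sketch below shows only these mechanics; the placeholder statistic is drawn as a standard normal (the asymptotic null law of the decorrelated statistics) because reproducing one replication of the full EM-plus-inference pipeline is beyond this snippet.

```python
import numpy as np
from scipy.stats import norm

def estimate_type1_error(simulate_statistic, n_reps=500, delta=0.05, rng=None):
    """Rejection frequency of a two-sided level-delta test over n_reps null replications."""
    if rng is None:
        rng = np.random.default_rng(0)
    crit = norm.ppf(1 - delta / 2)                       # two-sided critical value
    stats = np.array([simulate_statistic(rng) for _ in range(n_reps)])
    return np.mean(np.abs(stats) > crit)

# placeholder: under H0 the decorrelated score/Wald statistics are asymptotically N(0, 1),
# so a standard normal draw stands in for one replication of the full pipeline
print(estimate_type1_error(lambda rng: rng.standard_normal()))
```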
7 Conclusion
We propose a novel high dimensional EM algorithm which naturally incorporates sparsity structure. Our theory shows that, with a suitable initialization, the proposed algorithm converges at a geometric rate and achieves an estimator with the (near-)optimal statistical rate of convergence. Beyond point estimation, we further propose the decorrelated score and Wald statistics for testing hypotheses and constructing confidence intervals for low dimensional components of high dimensional parameters. We apply the proposed algorithmic framework to a broad family of high dimensional latent variable models. For these models, our framework establishes the first computationally feasible approach for optimal parameter estimation and asymptotic inference under high dimensional settings. Thorough numerical simulations are provided to back up our theory.
Acknowledgments
The authors are grateful for the support of NSF CAREER Award DMS1454377, NSF IIS1408910, NSF IIS1332109, NIH R01MH102339, NIH R01GM083084, and NIH R01HG06841.
Appendix
A Applications to Latent Variable Models
We provide the specific forms of Qn(·; ·), Mn(·) and Tn(·) for the models defined in §2.3. Recall that Qn(·; ·) is defined in (2.5), Mn(·) is defined in Algorithms 2 and 3, and Tn(·) is defined in (2.8).
A.1 Gaussian Mixture Model
Let y1, …, yn be the n realizations of Y. For the E-step (line 4 of Algorithm 1), we have
| (A.1) |
The maximization implementation (Algorithm 2) of the M-step takes the form
| (A.2) |
Meanwhile, for the gradient ascent implementation (Algorithm 3) of the M-step, we have
Here η > 0 is the stepsize. For asymptotic inference, Tn(·) in (2.8) takes the form
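Putting the E-step weight and the M-step update for this model together, the truncated EM iteration admits a compact sketch. It assumes that ωβ(y) in (A.1) is the posterior probability of the positive label, ωβ(y) = 1/(1 + exp(−2⟨β, y⟩/σ2)), and that (A.2) reduces to Mn(β) = n−1 Σi (2ωβ(yi) − 1)·yi; these forms are consistent with the bound |2·ωβ(yi) − 1| ≤ 1 used around (B.39)–(B.42), but they are stated here as assumptions rather than verbatim restatements.

```python
import numpy as np
from scipy.special import expit

def omega(beta, Y, sigma):
    """Posterior weight of the positive component for each y_i (assumed logistic form of (A.1))."""
    return expit(2.0 * (Y @ beta) / sigma**2)

def m_step_gmm(beta, Y, sigma):
    """Maximization M-step, assumed form M_n(beta) = (1/n) sum_i (2*omega_i - 1) * y_i (cf. (A.2))."""
    w = 2.0 * omega(beta, Y, sigma) - 1.0
    return (w[:, None] * Y).mean(axis=0)

def trunc_top(beta, s_hat):
    """Truncation step: keep the s_hat largest-magnitude entries (line 6 of Algorithm 4)."""
    out = np.zeros_like(beta)
    keep = np.argsort(np.abs(beta))[::-1][:s_hat]
    out[keep] = beta[keep]
    return out

def high_dim_em_gmm(Y, beta_init, sigma, s_hat, n_iter=10):
    """A sketch of the truncated EM iteration for the Gaussian mixture model (no resampling)."""
    beta = beta_init.copy()
    for _ in range(n_iter):
        beta = trunc_top(m_step_gmm(beta, Y, sigma), s_hat)
    return beta
```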
A.2 Mixture of Regression Model
Let y1, …, yn and x1, …, xn be the n realizations of Y and X. For the E-step (line 4 of Algorithm 1), we have
| (A.3) |
For the maximization implementation (Algorithm 2) of the M-step (line 5 of Algorithm 1), we have that Mn(β) = argmaxβ′ Qn(β′; β) satisfies
| (A.4) |
However, in high dimensional regimes, the sample covariance matrix Σ̂ is not invertible. To estimate the inverse covariance matrix of X, we use the CLIME estimator proposed by Cai et al. (2011), i.e.,
| (A.5) |
where ‖·‖1,1 and ‖·‖∞,∞ are the sum and maximum of the absolute values of all entries respectively, and λCLIME > 0 is a tuning parameter. Based on (A.4), we modify the maximization implementation of the M-step to be
| (A.6) |
For the gradient ascent implementation (Algorithm 3) of the M-step, we have
| (A.7) |
Here η > 0 is the stepsize. For asymptotic inference, Tn(·) in (2.8) takes the form
It is worth noting that, for the maximization implementation of the M-step, the CLIME estimator in (A.5) requires that Σ−1 is sparse, where Σ is the population covariance of X. Since we assume X ~ N(0, Id), this requirement is satisfied. Nevertheless, for more general settings where Σ−1 doesn’t possess such a sparsity structure, the gradient ascent implementation of the M-step is a better choice, since it doesn’t require inverse covariance estimation and is also more efficient in computation.
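Analogously, a sketch of the gradient ascent M-step for mixture of regression. It assumes that ωβ(x, y) in (A.3) is the logistic posterior 1/(1 + exp(−2y⟨β, x⟩/σ2)) and that the gradient in (A.7) takes the form ∇1Qn(β; β) = n−1 Σi [(2ωβ(xi, yi) − 1)·yi·xi − xi xi⊤β]; this matches the terms analyzed around (B.56) but is an assumption rather than a verbatim restatement of (A.7).

```python
import numpy as np
from scipy.special import expit

def omega_mor(beta, X, y, sigma):
    """Posterior weight of z_i = +1 for mixture of regression (assumed logistic form of (A.3))."""
    return expit(2.0 * y * (X @ beta) / sigma**2)

def grad_ascent_m_step(beta, X, y, sigma, eta):
    """One gradient ascent M-step (cf. Algorithm 3 and (A.7), under the assumed gradient form)."""
    n = y.size
    w = 2.0 * omega_mor(beta, X, y, sigma) - 1.0
    # (1/n) sum_i [(2*omega_i - 1) * y_i * x_i  -  x_i x_i^T beta]
    grad = (X.T @ (w * y)) / n - (X.T @ (X @ beta)) / n
    return beta + eta * grad
```

In Algorithm 4, this update would be followed by the truncation step trunc(·, supp(·, ŝ)), exactly as in the Gaussian mixture sketch above.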
A.3 Regression with Missing Covariates
For notational simplicity, we define zi ∈ ℝd (i = 1, …, n) as the vector with zi,j = 1 if xi,j is observed and zi,j = 0 if xi,j is missing. Let be the subvector corresponding to the observed component of xi. For the E-step (line 4 of Algorithm 1), we have
| (A.8) |
Here mβ(·, ·) ∈ ℝd and Kβ(·, ·) ∈ ℝd×d are defined as
| (A.9) |
| (A.10) |
where ⊙ denotes the Hadamard product and diag(1−zi) is defined as the d×d diagonal matrix with the entries of 1−zi ∈ ℝd on its diagonal. Note that maximizing Qn(β′; β) over β′ requires inverting , which may not be invertible in high dimensions. Thus, we consider the gradient ascent implementation (Algorithm 3) of the M-step (line 5 of Algorithm 1), in which we have
| (A.11) |
Here η > 0 is the stepsize. For asymptotic inference, we can similarly calculate Tn(·) according to its definition in (2.8). However, here we omit the detailed form of Tn(·) since it is overly complicated.
B Proof of Results for Computation and Estimation
We provide the detailed proof of the main results in §3 for computation and parameter estimation. We first lay out the proof for the general framework, and then the proof for specific models.
B.1 Proof of Lemma 5.1
Proof
Recall β̅(t+0.5) and β̅(t+1) are defined in (5.1). Note that in (5.4) of Lemma 5.1 we assume
| (B.1) |
which implies
| (B.2) |
For notational simplicity, we define
| (B.3) |
Note that θ̅ and θ* are unit vectors, while θ is not, since it is obtained by normalizing β(t+0.5) with ‖β̅(t+0.5)‖2. Recall that the supp(·, ·) function is defined in (2.6). Hence we have
| (B.4) |
where the last equality follows from line 6 of Algorithm 4. To ease the notation, we define
| (B.5) |
Let s1 = |ℐ1|, s2 = |ℐ2| and s3 = |ℐ3| correspondingly. Also, we define Δ = 〈θ̅, θ*〉. Note that
| (B.6) |
Here the first equality is from supp(θ*) = 𝒮*, the second equality is from (B.5) and the last inequality is from Cauchy-Schwarz inequality. Furthermore, from (B.6) we have
| (B.7) |
To obtain the second inequality, we expand the square and apply 2ab ≤ a2 + b2. In the equality and the last inequality of (B.7), we use the fact that θ* and θ̅ are both unit vectors.
By (2.6) and (B.4), 𝒮̂(t+0.5) contains the index j’s with the top ŝ largest . Therefore, we have
| (B.8) |
because from (B.5) we have ℐ3 ⊆ 𝒮̂(t+0.5) and ℐ1 ∩ 𝒮̂(t+0.5) = ∅. Taking square roots of both sides of (B.8) and then dividing them by ‖β̅(t+0.5)‖2 (which is nonzero according to (B.2)), by the definition of θ in (B.3) we obtain
| (B.9) |
Equipped with (B.9), we now quantify the relationship between ‖θ̅ℐ3‖2 and ‖θ̅ℐ1‖2. For notational simplicity, let
| (B.10) |
Note that we have
which implies
| (B.11) |
where the second inequality is obtained from (B.9), while the first and third are from triangle inequality. Plugging (B.11) into (B.7), we obtain
Since by definition we have Δ = 〈θ̅, θ*〉 ∈ [−1, 1], solving for ‖θ̅ℐ1‖2 in the above inequality yields
| (B.12) |
Here we employ the fact that s1 ≤ s* and s1/s3 ≤ (s1 + s2)/(s3 + s2) = s*/ŝ, which follows from (B.5) and our assumption in (5.5) that s*/ŝ ≤ (1 − κ)2/[4·(1 + κ)2] < 1.
In the following, we prove that the right-hand side of (B.12) is upper bounded by Δ, i.e.,
| (B.13) |
We can verify that a sufficient condition for (B.13) to hold is that
| (B.14) |
which is obtained by solving for Δ in (B.13). In solving for Δ, we use the fact that , which holds because
| (B.15) |
The first inequality is from our assumption in (5.5) that s*/ŝ ≤ (1 − κ)2/[4·(1 + κ)2] < 1. The equality is from the definition of ε̃ in (B.10). The second inequality follows from our assumption in (5.5) that
and the first inequality in (B.2). To prove the last inequality in (B.15), we note that (B.1) implies
This together with (B.3) implies
| (B.16) |
where in the second inequality we use both sides of (B.2). In summary, we have that (B.15) holds. Now we verify that (B.14) holds. By (B.15) we have
which implies . For the right-hand side of (B.14) we have
| (B.17) |
where the last inequality is obtained by plugging in . Meanwhile, note that we have
| (B.18) |
where the first inequality is from our assumption in (5.5) that s*/ŝ ≤ (1 − κ)2/[4·(1 + κ)2], while the last inequality is from (B.16). Combining (B.17) and (B.18), we then obtain (B.14). By (B.14) we further establish (B.13), i.e., the right-hand side of (B.12) is upper bounded by Δ, which implies
| (B.19) |
Furthermore, according to (B.6) we have
| (B.20) |
where in the last inequality we use the fact θ* and θ̅ are unit vectors. Now we solve for in (B.20). According to (B.19) and the fact that , on the right-hand side of (B.20) we have . Thus, we have
Further by solving for in the above inequality, we obtain
| (B.21) |
where in the second inequality we use the fact that Δ ≤ 1, which follows from its definition, while in the last inequality we plug in (B.12). Then combining (B.12) and (B.21), we obtain
| (B.22) |
Note that by (5.1) and the definition of θ̅ in (B.3), we have
Therefore, we have
where the second and third equalities follow from (B.5). Let χ̅ = ‖β̅(t+0.5)‖2 · ‖β*‖2. Plugging (B.22) into the right-hand side of the above inequality and then multiplying χ̅ on both sides, we obtain
| (B.23) |
For term (i) in (B.23), note that . By (B.3) and the definition that Δ = 〈θ̅, θ*〉, for term (i) we have
| (B.24) |
For term (ii) in (B.23), by the definition of ε̃ in (B.10) we have
| (B.25) |
where the last inequality is obtained from (B.2). Plugging (B.24) and (B.25) into (B.23), we obtain
| (B.26) |
Meanwhile, according to (5.1) we have that β̅(t+1) is obtained by truncating β̅(t+0.5), which implies
| (B.27) |
Subtracting two times both sides of (B.26) from (B.27), we obtain
We can easily verify that the above inequality implies
Taking square roots of both sides and utilizing the fact that , we obtain
| (B.28) |
where C > 0 is an absolute constant. Here we utilize the fact that and
both of which follow from our assumption that s*/ŝ ≤ (1 − κ)2/[4·(1 + κ)2] < 1 in (5.5). By (B.28) we conclude the proof of Lemma 5.1.
B.2 Proof of Lemma 5.2
In the following, we prove (5.8) and (5.9) for the maximization and gradient ascent implementations of the M-step, respectively.
Proof of (5.8)
To prove (5.8) for the maximization implementation of the M-step (Algorithm 2), note that by the self-consistency property (McLachlan and Krishnan, 2007) we have
| (B.29) |
Hence, β* satisfies the following first-order optimality condition
where ∇1Q(·, ·) denotes the gradient taken with respect to the first variable. In particular, it implies
| (B.30) |
Meanwhile, by (5.1) and the definition of M(·) in (3.1), we have
Hence we have the following first-order optimality condition
which implies
| (B.31) |
Combining (B.30) and (B.31), we then obtain
which implies
| (B.32) |
In the following, we establish upper and lower bounds for both sides of (B.32) correspondingly. By applying Condition Lipschitz-Gradient-1(γ1, ℬ), for the right-hand side of (B.32) we have
| (B.33) |
where the last inequality is from (3.3). Meanwhile, for the left-hand side of (B.32), we have
| (B.34) |
| (B.35) |
by (3.6) in Condition Concavity-Smoothness(μ, ν, ℬ). By adding (B.34) and (B.35), we obtain
| (B.36) |
Plugging (B.33) and (B.36) into (B.32), we obtain
which implies (5.8) in Lemma 5.2.
Proof of (5.9)
We turn to prove (5.9). The self-consistency property in (B.29) implies that β* is the maximizer of Q(·; β*). Furthermore, (3.5) and (3.6) in Condition Concavity-Smoothness(μ, ν, ℬ) ensure that −Q(·; β*) is μ-smooth and ν-strongly convex. By invoking standard optimization results for minimizing strongly convex and smooth objective functions, e.g., in Nesterov (2004), for stepsize η = 2/(ν + μ), we have
| (B.37) |
i.e., the gradient ascent step decreases the distance to β* by a multiplicative factor. Hence, for the gradient ascent implementation of the M-step, i.e., M(·) defined in (3.2), we have
| (B.38) |
where the last inequality is from (B.37) and (3.4) in Condition Lipschitz-Gradient-2(γ2, ℬ). Plugging η = 2/(ν + μ) into (B.38), we obtain
which implies (5.9). Thus, we conclude the proof of Lemma 5.2.
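For reference, the standard contraction bound from Nesterov (2004) that (B.37) instantiates can be written, in generic notation, as follows; (B.37) applies it to f = −Q(·; β*) with minimizer β*.

```latex
% Standard contraction of one gradient step on a nu-strongly convex, mu-smooth f
% with stepsize eta = 2/(nu + mu); (B.37) is this bound with f = -Q(.; beta*).
\[
  \bigl\| x - \eta \nabla f(x) - x^* \bigr\|_2
  \;\le\; \frac{\mu - \nu}{\mu + \nu}\, \| x - x^* \|_2 ,
  \qquad \eta = \frac{2}{\nu + \mu}.
\]
```

Combining this contraction factor with the γ2 perturbation allowed by Condition Lipschitz-Gradient-2(γ2, ℬ) yields the rate 1 − 2·(ν − γ2)/(ν + μ) appearing in (5.9).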
B.3 Auxiliary Lemma for Proving Theorem 3.4
The following lemma characterizes the initialization step in line 2 of Algorithm 4.
Lemma B.1
Suppose that we have ‖βinit − β*‖2 ≤ κ · ‖β*‖2 for some κ ∈ (0, 1). Assuming that ŝ ≥ 4 · (1 + κ)2/(1 − κ)2 · s*, we have .
Proof
Following the same proof of Lemma 5.1 with both β̅(t+0.5) and β(t+0.5) replaced with βinit, β̅(t+1) replaced with β(0) and 𝒮̂(t+0.5) replaced with 𝒮̂init, we reach the conclusion.
B.4 Proof of Lemma 3.6
Proof
Recall that Q(·; ·) is the expectation of Qn(·; ·). According to (A.1) and (3.1), we have
with ωβ(·) being the weight function defined in (A.1), which together with (A.2) implies
| (B.39) |
Recall yi is the i-th realization of Y, which follows the mixture distribution. For any u > 0, we have
| (B.40) |
Based on (B.39), we apply the symmetrization result in Lemma D.4 to the right-hand side of (B.40). Then we have
| (B.41) |
where ξ1, …, ξn are i.i.d. Rademacher random variables that are independent of y1, …, yn. Then we invoke the contraction result in Lemma D.5 by setting
where u is the variable of the moment generating function in (B.40). From the definition of ωβ(·) in (A.1) we have |2 · ωβ(yi) − 1| ≤ 1, which implies
Therefore, by Lemma D.5 we obtain
| (B.42) |
for the right-hand side of (B.41), where j ∈ {1, …, d}. Here note that in Gaussian mixture model we have , where zi is a Rademacher random variable and υi,j ~ N(0, σ2). Therefore, according to Example 5.8 in Vershynin (2010) we have and ‖υi,j‖ψ2 ≤ C · σ. Hence by Lemma D.1 we have
Since |ξi · yi,j| = |yi,j|, ξi · yi,j and yi,j have the same ψ2-norm. Because ξi is a Rademacher random variable independent of yi,j, we have 𝔼(ξi · yi,j) = 0. By Lemma 5.5 in Vershynin (2010), we obtain
| (B.43) |
Hence, for the right-hand side of (B.42) we have
| (B.44) |
Here the last inequality is obtained by plugging (B.43) with u′ = 2 · u/n and u′ = −2 · u/n respectively into the two terms. Plugging (B.44) into (B.42) and then into (B.41), we obtain
By Chernoff bound we have that, for all u > 0 and υ > 0,
Minimizing the right-hand side over u we obtain
Setting the right-hand side to be δ, we have that
holds for some absolute constants C, C′ and C″, which completes the proof of Lemma 3.6.
B.5 Proof of Lemma 3.9
In the sequel, we first establish the result for the maximization implementation of the M-step and then for the gradient ascent implementation.
Maximization Implementation
For the maximization implementation we need to estimate the inverse covariance matrix Θ* = Σ−1 with the CLIME estimator Θ̂ defined in (A.5). The following lemma from Cai et al. (2011) quantifies the statistical rate of convergence of Θ̂. Recall that ‖·‖1,∞ is defined as the maximum of the row ℓ1-norms of a matrix.
Lemma B.2
For Σ = Id and in (A.5), we have that
holds with high probability, where C and C′ are positive absolute constants.
Proof
See the proof of Theorem 6 in Cai et al. (2011) for details.
Now we are ready to prove (3.21) of Lemma 3.9.
Proof
Recall that Q(·; ·) is the expectation of Qn(·; ·). According to (A.3) and (3.1), we have
| (B.45) |
with ωβ(·, ·) being the weight function defined in (A.3), which together with (A.6) implies
Here Θ̂ is the CLIME estimator defined in (A.5). For notational simplicity, we denote
| (B.46) |
It is worth noting that both ω̅i and ω̅ depend on β. Note that we have
| (B.47) |
Analysis of Term (i)
For term (i) in (B.47), recall that by our model assumption we have Σ = Id, which implies Θ* = Σ−1 = Id. By Lemma B.2, for a sufficiently large sample size n, we have that
| (B.48) |
holds with probability at least 1 − δ/4.
Analysis of Term (ii)
For term (ii) in (B.47), we have that for u > 0,
| (B.49) |
where ξ1,…, ξn are i.i.d. Rademacher random variables. The last inequality follows from Lemma D.4. Furthermore, for the right-hand side of (B.49), we invoke the contraction result in Lemma D.5 by setting
where u is the variable of the moment generating function in (B.49). From the definitions in (A.3) and (B.46) we have |ω̅i| = |2 · ωβ(xi, yi) − 1 | ≤ 1, which implies
By Lemma D.5, we obtain
| (B.50) |
for j ∈ {1,…, d} on the right-hand side of (B.49). Recall that in mixture of regression model we have yi = zi · 〈β*, xi〉 + υi, where zi is a Rademacher random variable, υi ~ N(0, σ2), and xi ~ N(0, Id). Then by Example 5.8 in Vershynin (2010) we have ‖zi · 〈β*, xi〉‖ψ2 = ‖〈β*, xi〉‖ψ2 ≤ C · ‖β*‖2 and ‖υi,j‖ψ2 ≤ C′ · σ. By Lemma D.1 we further have
Note that we have ‖xi,j‖ψ2 ≤ C″ since xi,j ~ N(0, 1). Therefore, by Lemma D.2 we have
Since ξi is a Rademacher random variable independent of yi · xi,j, we have 𝔼(ξi ·yi ·xi,j) = 0. Hence, by Lemma 5.15 in Vershynin (2010), we obtain
| (B.51) |
for all |u′| ≤ C′/ max {, 1}. Hence we have
| (B.52) |
The last inequality is obtained by plugging (B.51) with u′ = 2·u/n and u′ = −2·u/n correspondingly into the two terms. Here . Plugging (B.52) into (B.50) and further into (B.49), we obtain
By Chernoff bound we have that, for all υ > 0 and ,
Minimizing the right-hand side over u, we have that, for ,
Setting the right-hand side of the above inequality to be δ/2, we have that
| (B.53) |
holds with probability at least 1 − δ/2 for a sufficiently large n.
Analysis of Term (iii)
For term (iii) in (B.47), by Lemma B.2 we have
| (B.54) |
with probability at least 1 − δ/4 for a sufficiently large n.
Analysis of Term (iv)
For term (iv) in (B.47), recall that by (B.45) and (B.46) we have
which implies
| (B.55) |
where the first inequality follows from triangle inequality and ‖·‖∞ ≤ ‖·‖2, the second inequality is from the proof of (5.8) in Lemma 5.2 with β̅(t+0.5) replaced with β and the fact that γ1/ν < 1 in (5.8), and the third inequality holds since in Condition Statistical-Error(ε, δ, s, n, ℬ) we suppose that β ∈ ℬ, and for mixture of regression model ℬ is specified in (3.20).
Plugging (B.48), (B.53), (B.54) and (B.55) into (B.47), by union bound we have that
holds with probability at least 1 − δ. Therefore, we conclude the proof of (3.21) in Lemma 3.9.
Gradient Ascent Implementation
In the following, we prove (3.22) in Lemma 3.9.
Proof
Recall that Q(·; ·) is the expectation of Qn(·; ·). According to (A.3) and (3.2), we have
with ωβ(·, ·) being the weight function defined in (A.3), which together with (A.7) implies
| (B.56) |
Here η > 0 denotes the stepsize in Algorithm 3.
Analysis of Term (i)
For term (i) in (B.56), we redefine ω̅i and ω̅ in (B.46) as
| (B.57) |
Note that |ω̅i| = |2 · ωβ(xi, yi)| ≤ 2. Following the same argument as for the upper bound of term (ii) in (B.47), under the new definitions of ω̅i and ω̅ in (B.57) we have that
holds with probability at least 1 − δ/2.
Analysis of Term (ii)
For term (ii) in (B.56), we have
For term (ii).a, recall by our model assumption we have 𝔼(X·X⊤) = Id and xi’s are the independent realizations of X. Hence we have
Since Xj, Xk ~ N(0, 1), according to Example 5.8 in Vershynin (2010) we have ‖Xj‖ψ2 = ‖Xk‖ψ2 ≤ C. By Lemma D.2, Xj ·Xk is a sub-exponential random variable with ‖Xj ·Xk‖ψ1 ≤ C′. Moreover, we have ‖Xj ·Xk − 𝔼(Xj · Xk)‖ψ1 ≤ C″ by Lemma D.3. Then by Bernstein’s inequality (Proposition 5.16 in Vershynin (2010)) and union bound, we have
for 0 < υ ≤ C′ and a sufficiently large sample size n. Setting its right-hand side to be δ/2, we have
holds with probability at least 1 − δ/2. For term (ii).b we have , since in Condition Statistical-Error(ε, δ, s, n, ℬ) we assume ‖β‖0 ≤ s. Furthermore, we have ‖β‖2 ≤ ‖β*‖2 + ‖β* − β‖2 ≤ (1 + 1/32) · ‖β*‖2, because in Condition Statistical-Error(ε, δ, s, n, ℬ) we assume that β ∈ ℬ, and for mixture of regression model ℬ is specified in (3.20).
Plugging the above results into the right-hand side of (B.56), by union bound we have that
holds with probability at least 1 − δ. Therefore, we conclude the proof of (3.22) in Lemma 3.9.
B.6 Proof of Lemma 3.12
Proof
Recall that Q(·; ·) is the expectation of Qn(·; ·). Let Xobs be the subvector corresponding to the observed component of X in (2.26). By (A.8) and (3.2), we have
with η > 0 being the stepsize in Algorithm 3, which together with (A.11) implies
| (B.58) |
Here Kβ(·, ·) ∈ ℝd×d and mβ(·, ·) ∈ ℝd are defined in (A.9) and (A.10). To ease the notation, let
| (B.59) |
Let the entries of Z∈ℝd be i.i.d. Bernoulli random variables, each of which is zero with probability pm, and z1,…, zn be the n independent realizations of Z. If zi,j = 1, then xi,j is observed, otherwise xi,j is missing (unobserved).
Analysis of Term (i)
For term (i) in (B.58), recall that we have
where diag(1 − Z) denotes the d × d diagonal matrix with the entries of 1 − Z ∈ ℝd on its diagonal, and ⊙ denotes the Hadamard product. Therefore, by union bound we have
| (B.60) |
According to Lemma B.3 we have, for all j ∈ {1,…, d}, m̅j is a zero-mean sub-Gaussian random variable with ‖m̅j‖ψ2 ≤ C · (1 + κ · r). Then by Lemmas D.2 and D.3 we have
Meanwhile, since |1 − Zj | ≤ 1, it holds that ‖(1 − Zj) · m̅j‖ψ2 ≤ ‖m̅j‖ψ2 ≤ C · (1 + κ · r). Similarly, by Lemmas D.2 and D.3 we have
In addition, for the first term on the right-hand side of (B.60) we have
where the first inequality is from Lemma D.3 and the second inequality follows from Example 5.8 in Vershynin (2010) since [diag(1−Z)]j,k ∈ [0, 1]. Applying Hoeffding’s inequality to the first term on the right-hand side of (B.60) and Bernstein’s inequality to the second and third terms, we obtain
Setting the two terms on the right-hand side of the above inequality to be δ/6 and δ/3 respectively, we have that
holds with probability at least 1 − δ/2, for sufficiently large constant C″ and sample size n.
Analysis of Term (ii)
For term (ii) in (B.58) we have
where the first inequality holds because in Condition Statistical-Error(ε, δ, s, n, ℬ) we assume ‖β‖0 ≤ s, while the last inequality holds since in Condition Statistical-Error(ε, δ, s, n, ℬ) we assume that β ∈ ℬ, and for regression with missing covariates ℬ is specified in (3.27).
Analysis of Term (iii)
For term (iii) in (B.58), by union bound we have
| (B.61) |
for a sufficiently large n. Here the second inequality is from Bernstein’s inequality, since we have
Here the first inequality follows from Lemmas D.2 and D.3 and the second inequality is from Lemma B.3 and the fact that , because X ~ N(0, Id) and V ~ N(0, σ2) are independent. Setting the right-hand side of (B.61) to be δ/2, we have that
holds with probability at least 1 − δ/2 for sufficiently large constant C and sample size n.
Finally, plugging in the upper bounds for terms (i)–(iii) in (B.58) we have that
holds with probability at least 1 − δ. Therefore, we conclude the proof of Lemma 3.12.
The following auxiliary lemma used in the proof of Lemma 3.12 characterizes the sub-Gaussian property of m̅ defined in (B.59).
Lemma B.3
Under the assumptions of Lemma 3.12, the random vector m̅ ∈ ℝd defined in (B.59) is sub-Gaussian with mean zero and ‖m̅j‖ψ2 ≤ C · (1 + κ · r) for all j ∈ {1,…, d}, where C > 0 is a sufficiently large absolute constant.
Proof
In the proof of Lemma 3.12, Z’s entries are i.i.d. Bernoulli random variables, which satisfy ℙ(Zj = 0) = pm for all j ∈ {1,…, d}. Meanwhile, from the definitions in (A.9) and (B.59), we have
Since we have Y = 〈β*, X〉 + V and − 〈β, Z ⊙ X〉 = − 〈β, X〉 + 〈β, (1 − Z) ⊙ X〉, it holds that
| (B.62) |
Since X ~ N(0, Id) is independent of Z, we can verify that 𝔼(m̅j) = 0. In the following we provide upper bounds for the ψ2-norms of terms (i)–(iv) in (B.62).
Analysis of Terms (i) and (ii)
For term (i), since Xj ~ N(0, 1), we have ‖Zj · Xj‖ψ2 ≤ ‖Xj‖ψ2 ≤ C, where the last inequality follows from Example 5.8 in Vershynin (2010). Meanwhile, for term (ii) in (B.62), we have that for any u > 0,
| (B.63) |
where the second last inequality is from the fact that ‖V‖ψ2 ≤ C″ · σ, while the last inequality holds because we have
| (B.64) |
According to Lemma 5.5 in Vershynin (2010), by (B.63) we conclude that the ψ2-norm of term (ii) in (B.62) is upper bounded by some absolute constant C > 0.
Analysis of Term (iii)
For term (iii), by the same conditioning argument in (B.63), we have
| (B.65) |
where we utilize the fact that ‖〈β* − β, X〉‖ψ2 ≤ C′· ‖β* − β‖2, since . Note that in Condition Statistical-Error(ε, δ, s, n, ℬ) we assume β ∈ ℬ, and for regression with missing covariates ℬ is specified in (3.27). Hence we have
where the first inequality follows from (3.27), and the second inequality follows from our assumption that ‖β*‖2/σ ≤ r on the maximum signal-to-noise ratio. By invoking (B.64), from (B.65) we have
which further implies that the ψ2-norm of term (iii) in (B.62) is upper bounded by C″ · κ · r.
Analysis of Term (iv)
For term (iv), note that 〈β, (1−Z) ⊙ X〉 = 〈(1−Z) ⊙ β, X〉. By invoking the same conditioning argument in (B.63), we have
| (B.66) |
where we utilize the fact that ‖〈(1 − Z) ⊙ β, X〉‖ψ2 ≤ C′· ‖(1 − Z) ⊙ β‖2 when conditioning on Z. By (B.66) we have that the ψ2-norm of term (iv) in (B.62) is upper bounded by C″ > 0.
Combining the above upper bounds for the ψ2-norms of terms (i)–(iv) in (B.62), by Lemma D.1 we have that
holds for a sufficiently large absolute constant C″>0. Therefore, we establish Lemma B.3.
C Proof of Results for Inference
In the following, we provide the detailed proof of the theoretical results for asymptotic inference in §4. We first present the proof of the general results, and then the proof for specific models.
C.1 Proof of Lemma 2.1
Proof
In the sequel we establish the two equations in (2.12) respectively.
Proof of the First Equation
According to the definition of the lower bound function Qn(·; ·) in (2.5), we have
| (C.1) |
Here kβ(z | yi) is the density of the latent variable Z conditioning on the observed variable Y = yi under the model with parameter β. Hence we obtain
| (C.2) |
where hβ(yi) is the marginal density function of Y evaluated at yi, and the second equality follows from the fact that
| (C.3) |
since kβ(z | yi) is the conditional density. According to the definition in (2.2), we have
| (C.4) |
where the last equality is from (2.1). Comparing (C.2) and (C.4), we obtain ∇1Qn(β; β)=∇ℓn(β)/n.
Proof of the Second Equation
For the second equation in (2.12), by (C.1) and (C.3) we have
By calculation we obtain
| (C.5) |
Here ⊗ denotes the vector outer product. Note that in (C.5) we have
| (C.6) |
where v⊗2 denotes v ⊗ v. Here S̃β(·, ·) is defined as
| (C.7) |
i.e., the score function for the complete likelihood, which involves both the observed variable Y and the latent variable Z. Meanwhile, in (C.5) we have
| (C.8) |
where in the last equality we utilize the fact that
| (C.9) |
because hβ(·) is the marginal density function of Y. By (C.3) and (C.7), for the right-hand side of (C.8) we have
| (C.10) |
Plugging (C.10) into (C.8) and then plugging (C.6) and (C.8) into (C.5) we obtain
Setting β = β* in the above equality, we obtain
| (C.11) |
Meanwhile, for β = β*, by the property of Fisher information we have
| (C.12) |
Here the last equality is obtained by taking β = β* in
where the second equality follows from (C.9), the third equality follows from (C.3), while the second last equality follows from (C.7). Combining (C.11) and (C.12), by the law of total variance we have
| (C.13) |
In the following, we prove
| (C.14) |
According to (C.1) we have
| (C.15) |
Let ℓ̃(β)=log fβ(Y, Z) be the complete log-likelihood, which involves both the observed variable Y and the latent variable Z, and Ĩ (β) be the corresponding Fisher information. By setting β = β* in (C.15) and taking expectation, we obtain
| (C.16) |
Since S̃β(Y, Z) defined in (C.7) is the score function for the complete log-likelihood ℓ̃(β), according to the relationship between the score function and Fisher information, we have
which together with (C.16) implies (C.14). By further plugging (C.14) into (C.13), we obtain
which establishes the first equality of the second equation in (2.12). In addition, the second equality of the second equation in (2.12) follows from the property of Fisher information. Thus, we conclude the proof of Lemma 2.1.
C.2 Auxiliary Lemmas for Proving Theorems 4.6 and 4.7
In this section, we lay out several lemmas on the Dantzig selector defined in (2.10). The first lemma, which is from Bickel et al. (2009), characterizes the cone condition for the Dantzig selector.
Lemma C.1
Any feasible solution w in (2.10) satisfies
where w(β, λ) is the minimizer of (2.10), 𝒮w is the support of w and is its complement.
Proof
See Lemma B.3 in Bickel et al. (2009) for a detailed proof.
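For completeness, the Dantzig-type program in (2.10) can be solved as a small linear program. The sketch below assumes it takes the form min ‖w‖1 subject to ‖[Tn(β)]γ,α − [Tn(β)]γ,γ·w‖∞ ≤ λ, which mirrors the population identity [I(β*)]γ,α − [I(β*)]γ,γ·w* = 0 used in the proof of Lemma C.3 below; the exact formulation in (2.10) may differ in sign or normalization, so treat this as schematic.

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(A, b, lam):
    """Solve  min ||w||_1  s.t.  ||b - A w||_inf <= lam  as an LP in (w, u).

    A sketch of a Dantzig-type program of the kind in (2.10), with
    A = [T_n(beta)]_{gamma,gamma} and b = [T_n(beta)]_{gamma,alpha}.
    """
    m, p = A.shape
    c = np.concatenate([np.zeros(p), np.ones(p)])        # minimize sum(u)
    I = np.eye(p)
    A_ub = np.vstack([
        np.hstack([I, -I]),                              #  w - u <= 0
        np.hstack([-I, -I]),                             # -w - u <= 0
        np.hstack([A, np.zeros((m, p))]),                #  A w <= b + lam
        np.hstack([-A, np.zeros((m, p))]),               # -A w <= -b + lam
    ])
    b_ub = np.concatenate([np.zeros(2 * p), b + lam, -b + lam])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (2 * p), method="highs")
    assert res.success, "LP did not converge; try a larger lam"
    return res.x[:p]
```

In the inferential procedures, ŵ = w(β̂, λ) obtained from such a program plays the role of the population quantity w*.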
In the sequel, we focus on analyzing w(β̂, λ). The results for w(β̂0, λ) can be obtained similarly. The next lemma characterizes the restricted eigenvalue of Tn(β̂), which is defined as
| (C.17) |
Here is the support of w* defined in (4.1).
Lemma C.2
Under Assumption 4.5 and Conditions 4.1, 4.3 and 4.4, for a sufficiently large sample size n, we have ρ̂min ≥ ρmin/2 > 0 with high probability, where ρmin is specified in (4.4).
Proof
By triangle inequality we have
| (C.18) |
where 𝒞 is defined in (C.17).
Analysis of Term (i)
For term (i) in (C.18), by (4.4) in Assumption 4.5 we have
| (C.19) |
Analysis of Term (ii)
For term (ii) in (C.18) we have
| (C.20) |
By the definition of 𝒞 in (C.17), for any v ∈ 𝒞 we have
Therefore, the right-hand side of (C.20) is upper bounded by
For term (ii).a, by Lemma 2.1 and Condition 4.3 we have
where the last equality is from (4.6) in Assumption 4.5, since for λ specified in (4.5) we have
For term (ii).b, by Conditions 4.1 and 4.4 we have
where the last equality is also from (4.6) in Assumption 4.5, since for λ specified in (4.5) we have
Hence, term (ii) in (C.18) is oℙ(1). Since ρmin is an absolute constant, for a sufficiently large n we have that term (ii) is upper bounded by ρmin/2 with high probability. Further by plugging this and (C.19) into (C.18), we conclude that ρ̂min ≥ ρmin/2 holds with high probability.
The next lemma quantifies the statistical accuracy of w(β̂, λ), where w(·, ·) is defined in (2.10).
Lemma C.3
Under Assumption 4.5 and Conditions 4.1–4.4, for λ specified in (4.5) we have that
holds with high probability. Here ρmin is specified in (4.4), while w* and are defined (4.1).
Proof
For λ specified in (4.5), we verify that w* is a feasible solution in (2.10) with high probability. For notational simplicity, we define the following event
| (C.21) |
By the definition of w* in (4.1), we have [I(β*)]γ,α − [I(β*)]γ,γ · w* = 0. Hence, we have
| (C.22) |
where the last inequality is from triangle inequality and Hölder’s inequality. Note that we have
| (C.23) |
On the right-hand side, by Lemma 2.1 and Condition 4.3 we have
while by Conditions 4.1 and 4.4 we have
Plugging the above equations into (C.23) and further plugging (C.23) into (C.22), by (4.5) we have
holds with high probability for a sufficiently large absolute constant C ≥ 1. In other words, ℰ occurs with high probability. The subsequent proof conditions on ℰ and the following event
| (C.24) |
which also occurs with high probability according to Lemma C.2. Here ρ̂min is defined in (C.17).
For notational simplicity, we denote w(β̂, λ)= ŵ. By triangle inequality we have
| (C.25) |
where the last inequality follows from (2.10) and (C.21). Moreover, by (C.17) and (C.24) we have
| (C.26) |
Meanwhile, by Lemma C.1 we have
Plugging the above inequality into (C.26), we obtain
| (C.27) |
Note that by (C.25), the left-hand side of (C.27) is upper bounded by
| (C.28) |
By (C.27) and (C.28), we then obtain conditioning on ℰ and ℰ′, both of which hold with high probability. Note that the proof for w(β̂0, λ) follows similarly. Therefore, we conclude the proof of Lemma C.3.
C.3 Proof of Lemma 5.3
Proof
Our proof strategy is as follows. First we prove that
| (C.29) |
where β* is the true parameter and w* is defined in (4.1). We then prove
| (C.30) |
where [I(β*)]α|γ is defined in (4.2). Throughout the proof, we abbreviate w(β̂0, λ) as ŵ0. Also, it is worth noting that our analysis is under the null hypothesis where β* = [α*, (γ*)⊤]⊤ with α* = 0.
Proof of (C.29)
For (C.29), by the definition of the decorrelated score function in (2.9) we have
By the mean-value theorem, we obtain
| (C.31) |
where we have as defined in (2.8), and β♯ is an intermediate value between β* and β̂0.
Analysis of Term (i)
For term (i) in (C.31), we have
| (C.32) |
For the right-hand side of (C.32), we have
| (C.33) |
By Lemma C.3, we have , where λ is specified in (4.5). Meanwhile, we have
where the first equality follows from the self-consistency property (McLachlan and Krishnan, 2007) that β* = argmaxβ Q(β; β*), which gives ∇1Q(β*; β*) = 0. Here the last equality is from Condition 4.2. Therefore, (C.33) implies
where the second equality is from in (4.6) of Assumption 4.5. Thus, by (C.32) we conclude that term (i) in (C.31) equals
Analysis of Term (ii)
By triangle inequality, term (ii) in (C.31) is upper bounded by
| (C.34) |
By Hölder’s inequality, term (ii).a in (C.34) is upper bounded by
| (C.35) |
where the first inequality holds because ŵ0 = w(β̂0, λ) is a feasible solution in (2.10). Meanwhile, Condition 4.1 gives ‖β̂ − β*‖1 = Oℙ(ζEM). Also note that by definition we have (β̂0)α = (β*)α = 0, which implies ‖β̂0 − β*‖1 ≤ ‖β̂ − β*‖1. Hence, we have
| (C.36) |
which implies the first equality in (C.35). The last equality in (C.35) follows from in (4.6) of Assumption 4.5. Note that term (ii).b in (C.34) is upper bounded by
| (C.37) |
For the first term on the right-hand side of (C.37), by triangle inequality we have
By Condition 4.4, we have
| (C.38) |
and
| (C.39) |
where the last inequality in (C.39) holds because β♯ is defined as an intermediate value between β* and β̂0. Further by plugging (C.36) into (C.38), (C.39) as well as the second term on the right-hand side of (C.37), we have that term (ii).b in (C.34) is Oℙ [ζL · (ζEM)2]. Moreover, by our assumption in (4.6) of Assumption 4.5 we have
Thus, we conclude that term (ii).b is . Similarly, term (ii).c in (C.34) is upper bounded by
| (C.40) |
By triangle inequality and Lemma C.3, the first term in (C.40) is upper bounded by
Meanwhile, for the second and third terms in (C.40), by the same analysis for term (ii).b in (C.34) we have
By (4.6) in Assumption 4.5, since , we have
Therefore, term (ii).c in (C.34) is . Hence, by (C.34) we conclude that term (ii) in (C.31) is . Combining the analysis for terms (i) and (ii) in (C.31), we then obtain (C.29). In the sequel, we turn to prove the second part on asymptotic normality.
Proof of (C.30)
Note that by Lemma 2.1, we have
| (C.41) |
Recall that ℓn(β*) is the log-likelihood function defined in (2.2). Hence, [1, −(w*)⊤] · ∇ℓn(β*)/n is the average of n independent random variables. Meanwhile, the score function has mean zero at β*, i.e., 𝔼[∇ℓn(β*)] = 0. For the variance of the rescaled average in (C.41), we have
Here the second equality is from the fact that the covariance of the score function equals the Fisher information (up to renormalization). Hence, the variance of each item in the average in (C.41) is
where the second and third equalities are from (4.1) and (4.2). Hence, by the central limit theorem we obtain (C.30). Finally, combining (C.29) and (C.30) by invoking Slutsky’s theorem, we obtain
which concludes the proof of Lemma 5.3.
C.4 Proof of Lemma 5.4
Proof
Throughout the proof, we abbreviate w(β̂0, λ) as ŵ0. Our proof is under the null hypothesis where β* = [α*, (γ*)⊤]⊤ with α* = 0. Recall that w* is defined in (4.1). Then by the definitions of [Tn(β̂0)]α|γ and [I(β*)]α|γ in (2.11) and (4.2), we have
By triangle inequality, we have
| (C.42) |
Analysis of Term (i)
For term (i) in (C.42), by Lemma 2.1 and triangle inequality we have
| (C.43) |
For term (i).a in (C.43), by Condition 4.4 we have
| (C.44) |
Note that we have (β̂0)α = (β*)α = 0 by definition, which implies ‖β̂0 − β*‖1 ≤ ‖β̂ − β*‖1. Hence, by Condition 4.1 we have
| (C.45) |
Moreover, combining (C.44) and (C.45), by (4.6) in Assumption 4.5 we have
for λ specified in (4.5). Hence we obtain
| (C.46) |
Meanwhile, for term (i).b in (C.43) we have
| (C.47) |
where the second last equality follows from Condition 4.3, while the last equality holds because our assumption in (4.6) of Assumption 4.5 implies
for λ specified in (4.5).
Analysis of Term (ii)
For term (ii) in (C.42), by Lemma 2.1 and triangle inequality, we have
| (C.48) |
By Hölder’s inequality, term (ii).a in (C.48) is upper bounded by
By Lemma C.3, we have . Meanwhile, we have
where the second equality follows from (C.46) and (C.47). Therefore, term (ii).a is oℙ(1), since (4.6) in Assumption 4.5 implies . Meanwhile, by Hölder’s inequality, term (ii).b in (C.48) is upper bounded by
| (C.49) |
By Lemma C.3, we have . Meanwhile, we have 𝔼β* [Tn(β*)] = −I(β*) by Lemma 2.1. Furthermore, (4.4) in Assumption 4.5 implies
| (C.50) |
where C>0 is some absolute constant. Therefore, from (C.49) we have that term (ii).b in (C.48) is . By (4.6) in Assumption 4.5, we have . Thus, we conclude that term (ii).b is oℙ(1). For term (ii).c, we have
Here the first and second inequalities are from Hölder’s inequality and triangle inequality, the first equality follows from (C.46) and (C.47), and the second equality holds because (4.6) in Assumption 4.5 implies
for λ specified in (4.5).
Analysis of Term (iii)
For term (iii) in (C.42), by (2.12) in Lemma 2.1 we have
| (C.51) |
For term (iii).a in (C.51), we have
| (C.52) |
For ‖ŵ0‖1 we have , where the equality holds because by Lemma C.3 we have . Meanwhile, on the right-hand side of (C.52) we have
Here the last equality is from (C.46) and (C.47). Hence, term (iii).a in (C.51) is . Note that
From (4.6) in Assumption 4.5 we have, for λ specified in (4.5), terms (i)–(iv) are all upper bounded by . Hence, we conclude term (iii).a in (C.51) is oℙ(1). Also, for term (iii).b in (C.51), we have
where the last equality follows from Lemma C.3 and (C.50). Moreover, by (4.6) in Assumption 4.5 we have . Therefore, we conclude that term (iii).b in (C.51) is oℙ(1). Combining the above analysis for terms (i)–(iii) in (C.42), we obtain
Thus we conclude the proof of Lemma 5.4.
C.5 Proof of Lemma 5.5
Proof
Our proof strategy is as follows. Recall that w* is defined in (4.1). First we prove
| (C.53) |
Here note that [I(β*)]α|γ is defined in (4.2) and β̂ = [α̂, γ̂⊤]⊤ is the estimator attained by the high dimensional EM algorithm (Algorithm 4). Then we prove
| (C.54) |
Proof of (C.53)
For notational simplicity, we define
| (C.55) |
By the definition of Sn(·,·) in (2.9), we have S̅n (β̂, ŵ) = Sn (β̂, λ). Let β̃ = (α*, γ̂⊤)⊤. The Taylor expansion of S̅n (β̂, ŵ) takes the form
| (C.56) |
where β♯ is an intermediate value between β̂ and β̃. By (C.55) and the definition of Tn(·) in (2.8), we have
| (C.57) |
By (C.57) and the definition in (2.18), we have
Further by (C.56) we obtain
| (C.58) |
Analysis of Term (i)
For term (i) in (C.58), in the sequel we first prove
| (C.59) |
By the definition of [I(β*)]α|γ in (4.2) and the definition of w* in (4.1), we have
Together with (C.57), by triangle inequality we obtain
| (C.60) |
Note that term (i).a in (C.60) is upper bounded by
| (C.61) |
Here the inequality is from Lemma 2.1 and triangle inequality. For the first term on the right-hand side of (C.61), by Conditions 4.1 and 4.4 we have
| (C.62) |
For the second term on the right-hand side of (C.61), by Condition 4.3 we have
| (C.63) |
Plugging (C.62) and (C.63) into (C.61), we obtain
| (C.64) |
By (4.6) in Assumption 4.5, we have
for λ specified in (4.5). Hence, we conclude that term (i).a in (C.60) is o𝕡(1). Meanwhile, for term (i).b in (C.60), by triangle inequality we have
| (C.65) |
For the first term on the right-hand side of (C.65), we have
The inequality is from Hölder’s inequality, the second last equality is from Lemma C.3 and (C.64), and the last equality holds because (4.6) in Assumption 4.5 implies
for λ specified in (4.5). For the second term on the right-hand side of (C.65), we have
where the second inequality is from Lemma C.3, while the last equality follows from (4.4) and (4.6) in Assumption 4.5. For the third term on the right-hand side of (C.65), we have
Here the inequality is from Hölder’s inequality, the first equality is from (C.64) and the last equality holds because (4.6) in Assumption 4.5 implies
for λ specified in (4.5). By (C.65), we conclude that term (i).b in (C.60) is oℙ(1). Hence, we obtain (C.59). Furthermore, for term (i) in (C.58), we then replace β̂0 = (0, γ̂⊤)⊤ with β̃ = (α*, γ̂⊤)⊤ and ŵ0 = w(β̂0, λ) with ŵ = w(β̂, λ) in the proof of (C.29) in §C.3. We obtain
which further implies
by (C.59) and Slutsky’s theorem.
Analysis of Term (ii)
Now we prove that term (ii) in (C.58) is oℙ(1). We have
| (C.66) |
For term (ii).a in (C.66), by Condition 4.1 and (4.6) in Assumption 4.5 we have
| (C.67) |
Meanwhile, by replacing β̂ with β̃ = (α*, γ̂⊤)⊤ in the proof of (C.59) we obtain
| (C.68) |
Combining (C.59) and (C.68), for term (ii).b in (C.66) we obtain
which together with (C.67) implies that term (ii) in (C.58) is oℙ(1). Combining the above results for terms (i) and (ii) in (C.58), we obtain (C.53).
Proof of (C.54)
By (C.30) in §C.3, we have
which implies (C.54). Finally, combining (C.53) and (C.54) with Slutsky’s theorem, we obtain
Thus we conclude the proof of Lemma 5.5.
C.6 Proof of Lemma 4.8
Proof
According to Algorithm 4, the final estimator β̂=β(T) has ŝ nonzero entries. Meanwhile, we have ‖β*‖0 = s* ≤ ŝ. Hence, we have . Invoking (3.19) in Theorem 3.7, we obtain ζEM.
For the Gaussian mixture model, the maximization implementation of the M-step takes the form
where ωβ(·) is defined in (A.1). Meanwhile, we have
Hence, we have ‖Mn(β) − M(β)‖∞ = ‖∇1Qn(β; β)−∇1Q(β; β)‖∞. By setting δ = 2/d in Lemma 3.6, we obtain ζG.
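To make the preceding M-step concrete, the following is a minimal numerical sketch in Python. The closed-form update Mn(β) = (1/n) ∑i (2 · ωβ(yi) − 1) · yi and the logistic weight ωβ(y) = 1/(1 + exp(−2⟨β, y⟩/σ2)) are assumptions consistent with the symmetric two-component Gaussian mixture; the authoritative definitions are (A.1) and the display above, which are not reproduced in this appendix.

import numpy as np

def m_step_gmm(Y, beta, sigma):
    # Assumed maximization M-step for the symmetric two-component Gaussian
    # mixture: M_n(beta) = (1/n) * sum_i (2 * omega_beta(y_i) - 1) * y_i,
    # with omega_beta(y) = 1 / (1 + exp(-2 <beta, y> / sigma^2)).
    w = 1.0 / (1.0 + np.exp(-2.0 * (Y @ beta) / sigma ** 2))
    return ((2.0 * w - 1.0)[:, None] * Y).mean(axis=0)

# Toy usage: simulate y_i = z_i * beta_star + noise and take one M-step.
rng = np.random.default_rng(0)
n, d, sigma = 2000, 10, 1.0
beta_star = np.zeros(d); beta_star[:3] = 1.0
z = rng.choice([-1.0, 1.0], size=n)
Y = z[:, None] * beta_star + sigma * rng.standard_normal((n, d))
print(m_step_gmm(Y, 0.5 * beta_star, sigma))  # the update moves toward beta_star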
C.7 Proof of Lemma 4.9
Proof
Recall that for the Gaussian mixture model we have
where ωβ(·) is defined in (A.1). Hence, by calculation we have
| (C.69) |
| (C.70) |
For notational simplicity we define
| (C.71) |
Then by the definition of Tn(·) in (2.8), from (C.69) and (C.70) we have
Applying the symmetrization result in Lemma D.4 to the right-hand side, we have that for u > 0,
| (C.72) |
where ξ1, …, ξn are i.i.d. Rademacher random variables that are independent of y1, …, yn. Then we invoke the contraction result in Lemma D.5 by setting
where u is the variable of the moment generating function in (C.72). By the definition in (C.71) we have |νβ*(yi)| ≤ 4/σ2, which implies
Therefore, applying the contraction result in Lemma D.5 to the right-hand side of (C.72), we obtain
| (C.73) |
Note that 𝔼β*(ξi · yi,j · yi,k) = 0, since ξi is a Rademacher random variable independent of yi,j · yi,k. Recall that in the Gaussian mixture model we have yi = zi · β* + υi, where zi is a Rademacher random variable and υi,j ~ N(0, σ2). Hence, by Example 5.8 in Vershynin (2010), we have ‖zi · β*j‖ψ2 ≤ C · |β*j| and ‖υi,j‖ψ2 ≤ C · σ. Therefore, by Lemma D.1 we have
| (C.74) |
Since |ξi · yi,j · yi,k|=|yi,j · yi,k|, by definition ξi · yi,j · yi,k and yi,j · yi,k have the same ψ1-norm. By Lemma D.2 we have
Then by Lemma 5.15 in Vershynin (2010), we obtain
| (C.75) |
for all . Note that on the right-hand side of (C.73), we have
| (C.76) |
By plugging (C.75) into the right-hand side of (C.76) with u′ = u · 4/(σ2 · n) and u′ = −u · 4/(σ2 · n), from (C.73) we have that for any j, k ∈ {1, …, d},
| (C.77) |
Therefore, by Chernoff bound we have that, for all υ > 0 and ,
where the last inequality is from (C.77). By minimizing its right-hand side over u, we conclude that for ,
Setting the right-hand side to be δ, we have that
holds with probability at least 1 − δ. By setting δ = 2/d, we conclude the proof of Lemma 4.9.
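For readers tracing the last three displays, the Chernoff step has the following generic form; the constants C and c below are placeholders rather than the ones implicit in (C.77). Suppose that 𝔼[exp(u · X)] ≤ exp(C · u2/n) for all 0 < u ≤ c · n. Then
\[
\mathbb{P}(X \ge \upsilon) \;\le\; e^{-u\upsilon}\,\mathbb{E}\big[e^{uX}\big] \;\le\; \exp\big(-u\upsilon + C u^{2}/n\big),
\]
and minimizing the exponent at u = n · υ/(2C) (valid whenever υ ≤ 2 · C · c) gives
\[
\mathbb{P}(X \ge \upsilon) \;\le\; \exp\big(-n\upsilon^{2}/(4C)\big).
\]
Combining this with a union bound over the index pairs (j, k) and setting the total probability to δ yields a deviation level of order √(log(d2/δ)/n), which is of order √(log d/n) for δ = 2/d.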
C.8 Proof of Lemma 4.10
Proof
For any j, k ∈ {1, …, d}, by the mean-value theorem we have
| (C.78) |
where β♯ is an intermediate value between β and β*. According to (C.69), (C.70) and the definition of Tn(·) in (2.8), by calculation we have
where
| (C.79) |
For notational simplicity, we define the following event
where τ > 0 will be specified later. By maximal inequality we have
| (C.80) |
Let ℰ̅ be the complement of ℰ. On the right-hand side of (C.80) we have
| (C.81) |
Analysis of Term (i)
For term (i) in (C.81), we have
where the last inequality is from union bound. By (C.79) we have |ν̅β♯(yi)|≤16/σ4. Thus we obtain
Taking υ = 16 · τ3/σ4, we have that the right-hand side is zero and hence term (i) in (C.81) is zero.
Analysis of Term (ii)
Meanwhile, for term (ii) in (C.81) by maximal inequality we have
Furthermore, by (C.74) in the proof of Lemma 4.9, we have that yi,j is sub-Gaussian with . Therefore, by Lemma 5.5 in Vershynin (2010) we have
To ensure the right-hand side is upper bounded by δ, we set τ to be
| (C.82) |
Finally, by (C.80), (C.81) and maximal inequality we have
for υ = 16 · τ3/σ4 with τ specified in (C.82). By setting δ = 2 · d−4 and plugging (C.82) into (C.78), we conclude the proof of Lemma 4.10.
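A quick Monte Carlo check of the truncation step above: with τ of the order σ̃ · √(log(n · d/δ)) for σ̃2 = ‖β*‖∞2 + σ2, the event ℰ̅ = {max i,j |yi,j| > τ} essentially never occurs. The constant 2 and this particular form of τ are illustrative stand-ins for (C.82), which is not reproduced here.

import numpy as np

# Empirical frequency of the complement event E-bar = {max_{i,j} |y_{i,j}| > tau}
# under the Gaussian mixture model y_i = z_i * beta_star + noise.
rng = np.random.default_rng(1)
n, d, sigma = 500, 200, 1.0
delta = 2.0 * d ** -4                                    # delta = 2 * d^{-4}
beta_star = np.zeros(d); beta_star[:5] = 1.0
scale = np.sqrt(np.max(np.abs(beta_star)) ** 2 + sigma ** 2)
tau = 2.0 * scale * np.sqrt(np.log(n * d / delta))       # illustrative constant 2

reps, fails = 200, 0
for _ in range(reps):
    z = rng.choice([-1.0, 1.0], size=n)
    Y = z[:, None] * beta_star + sigma * rng.standard_normal((n, d))
    fails += np.abs(Y).max() > tau
print(f"empirical frequency of E-bar: {fails / reps:.3f}")  # close to zero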
C.9 Proof of Lemma 4.12
Proof
By the same argument as in the proof of Lemma 4.8 in §C.6, we obtain ζEM by invoking Theorem 3.10. To obtain ζG, note that for the gradient implementation of the M-step (Algorithm 3), we have Mn(β*) − M(β*) = η · [∇1Qn(β*; β*) − ∇1Q(β*; β*)].
Hence, we obtain ‖∇1Qn(β*; β*) − ∇1Q(β*; β*)‖∞ = 1/η · ‖Mn (β*) − M(β*)‖ ∞. Setting δ = 4/d and s = s* in (3.22) of Lemma 3.9, we then obtain ζG.
C.10 Proof of Lemma 4.13
Proof
Recall that for the mixture of regression model we have
where ωβ(·) is defined in (A.3). Hence, by calculation we have
| (C.83) |
| (C.84) |
For notational simplicity we define
| (C.85) |
Then by the definition of Tn(·) in (2.8), from (C.83) and (C.84) we have
Applying the symmetrization result in Lemma D.4 to the right-hand side, we have that for u > 0,
| (C.86) |
where ξ1, …, ξn are i.i.d. Rademacher random variables, which are independent of x1, …, xn and y1, …, yn. Then we invoke the contraction result in Lemma D.5 by setting
where u is the variable of the moment generating function in (C.86). By the definition in (C.85) we have |νβ*(xi, yi)| ≤ 4/σ2, which implies
Therefore, applying Lemma D.5 to the right-hand side of (C.86), we obtain
| (C.87) |
For notational simplicity, we define the following event
Let ℰ̅ be the complement of ℰ. We consider the following tail probability
| (C.88) |
Analysis of Term (i)
For term (i) in (C.88), we have
Here note that , because ξi is a Rademacher random variable independent of xi and yi. Recall that for the mixture of regression model we have yi = zi · 〈β*, xi〉 + υi, where zi is a Rademacher random variable, xi ~ N(0, Id) and υi ~ N(0, σ2). According to Example 5.8 in Vershynin (2010), we have ‖zi · 〈β*, xi〉 · 𝟙 {‖xi‖∞ ≤ τ}‖ψ2 = ‖〈β*, xi〉 · 𝟙 {‖xi‖∞ ≤ τ}‖ψ2 ≤ τ · ‖β*‖1 and ‖υi · 𝟙 {‖xi‖∞ ≤ τ}‖ψ2 ≤ ‖υi‖ψ2 ≤ C · σ. Hence, by Lemma D.1 we have
| (C.89) |
By the definition of ψ1-norm, we have . Further by applying Lemma D.2 to its right-hand side with Z1 = Z2 = yi · 𝟙 {‖xi‖∞≤τ}, we obtain
where the last inequality follows from (C.89). Therefore, by Bernstein’s inequality (Proposition 5.16 in Vershynin (2010)), we obtain
| (C.90) |
for all and a sufficiently large sample size n.
Analysis of Term (ii)
For term (ii) in (C.88), by union bound we have
Moreover, xi,j is sub-Gaussian with ‖xi,j‖ψ2 = C. Thus, by Lemma 5.5 in Vershynin (2010) we have
| (C.91) |
Plugging (C.90) and (C.91) into (C.88), we obtain
| (C.92) |
Note that (C.87) is obtained by applying Lemmas D.4 and D.5 with ϕ(υ) = exp(u·υ). Since Lemmas D.4 and D.5 allow any increasing convex function ϕ(·), the analog of (C.87) holds for every such ϕ(·). Hence, applying Panchenko’s theorem in Lemma D.6 to (C.87), from (C.92) we have
Furthermore, by union bound we have
| (C.93) |
To ensure the right-hand side is upper bounded by δ, we set the second term on the right-hand side of (C.93) to be δ/2. Then we obtain
Requiring the first term on the right-hand side of (C.93) to be upper bounded by δ/2 and plugging in τ, we then obtain
Therefore, by setting δ = 4 · e/d we have that
holds with probability at least 1 − 4 · e/d, which completes the proof of Lemma 4.13.
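As a companion to the Gaussian-mixture sketch in §C.6, here is a hedged numerical sketch of the mixture-of-regression weight ωβ(x, y) used throughout this proof, together with one maximization M-step built from it. The logistic form ωβ(x, y) = 1/(1 + exp(−2 · y · ⟨β, x⟩/σ2)) and the weighted-least-squares form of the M-step are assumptions consistent with the symmetric mixture of regression model; the authoritative definitions are (A.3) and the display recalled at the start of this proof, which are not reproduced here.

import numpy as np

def omega(X, y, beta, sigma):
    # Assumed form of omega_beta(x, y): posterior probability of the positive
    # label in the symmetric mixture of regression model.
    return 1.0 / (1.0 + np.exp(-2.0 * y * (X @ beta) / sigma ** 2))

def m_step_mr(X, y, beta, sigma):
    # Assumed maximization M-step: solve the weighted least-squares problem
    # M_n(beta) = ((1/n) sum_i x_i x_i^T)^{-1} (1/n) sum_i (2*omega - 1) y_i x_i.
    w = 2.0 * omega(X, y, beta, sigma) - 1.0
    A = X.T @ X / len(y)
    b = X.T @ (w * y) / len(y)
    return np.linalg.solve(A, b)

# Toy usage on simulated data y_i = z_i * <beta_star, x_i> + noise.
rng = np.random.default_rng(3)
n, d, sigma = 2000, 10, 0.5
beta_star = np.zeros(d); beta_star[:3] = 1.0
X = rng.standard_normal((n, d))
z = rng.choice([-1.0, 1.0], size=n)
y = z * (X @ beta_star) + sigma * rng.standard_normal(n)
print(m_step_mr(X, y, 0.5 * beta_star, sigma))  # the update moves toward beta_star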
C.11 Proof of Lemma 4.14
Proof
For any j, k ∈ {1, …, d}, by the mean-value theorem we have
| (C.94) |
where β♯ is an intermediate value between β and β*. According to (C.83), (C.84) and the definition of Tn(·) in (2.8), by calculation we have
where
| (C.95) |
For notational simplicity, we define the following events
where τ > 0 and τ′ > 0 will be specified later. By union bound we have
| (C.96) |
Let ℰ̅ and ℰ̅′ be the complement of ℰ and ℰ′ respectively. On the right-hand side we have
| (C.97) |
Analysis of Term (i)
For term (i) in (C.97), we have
To avoid confusion, note that υi is the noise in the mixture of regression model, while υ is the variable appearing in the tail bound. By applying union bound to the right-hand side of the above inequality, we have
By (C.95) we have |ν̅β♯(xi, yi)| ≤ 16/σ4. Hence, we obtain
| (C.98) |
Recall that in the mixture of regression model we have yi = zi · 〈β*, xi〉 + υi, where zi is a Rademacher random variable, xi ~ N(0, Id) and υi ~ N(0, σ2). Hence, we have
Taking υ = 16 · (τ · ‖β*‖1 + τ′)3 · τ3/σ4, we have that the right-hand side of (C.98) is zero. Hence term (i) in (C.97) is zero.
Analysis of Term (ii)
For term (ii) in (C.97), by union bound we have
Moreover, we have that xi,j is sub-Gaussian with ‖xi,j‖ψ2 = C. Therefore, by Lemma 5.5 in Vershynin (2010) we have
Analysis of Term (iii)
Since υi is sub-Gaussian with ‖υi‖ψ2 = C · σ, by Lemma 5.5 in Vershynin (2010) and union bound, for term (iii) in (C.97) we have
To ensure the right-hand side of (C.97) is upper bounded by δ, we set τ and τ′ to be
| (C.99) |
so that terms (ii) and (iii) are upper bounded by δ/2, respectively. Finally, by (C.96), (C.97) and union bound we have
for υ = 16 · (τ · ‖β*‖1 + τ′)3 · τ3/σ4 with τ and τ′ specified in (C.99). Then by setting δ = 4 · d−4 and plugging it into (C.99), we have
which together with (C.94) concludes the proof of Lemma 4.14.
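For reference, the mean-value step invoked in (C.78) and (C.94) is, in generic form, the following entrywise bound; stating it this way is an assumption about the form of those displays, which are not reproduced above.
\[
\big|[T_n(\beta)]_{j,k} - [T_n(\beta^{*})]_{j,k}\big|
= \big|\big\langle \nabla_{\beta}\,[T_n(\beta^{\sharp})]_{j,k},\ \beta - \beta^{*}\big\rangle\big|
\le \big\|\nabla_{\beta}\,[T_n(\beta^{\sharp})]_{j,k}\big\|_{\infty}\cdot\|\beta-\beta^{*}\|_{1},
\]
where the equality is the mean-value theorem applied to the scalar function β ↦ [Tn(β)]j,k and the inequality is Hölder’s inequality; the subsequent analysis then bounds the gradient’s sup-norm via the truncation argument.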
D Auxiliary Results
In this section, we lay out several auxiliary lemmas. Lemmas D.1–D.3 provide useful properties of sub-Gaussian random variables. Lemmas D.4 and D.5 establish the symmetrization and contraction results. Lemma D.6 is Panchenko’s theorem. For more details of these results, see Vershynin (2010); Boucheron et al. (2013).
Lemma D.1
Let Z1, …, Zk be k independent zero-mean sub-Gaussian random variables. Then Z1 + ⋯ + Zk is sub-Gaussian with ‖Z1 + ⋯ + Zk‖ψ2² ≤ C · (‖Z1‖ψ2² + ⋯ + ‖Zk‖ψ2²), where C > 0 is an absolute constant.
Lemma D.2
Let Z1 and Z2 be two sub-Gaussian random variables. Then Z1 · Z2 is a sub-exponential random variable with ‖Z1 · Z2‖ψ1 ≤ C · ‖Z1‖ψ2 · ‖Z2‖ψ2, where C > 0 is an absolute constant.
Lemma D.3
If Z is sub-Gaussian or sub-exponential, then it holds that ‖Z − 𝔼Z‖ψ2 ≤ 2 · ‖Z‖ψ2 or ‖Z − 𝔼Z‖ψ1 ≤ 2 · ‖Z‖ψ1, respectively.
Lemma D.4
Let z1, …, zn be n independent realizations of the random vector Z ∈ 𝒵 and let ℱ be a function class defined on 𝒵. For any increasing convex function ϕ(·) we have
where ξ1, …, ξn are i.i.d. Rademacher random variables that are independent of z1, …, zn.
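A small Monte Carlo illustration of Lemma D.4, assuming it takes the standard symmetrization form 𝔼[ϕ(sup f∈ℱ |(1/n) · ∑i (f(zi) − 𝔼 f(Z))|)] ≤ 𝔼[ϕ(2 · sup f∈ℱ |(1/n) · ∑i ξi · f(zi)|)]. Here ϕ(t) = t (increasing and convex) and ℱ is the finite class fj(z) = zj for j = 1, …, d; both choices are illustrative.

import numpy as np

# Compare E sup_j |(1/n) sum_i (z_{i,j} - E z_j)| with
# 2 * E sup_j |(1/n) sum_i xi_i * z_{i,j}| by simulation.
rng = np.random.default_rng(2)
n, d, reps = 100, 50, 2000
mu = np.linspace(-1.0, 1.0, d)          # E[f_j(Z)] = mu_j
lhs = rhs = 0.0
for _ in range(reps):
    Z = mu + rng.standard_normal((n, d))
    xi = rng.choice([-1.0, 1.0], size=n)
    lhs += np.abs(Z.mean(axis=0) - mu).max()
    rhs += np.abs((xi[:, None] * Z).mean(axis=0)).max()
print(f"centered sup-norm: {lhs / reps:.4f}  <=  symmetrized bound: {2 * rhs / reps:.4f}")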
Lemma D.5
Let z1, …, zn be n independent realizations of the random vector Z ∈ 𝒵 and let ℱ be a function class defined on 𝒵. We consider Lipschitz functions ψi(·) (i = 1, …, n) that satisfy
and ψi(0) = 0. For any increasing convex function ϕ(·) we have
where ξ1, …, ξn are i.i.d. Rademacher random variables that are independent of z1, …, zn.
Lemma D.6
Suppose that Z1 and Z2 are two random variables that satisfy 𝔼[ϕ(Z1)] ≤ 𝔼[ϕ(Z2)] for any increasing convex function ϕ(·). Assuming that ℙ(Z2 ≥ υ) ≤ C · exp(−C′ · υα) (α ≥ 1) holds for all υ ≥ 0, we have ℙ(Z1 ≥ υ) ≤ C · exp(1 − C′ · υα).
References
- Anandkumar A, Ge R, Hsu D, Kakade SM, Telgarsky M. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research. 2014a;15:2773–2832.
- Anandkumar A, Ge R, Janzamin M. Analyzing tensor power method dynamics: Applications to learning overcomplete latent variable models. arXiv preprint arXiv:1411.1488. 2014b.
- Anandkumar A, Ge R, Janzamin M. Provable learning of overcomplete latent variable models: Semi-supervised and unsupervised settings. arXiv preprint arXiv:1408.0553. 2014c.
- Balakrishnan S, Wainwright MJ, Yu B. Statistical guarantees for the EM algorithm: From population to sample-based analysis. arXiv preprint arXiv:1408.2156. 2014.
- Bartholomew DJ, Knott M, Moustaki I. Latent variable models and factor analysis: A unified approach. Vol. 899. Wiley; 2011.
- Belloni A, Chen D, Chernozhukov V, Hansen C. Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica. 2012;80:2369–2429.
- Belloni A, Chernozhukov V, Hansen C. Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies. 2014;81:608–650.
- Belloni A, Chernozhukov V, Wei Y. Honest confidence regions for a regression parameter in logistic regression with a large number of controls. arXiv preprint arXiv:1304.3969. 2013.
- Bickel PJ. One-step Huber estimates in the linear model. Journal of the American Statistical Association. 1975;70:428–434.
- Bickel PJ, Ritov Y, Tsybakov AB. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics. 2009;37:1705–1732.
- Boucheron S, Lugosi G, Massart P. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press; 2013.
- Cai T, Liu W, Luo X. A constrained ℓ1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association. 2011;106:594–607.
- Candès E, Tao T. The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics. 2007;35:2313–2351.
- Chaganty AT, Liang P. Spectral experts for estimating mixtures of linear regressions. arXiv preprint arXiv:1306.3729. 2013.
- Chaudhuri K, Dasgupta S, Vattani A. Learning mixtures of Gaussians using the k-means algorithm. arXiv preprint arXiv:0912.0086. 2009.
- Chrétien S, Hero AO. On EM algorithms and their proximal generalizations. ESAIM: Probability and Statistics. 2008;12:308–326.
- Dasgupta S, Schulman L. A probabilistic analysis of EM for mixtures of separated, spherical Gaussians. Journal of Machine Learning Research. 2007;8:203–226.
- Dasgupta S, Schulman LJ. A two-round variant of EM for Gaussian mixtures. In: Uncertainty in Artificial Intelligence; 2000.
- Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Statistical Methodology). 1977;39:1–38.
- Javanmard A, Montanari A. Confidence intervals and hypothesis testing for high-dimensional regression. Journal of Machine Learning Research. 2014;15:2869–2909.
- Khalili A, Chen J. Variable selection in finite mixture of regression models. Journal of the American Statistical Association. 2007;102:1025–1038.
- Knight K, Fu W. Asymptotics for Lasso-type estimators. The Annals of Statistics. 2000;28:1356–1378.
- Lee JD, Sun DL, Sun Y, Taylor JE. Exact inference after model selection via the Lasso. arXiv preprint arXiv:1311.6238. 2013.
- Lockhart R, Taylor J, Tibshirani RJ, Tibshirani R. A significance test for the Lasso. The Annals of Statistics. 2014;42:413–468. doi: 10.1214/13-AOS1175.
- Loh P-L, Wainwright MJ. Corrupted and missing predictors: Minimax bounds for high-dimensional linear regression. In: IEEE International Symposium on Information Theory; 2012.
- McLachlan G, Krishnan T. The EM algorithm and extensions. Vol. 382. Wiley; 2007.
- Meinshausen N, Bühlmann P. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2010;72:417–473.
- Meinshausen N, Meier L, Bühlmann P. p-values for high-dimensional regression. Journal of the American Statistical Association. 2009;104:1671–1681.
- Nesterov Y. Introductory lectures on convex optimization: A basic course. Vol. 87. Springer; 2004.
- Nickl R, van de Geer S. Confidence sets in sparse regression. The Annals of Statistics. 2013;41:2852–2876.
- Städler N, Bühlmann P, van de Geer S. ℓ1-penalization for mixture regression models. TEST. 2010;19:209–256.
- Taylor J, Lockhart R, Tibshirani RJ, Tibshirani R. Post-selection adaptive inference for least angle regression and the Lasso. arXiv preprint arXiv:1401.3889. 2014.
- Tseng P. An analysis of the EM algorithm and entropy-like proximal point methods. Mathematics of Operations Research. 2004;29:27–44.
- van de Geer S, Bühlmann P, Ritov Y, Dezeure R. On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics. 2014;42:1166–1202.
- van der Vaart AW. Asymptotic statistics. Vol. 3. Cambridge University Press; 2000.
- Vershynin R. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027. 2010.
- Wasserman L, Roeder K. High-dimensional variable selection. The Annals of Statistics. 2009;37:2178–2201. doi: 10.1214/08-aos646.
- Wu CFJ. On the convergence properties of the EM algorithm. The Annals of Statistics. 1983;11:95–103.
- Yi X, Caramanis C, Sanghavi S. Alternating minimization for mixed linear regression. arXiv preprint arXiv:1310.3745. 2013.
- Yuan X-T, Li P, Zhang T. Gradient hard thresholding pursuit for sparsity-constrained optimization. arXiv preprint arXiv:1311.5750. 2013.
- Yuan X-T, Zhang T. Truncated power method for sparse eigenvalue problems. Journal of Machine Learning Research. 2013;14:899–925.
- Zhang C-H, Zhang SS. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2014;76:217–242.


