Author manuscript; available in PMC: 2017 Jun 12.
Published in final edited form as: Adv Neural Inf Process Syst. 2015;28:2512–2520.

High Dimensional EM Algorithm: Statistical Optimization and Asymptotic Normality

Zhaoran Wang *, Quanquan Gu *, Yang Ning *, Han Liu *
PMCID: PMC5467221  NIHMSID: NIHMS752348  PMID: 28615917

Abstract

We provide a general theory of the expectation-maximization (EM) algorithm for inferring high dimensional latent variable models. In particular, we make two contributions: (i) For parameter estimation, we propose a novel high dimensional EM algorithm which naturally incorporates sparsity structure into parameter estimation. With an appropriate initialization, this algorithm converges at a geometric rate and attains an estimator with the (near-)optimal statistical rate of convergence. (ii) Based on the obtained estimator, we propose new inferential procedures for testing hypotheses and constructing confidence intervals for low dimensional components of high dimensional parameters. For a broad family of statistical models, our framework establishes the first computationally feasible approach for optimal estimation and asymptotic inference in high dimensions. Our theory is supported by thorough numerical results.

1 Introduction

The expectation-maximization (EM) algorithm (Dempster et al., 1977) is the most popular approach for calculating the maximum likelihood estimator of latent variable models. Nevertheless, due to the nonconcavity of the likelihood function of latent variable models, the EM algorithm generally only converges to a local maximum rather than the global one (Wu, 1983). On the other hand, existing statistical guarantees for latent variable models are only established for global optima (Bartholomew et al., 2011). Therefore, there exists a gap between computation and statistics.

Significant progress has been made toward closing the gap between the local maximum attained by the EM algorithm and the maximum likelihood estimator (Wu, 1983; Tseng, 2004; McLachlan and Krishnan, 2007; Chrétien and Hero, 2008; Balakrishnan et al., 2014). In particular, Wu (1983) first establishes general sufficient conditions for the convergence of the EM algorithm. Tseng (2004); Chrétien and Hero (2008) further improve this result by viewing the EM algorithm as a proximal point method applied to the Kullback-Leibler divergence. See McLachlan and Krishnan (2007) for a detailed survey. More recently, Balakrishnan et al. (2014) establish the first result that characterizes explicit statistical and computational rates of convergence for the EM algorithm. They prove that, given a suitable initialization, the EM algorithm converges at a geometric rate to a local maximum close to the maximum likelihood estimator. All these results are established in the low dimensional regime where the dimension d is much smaller than the sample size n.

In high dimensional regimes where the dimension d is much larger than the sample size n, there exists no theoretical guarantee for the EM algorithm. In fact, when d ≫ n, the maximum likelihood estimator is in general not well defined, unless the models are carefully regularized by sparsity-type assumptions. Furthermore, even if a regularized maximum likelihood estimator can be obtained in a computationally tractable manner, establishing the corresponding statistical properties, especially asymptotic normality, can still be challenging because of the existence of high dimensional nuisance parameters. To address such a challenge, we develop a general inferential theory of the EM algorithm for parameter estimation and uncertainty assessment of high dimensional latent variable models. In particular, we make two contributions in this paper:

  • For high dimensional parameter estimation, we propose a novel high dimensional EM algorithm by attaching a truncation step to the expectation step (E-step) and maximization step (M-step). Such a truncation step effectively enforces the sparsity of the attained estimator and allows us to establish a significantly improved statistical rate of convergence.

  • Based upon the estimator attained by the high dimensional EM algorithm, we propose a family of decorrelated score and Wald statistics for testing hypotheses for low dimensional components of the high dimensional parameter. The decorrelated Wald statistic can be further used to construct optimal valid confidence intervals for low dimensional parameters of interest.

Under a unified analytic framework, we establish simultaneous statistical and computational guarantees for the proposed high dimensional EM algorithm and the respective uncertainty assessment procedures. Let β* ∈ ℝd be the true parameter, s* be its sparsity level and {β(t)}_{t=0}^T be the iterative solution sequence of the high dimensional EM algorithm with T being the total number of iterations. In particular, we prove that:

  • Given an appropriate initialization βinit with relative error upper bounded by a constant κ ∈ (0, 1), i.e., ‖βinit − β*‖2/‖β*‖2 ≤ κ, the iterative solution sequence {β(t)}_{t=0}^T satisfies
    ‖β(t) − β*‖2 ≤ Δ1·ρ^{t/2} (optimization error) + Δ2·√(s*·log d/n) (statistical error, the optimal rate) (1.1)
    with high probability. Here ρ ∈ (0, 1), and Δ1, Δ2 are quantities that possibly depend on ρ, κ and β*. As the optimization error term in (1.1) decreases to zero at a geometric rate with respect to t, the overall estimation error achieves the √(s*·log d/n) statistical rate of convergence (up to an extra factor of √(log n)), which is (near-)minimax-optimal. See Theorem 3.4 and the corresponding discussion for details.
  • The proposed decorrelated score and Wald statistics are asymptotically normal. Moreover, their limiting variances and the size of the respective confidence interval are optimal in the sense that they attain the semiparametric information bound for the low dimensional components of interest in the presence of high dimensional nuisance parameters. See Theorems 4.6 and 4.7 for details.

Our framework allows two implementations of the M-step: the exact maximization versus approximate maximization. The former one calculates the maximizer exactly, while the latter one conducts an approximate maximization through a gradient ascent step. Our framework is quite general. We illustrate its effectiveness by applying it to three high dimensional latent variable models, including Gaussian mixture model, mixture of regression model and regression with missing covariates.

Comparison with Related Work

A closely related work is by Balakrishnan et al. (2014), which considers the low dimensional regime where d is much smaller than n. Under certain initialization conditions, they prove that the EM algorithm converges at a geometric rate to some local optimum that attains the √(d/n) statistical rate of convergence. They cover both maximization and gradient ascent implementations of the M-step, and establish the consequences for the three latent variable models considered in our paper under low dimensional settings. Our framework adopts their view of treating the EM algorithm as a perturbed version of gradient methods. However, to handle the challenge of high dimensionality, the key ingredient of our framework is the truncation step that enforces the sparsity structure along the solution path. Such a truncation operation poses significant challenges for both computational and statistical analysis. In detail, for computational analysis we need to carefully characterize the evolution of each intermediate solution’s support and its effects on the evolution of the entire iterative solution sequence. For statistical analysis, we need to establish a fine-grained characterization of the entrywise statistical error, which is technically more challenging than just establishing the ℓ2-norm error employed by Balakrishnan et al. (2014). In high dimensional regimes, we need to establish the √(s*·log d/n) statistical rate of convergence, which is much sharper than their √(d/n) rate when d ≫ n. In addition to point estimation, we further construct confidence intervals and hypothesis tests for latent variable models in the high dimensional regime, which have not been established before.

High dimensionality poses significant challenges for assessing the uncertainty (e.g., constructing confidence intervals and testing hypotheses) of the constructed estimators. For example, Knight and Fu (2000) show that the limiting distribution of the Lasso estimator is not Gaussian even in the low dimensional regime. A variety of approaches have been proposed to correct the Lasso estimator to attain asymptotic normality, including the debiasing method (Javanmard and Montanari, 2014), the desparsification methods (Zhang and Zhang, 2014; van de Geer et al., 2014) as well as instrumental variable-based methods (Belloni et al., 2012, 2013, 2014). Meanwhile, Lockhart et al. (2014); Taylor et al. (2014); Lee et al. (2013) propose the post-selection procedures for exact inference. In addition, several authors propose methods based on data splitting (Wasserman and Roeder, 2009; Meinshausen et al., 2009), stability selection (Meinshausen and Bühlmann, 2010) and ℓ2-confidence sets (Nickl and van de Geer, 2013). However, these approaches mainly focus on generalized linear models rather than latent variable models. In addition, their results heavily rely on the fact that the estimator is a global optimum of a convex program. In comparison, our approach applies to a much broader family of statistical models with latent structures. For these latent variable models, it is computationally infeasible to obtain the global maximum of the penalized likelihood due to the nonconcavity of the likelihood function. Unlike existing approaches, our inferential theory is developed for the estimator attained by the proposed high dimensional EM algorithm, which is not necessarily a global optimum to any optimization formulation.

Another line of research for the estimation of latent variable models is the tensor method, which exploits the structures of third or higher order moments. See Anandkumar et al. (2014a,b,c) and the references therein. However, existing tensor methods primarily focus on the low dimensional regime where d ≪ n. In addition, since the high order sample moments generally have a slow statistical rate of convergence, the estimators obtained by the tensor methods usually have a suboptimal statistical rate even for d ≪ n. For example, Chaganty and Liang (2013) establish the √(d⁶/n) statistical rate of convergence for mixture of regression model, which is suboptimal compared with the √(d/n) minimax lower bound. Similarly, in high dimensional settings, the statistical rates of convergence attained by tensor methods are significantly slower than the statistical rate obtained in this paper.

The three latent variable models considered in this paper have been well studied. Nevertheless, only a few works establish theoretical guarantees for the EM algorithm. In particular, for Gaussian mixture model, Dasgupta and Schulman (2000, 2007); Chaudhuri et al. (2009) establish parameter estimation guarantees for the EM algorithm and its extensions. For mixture of regression model, Yi et al. (2013) establish exact parameter recovery guarantees for the EM algorithm under a noiseless setting. For high dimensional mixture of regression model, Städler et al. (2010) analyze the gradient EM algorithm for the ℓ1-penalized log-likelihood. They establish support recovery guarantees for the attained local optimum but have no parameter estimation guarantees. In comparison with existing works, this paper establishes a general inferential framework for simultaneous parameter estimation and uncertainty assessment based on a novel high dimensional EM algorithm. Our analysis provides the first theoretical guarantee of parameter estimation and asymptotic inference in high dimensional regimes for the EM algorithm and its applications to a broad family of latent variable models.

Notation

Let A = [Ai,j] ∈ ℝd×d and v = (v1, …, vd) ∈ ℝd. We define the ℓq-norm (q ≥ 1) of v as ‖v‖q = (Σ_{j=1}^d |vj|^q)^{1/q}. Particularly, ‖v‖0 denotes the number of nonzero entries of v. For q ≥ 1, we define ‖A‖q as the operator norm of A. Specifically, ‖A‖2 is the spectral norm. For a set 𝒮, |𝒮| denotes its cardinality. We denote the d×d identity matrix as Id. For index sets ℐ, 𝒥 ⊆ {1, …, d}, we define Aℐ,𝒥 ∈ ℝd×d to be the matrix whose (i, j)-th entry is equal to Ai,j if i ∈ ℐ and j ∈ 𝒥, and zero otherwise. We define vℐ similarly. We denote ⊗ and ⊙ to be the outer product and Hadamard product between vectors. The matrix (p, q)-norm, i.e., ‖A‖p,q, is obtained by taking the ℓp-norm of each row and then taking the ℓq-norm of the obtained row norms. Let supp(v) be the support of v, i.e., the index set of its nonzero entries. We use C, C′, … to denote generic absolute constants. The values of these constants may vary from line to line. In addition, we denote ‖·‖ψq (q ≥ 1) to be the Orlicz norm of random variables. We will introduce more notations in §2.2.

The rest of the paper is organized as follows. In §2 we present the high dimensional EM algorithm and the corresponding procedures for inference, and then apply them to three latent variable models. In §3 and §4, we establish the main theoretical results for computation, parameter estimation and asymptotic inference, as well as their implications for specific latent variable models. In §5 we sketch the proof of the main results. In §6 we present the numerical results. In §7 we conclude the paper.

2 Methodology

We first introduce the high dimensional EM Algorithm. Then we present the respective procedures for asymptotic inference. Finally, we apply the proposed methods to three latent variable models.

2.1 High Dimensional EM Algorithm

Before we introduce the proposed high dimensional EM Algorithm (Algorithm 1), we briefly review the classical EM algorithm. Let hβ(y) be the probability density function of Y ∈ 𝒴, where β ∈ ℝd is the model parameter. For latent variable models, we assume that hβ(y) is obtained by marginalizing over an unobserved latent variable Z ∈ 𝒵, i.e.,

hβ(y) = ∫𝒵 fβ(y, z) dz. (2.1)

Given the n observations y1, …, yn of Y, the EM algorithm aims at maximizing the log-likelihood

ℓn(β) = Σ_{i=1}^n log hβ(yi). (2.2)

Due to the unobserved latent variable Z, it is difficult to directly evaluate ℓn(β). Instead, we turn to consider the difference between ℓn(β) and ℓn(β′). Let kβ(z | y) be the density of Z conditioning on the observed variable Y = y, i.e.,

kβ(z|y)=fβ(y,z)/hβ(y). (2.3)

According to (2.1) and (2.2), we have

1/n·[ℓn(β) − ℓn(β′)] = 1/n·Σ_{i=1}^n log[hβ(yi)/hβ′(yi)] = 1/n·Σ_{i=1}^n log[∫𝒵 fβ(yi, z)/hβ′(yi) dz] = 1/n·Σ_{i=1}^n log[∫𝒵 kβ′(z | yi)·fβ(yi, z)/fβ′(yi, z) dz] ≥ 1/n·Σ_{i=1}^n ∫𝒵 kβ′(z | yi)·log[fβ(yi, z)/fβ′(yi, z)] dz, (2.4)

where the third equality follows from (2.3) and the inequality is obtained from Jensen’s inequality. On the right-hand side of (2.4) we have

1/n·Σ_{i=1}^n ∫𝒵 kβ′(z | yi)·log[fβ(yi, z)/fβ′(yi, z)] dz = 1/n·Σ_{i=1}^n ∫𝒵 kβ′(z | yi)·log fβ(yi, z) dz − 1/n·Σ_{i=1}^n ∫𝒵 kβ′(z | yi)·log fβ′(yi, z) dz. (2.5)

We define the first term on the right-hand side of (2.5) to be Qn(β; β′). Correspondingly, we define its expectation to be Q(β; β′). Note the second term on the right-hand side of (2.5) doesn’t depend on β. Hence, given some fixed β′, we can maximize the lower bound function Qn(β; β′) over β to obtain a sufficiently large ℓn(β) − ℓn(β′). Based on such an observation, at the t-th iteration of the classical EM algorithm, we evaluate Qn(β; β(t)) at the E-step and then perform maxβ Qn(β; β(t)) at the M-step. See McLachlan and Krishnan (2007) for more details.

Algorithm 1.

High Dimensional EM Algorithm

1: Parameter: Sparsity Parameter ŝ, Maximum Number of Iterations T
2: Initialization: 𝒮̂init ← supp(βinit, ŝ), β(0) ← trunc(βinit, 𝒮̂init)
{supp(·, ·) and trunc(·, ·) are defined in (2.6) and (2.7)}
3: For t = 0 to T − 1
4:   E-step: Evaluate Qn(β; β(t))
5:   M-step: β(t+0.5) ← Mn(β(t)) {Mn(·) is implemented as in Algorithm 2 or 3}
6:   T-step: 𝒮̂(t+0.5) ← supp(β(t+0.5), ŝ), β(t+1) ← trunc(β(t+0.5), 𝒮̂(t+0.5))
7: End For
8: Output: β̂ ← β(T)

The proposed high dimensional EM algorithm (Algorithm 1) is built upon the E-step and M-step (lines 4 and 5) of the classical EM algorithm. In addition to the exact maximization implementation of the M-step (Algorithm 2), we allow the gradient ascent implementation of the M-step (Algorithm 3), which performs an approximate maximization via a gradient ascent step. To handle the challenge of high dimensionality, in line 6 of Algorithm 1 we perform a truncation step (T-step) to enforce the sparsity structure. In detail, we define the supp(·, ·) function in line 6 as

supp(β, s): the set of indices j corresponding to the top s largest |βj|’s. (2.6)

Also, for an index set 𝒮 ⊆ {1, …, d}, we define the trunc(·, ·) function in line 6 as

[trunc(β, 𝒮)]j = βj for j ∈ 𝒮, and 0 for j ∉ 𝒮. (2.7)

Note that β(t+0.5) is the output of the M-step (line 5) at the t-th iteration of the high dimensional EM algorithm. To obtain β(t+1), the T-step (line 6) preserves the entries of β(t+0.5) with the ŝ largest magnitudes and sets the rest to zero. Here ŝ is a tuning parameter that controls the sparsity level (line 1). By iteratively performing the E-step, M-step and T-step, the high dimensional EM algorithm attains an ŝ-sparse estimator β̂ = β(T) (line 8). Here T is the total number of iterations.

It is worth noting that, the truncation strategy employed here and its variants are widely adopted in the context of sparse linear regression and sparse principal component analysis. For example, see Yuan and Zhang (2013); Yuan et al. (2013) and the references therein. Nevertheless, we incorporate this truncation strategy into the EM algorithm for the first time. Also, our analysis is significantly different from existing works.
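To make the interaction of the E-step, M-step and T-step concrete, the following minimal Python sketch mirrors Algorithm 1 (it is our own illustration, not the authors' implementation). The functions supp and trunc implement (2.6) and (2.7); the model-specific E-step and M-step are hidden inside a user-supplied m_step callable, a hypothetical stand-in for Algorithm 2 or 3 below.

    import numpy as np

    def supp(beta, s):
        # Index set of the s entries of beta with largest magnitude, cf. (2.6).
        return np.argsort(np.abs(beta))[-s:]

    def trunc(beta, S):
        # Keep the entries indexed by S and set the rest to zero, cf. (2.7).
        out = np.zeros_like(beta)
        out[S] = beta[S]
        return out

    def high_dim_em(beta_init, m_step, s_hat, T):
        # Algorithm 1: alternate the (E-step + M-step), encapsulated in m_step, with the T-step.
        beta = trunc(beta_init, supp(beta_init, s_hat))
        for _ in range(T):
            beta_half = m_step(beta)                           # E-step and M-step
            beta = trunc(beta_half, supp(beta_half, s_hat))    # T-step
        return beta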

Algorithm 2.

Maximization Implementation of the M-step

1: Input: β(t), Qn(β; β(t))
2: Output: Mn(β(t)) ← argmaxβ Qn(β; β(t))

Algorithm 3.

Gradient Ascent Implementation of the M-step

1: Input: β(t), Qn(β; β(t))
2: Parameter: Stepsize η > 0
3: Output: Mn(β(t)) ← β(t) + η · ∇Qn(β(t); β(t))
{The gradient is taken with respect to the first β(t) in Qn(β(t); β(t))}
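As a schematic illustration only (not the paper's implementation), the two M-step variants can be expressed as callables compatible with the sketch above. Here Qn and grad1_Qn are hypothetical model-specific functions supplied by the user; for the models in §2.3 the maximizer has a closed form (see §A.1–A.3), so the generic numerical optimizer below would be replaced by that closed form.

    from scipy.optimize import minimize

    def make_exact_m_step(Qn):
        # Algorithm 2: beta^(t+0.5) = argmax_beta Qn(beta; beta^(t)).
        def m_step(beta_t):
            res = minimize(lambda b: -Qn(b, beta_t), x0=beta_t, method="BFGS")
            return res.x
        return m_step

    def make_gradient_m_step(grad1_Qn, eta):
        # Algorithm 3: one gradient ascent step on Qn(.; beta^(t)), evaluated at beta^(t).
        def m_step(beta_t):
            return beta_t + eta * grad1_Qn(beta_t, beta_t)
        return m_step

Either make_exact_m_step(Qn) or make_gradient_m_step(grad1_Qn, eta) can then be passed as the m_step argument of high_dim_em above.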

2.2 Asymptotic Inference

In the sequel, we first introduce some additional notations. Then we present the proposed methods for asymptotic inference in high dimensions.

Notation

Let ∇1Q(β; β′) be the gradient with respect to β and ∇2Q(β; β′) be the gradient with respect to β′. If there is no confusion, we simply denote ∇Q(β; β′) = ∇1Q(β; β′) as in the previous sections. We define the higher order derivatives in the same manner, e.g., ∇²1,2Q(β; β′) is calculated by first taking the derivative with respect to β and then with respect to β′. For β = (β1, β2) ∈ ℝd with β1 ∈ ℝd1, β2 ∈ ℝd2 and d1 + d2 = d, we use notations such as vβ1 ∈ ℝd1 and Aβ1,β2 ∈ ℝd1×d2 to denote the corresponding subvector of v ∈ ℝd and the submatrix of A ∈ ℝd×d.

We aim to conduct asymptotic inference for low dimensional components of the high dimensional parameter β*. Without loss of generality, we consider a single entry of β*. In particular, we assume β* = [α*, (γ*)], where α* ∈ ℝ is the entry of interest, while γ* ∈ ℝd−1 is treated as the nuisance parameter. In the following, we construct two hypothesis tests, namely, the decorrelated score and Wald tests. Based on the decorrelated Wald test, we further construct valid confidence interval for α*. It is worth noting that, our method and theory can be easily generalized to perform statistical inference for an arbitrary low dimensional subvector of β*.

Decorrelated Score Test

For the score test, we are primarily interested in testing H0 : α* = 0, since this null hypothesis characterizes the uncertainty in variable selection. Our method easily generalizes to H0 : α* = α0 with α0 ≠ 0. For notational simplicity, we define the following key quantity

Tn(β) = ∇²1,1Qn(β; β) + ∇²1,2Qn(β; β) ∈ ℝd×d. (2.8)

Let β = (α, γ). We define the decorrelated score function Sn(·, ·) ∈ ℝ as

Sn(β, λ) = [∇1Qn(β; β)]α − w(β, λ)^⊤·[∇1Qn(β; β)]γ. (2.9)

Here w(β, λ) ∈ ℝd−1 is obtained using the following Dantzig selector (Candès and Tao, 2007)

w(β, λ) = argmin_{w ∈ ℝd−1} ‖w‖1, subject to ‖[Tn(β)]γ,α − [Tn(β)]γ,γ·w‖∞ ≤ λ, (2.10)

where λ > 0 is a tuning parameter. Let β̂ = (α̂, γ̂), where β̂ is the estimator attained by the high dimensional EM algorithm (Algorithm 1). We define the decorrelated score statistic as

√n·Sn(β̂0, λ)/{−[Tn(β̂0)]α|γ}^{1/2}, where β̂0 = (0, γ̂), and [Tn(β̂0)]α|γ = [1, −w(β̂0, λ)^⊤]·Tn(β̂0)·[1, −w(β̂0, λ)^⊤]^⊤. (2.11)

Here we use β̂0 instead of β̂ since we are interested in the null hypothesis H0 : α* = 0. We can also replace β̂0 with β̂ and the theoretical results will remain the same. In §4 we will prove the proposed decorrelated score statistic in (2.11) is asymptotically N(0, 1). Consequently, the decorrelated score test with significance level δ ∈ (0, 1) takes the form

ψS(δ) = 𝟙{√n·Sn(β̂0, λ)/{−[Tn(β̂0)]α|γ}^{1/2} ∉ [−Φ^{-1}(1 − δ/2), Φ^{-1}(1 − δ/2)]},

where Φ−1(·) is the inverse function of the Gaussian cumulative distribution function. If ψS(δ) = 1, we reject the null hypothesis H0 : α* = 0. Correspondingly, the associated p-value takes the form

p-value = 2·[1 − Φ(|√n·Sn(β̂0, λ)/{−[Tn(β̂0)]α|γ}^{1/2}|)].
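As a rough computational illustration (a sketch under our own conventions, not the authors' code), the Dantzig selector (2.10) can be solved as a linear program, after which the decorrelated score statistic (2.11) and its p-value are assembled directly. Here grad_Qn stands for ∇1Qn(β̂0; β̂0) and T_n for Tn(β̂0), both assumed to be supplied by the model, with the coordinate of interest placed first.

    import numpy as np
    from scipy.optimize import linprog
    from scipy.stats import norm

    def dantzig_selector(T_gg, T_ga, lam):
        # Solve (2.10): minimize ||w||_1 subject to ||T_ga - T_gg w||_inf <= lam,
        # via the standard LP reformulation with variables x = [w, u] and |w_j| <= u_j.
        p = T_gg.shape[0]
        c = np.concatenate([np.zeros(p), np.ones(p)])         # objective: sum of u
        I = np.eye(p)
        A_ub = np.vstack([
            np.hstack([I, -I]),                               #  w - u <= 0
            np.hstack([-I, -I]),                              # -w - u <= 0
            np.hstack([T_gg, np.zeros((p, p))]),              #  T_gg w <= T_ga + lam
            np.hstack([-T_gg, np.zeros((p, p))]),             # -T_gg w <= lam - T_ga
        ])
        b_ub = np.concatenate([np.zeros(2 * p), T_ga + lam, lam - T_ga])
        bounds = [(None, None)] * p + [(0, None)] * p
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
        return res.x[:p]

    def decorrelated_score_test(grad_Qn, T_n, n, lam):
        # Decorrelated score statistic (2.11) and its two-sided p-value.
        w = dantzig_selector(T_n[1:, 1:], T_n[1:, 0], lam)
        S_n = grad_Qn[0] - w @ grad_Qn[1:]                    # decorrelated score (2.9)
        v = np.concatenate([[1.0], -w])
        T_cond = v @ T_n @ v                                  # [Tn(beta0_hat)]_{alpha|gamma}
        stat = np.sqrt(n) * S_n / np.sqrt(-T_cond)
        return stat, 2.0 * (1.0 - norm.cdf(abs(stat)))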

The intuition for the decorrelated score statistic in (2.11) can be understood as follows. Since ℓn(β) is the log-likelihood, its score function is ∇ℓn(β) and the Fisher information at β* is I(β*) = −𝔼β*[∇²ℓn(β*)]/n, where 𝔼β*(·) means the expectation is taken under the model with parameter β*. The following lemma reveals the connection of ∇1Qn(·; ·) in (2.9) and Tn(·) in (2.11) with the score function and Fisher information.

Lemma 2.1

For the true parameter β* and any β ∈ ℝd, it holds that

∇1Qn(β; β) = ∇ℓn(β)/n, and 𝔼β*[Tn(β*)] = −I(β*) = 𝔼β*[∇²ℓn(β*)]/n. (2.12)
Proof

See §C.1 for details.

Recall that the log-likelihood ℓn(β) defined in (2.2) is difficult to evaluate due to the unobserved latent variable. Lemma 2.1 provides a feasible way to calculate or estimate the corresponding score function and Fisher information, since Qn(·; ·) and Tn(·) have closed forms. The geometric intuition behind Lemma 2.1 can be understood as follows. By (2.4) and (2.5) we have

ℓn(β) ≥ ℓn(β′) + n·Qn(β; β′) − Σ_{i=1}^n ∫𝒵 kβ′(z | yi)·log fβ′(yi, z) dz. (2.13)

By (2.12), both sides of (2.13) have the same gradient with respect to β at β′ = β. Furthermore, by (2.5), (2.13) becomes an equality for β′ = β. Therefore, the lower bound function on the right-hand side of (2.13) is tangent to ℓn(β) at β′ = β. Meanwhile, according to (2.8), Tn(β) defines a modified curvature for the right-hand side of (2.13), which is obtained by taking the derivative with respect to the first argument β, setting β′ = β, and then differentiating once more with respect to β. The second equation in (2.12) shows that the obtained curvature equals the curvature of ℓn(β) at β = β* in expectation (up to a renormalization factor of n). Therefore, ∇1Qn(β; β) gives the score function and Tn(β*) gives a good estimate of the Fisher information I(β*). Since β* is unknown in practice, later we will use Tn(β̂) or Tn(β̂0) to approximate Tn(β*).

In the presence of the high dimensional nuisance parameter γ* ∈ ℝd−1, the classical score test is no longer applicable. In detail, the score test for H0 : α* = 0 relies on the following Taylor expansion of the score function ∂ℓn(·)/∂α

1/√n·∂ℓn(β̄0)/∂α = 1/√n·∂ℓn(β0*)/∂α + 1/√n·∂²ℓn(β0*)/∂α∂γ^⊤·(γ̄ − γ*) + R̄. (2.14)

Here β0* = [0, (γ*)], R̄ denotes the remainder term, and β̄0 = (0, γ̄), where γ̄ is an estimator of the nuisance parameter γ*. The asymptotic normality of 1/√n·∂ℓn(β̄0)/∂α in (2.14) relies on the fact that 1/√n·∂ℓn(β0*)/∂α and √n·(γ̄ − γ*) are jointly normal asymptotically and R̄ is o(1). In low dimensional settings, such a necessary condition holds for γ̄ being the maximum likelihood estimator. However, in high dimensional settings, the maximum likelihood estimator can’t guarantee that R̄ is o(1), since ‖γ̄ − γ*‖2 can be large due to the curse of dimensionality. Meanwhile, for γ̄ being sparsity-type estimators, in general the asymptotic normality of √n·(γ̄ − γ*) doesn’t hold. For example, let γ̄ be γ̂, where γ̂ ∈ ℝd−1 is the subvector of β̂, i.e., the estimator attained by the proposed high dimensional EM algorithm. Note that γ̂ has many zero entries due to the truncation step. As n → ∞, some entries of √n·(γ̂ − γ*) have limiting distributions with point mass at zero. Clearly, this limiting distribution is not Gaussian with nonzero variance. In fact, for a similar setting of high dimensional linear regression, Knight and Fu (2000) illustrate that for γ̄ being a subvector of the Lasso estimator and γ* being the corresponding subvector of the true parameter, the limiting distribution of √n·(γ̄ − γ*) is not Gaussian.

The decorrelated score function defined in (2.9) successfully addresses the above issues. In detail, according to (2.12) in Lemma 2.1 we have

√n·Sn(β̂0, λ) = 1/√n·∂ℓn(β̂0)/∂α − 1/√n·w(β̂0, λ)^⊤·∂ℓn(β̂0)/∂γ. (2.15)

Intuitively, if we replace w(β̂0, λ) with w ∈ ℝd−1 that satisfies

w^⊤·∂²ℓn(β*)/∂γ∂γ^⊤ = ∂²ℓn(β*)/∂α∂γ^⊤, (2.16)

we have the following Taylor expansion of the decorrelated score function

1/√n·∂ℓn(β̂0)/∂α − w^⊤/√n·∂ℓn(β̂0)/∂γ = [1/√n·∂ℓn(β*)/∂α − w^⊤/√n·∂ℓn(β*)/∂γ] (i) + [1/√n·∂²ℓn(β*)/∂α∂γ^⊤·(γ̂ − γ*) − w^⊤/√n·∂²ℓn(β*)/∂γ∂γ^⊤·(γ̂ − γ*)] (ii) + R̃, (2.17)

where term (ii) is zero by (2.16). Therefore, we no longer require the asymptotic normality of γ̂ − γ*. Also, we will prove that the new remainder term R̃ in (2.17) is o(1), since γ̂ has a fast statistical rate of convergence. Now we only need to find the w that satisfies (2.16). Nevertheless, it is difficult to calculate the second order derivatives in (2.16), because it is hard to evaluate ℓn(·). According to (2.12), we use the submatrices of Tn(·) to approximate the derivatives in (2.16). Since [Tn(β)]γ,γ is not invertible in high dimensions, we use the Dantzig selector in (2.10) to approximately solve the linear system in (2.16). Based on this intuition, we can expect that √n·Sn(β̂0, λ) is asymptotically normal, since term (i) in (2.17) is a (rescaled) average of n i.i.d. random variables to which we can apply the central limit theorem. Besides, we will prove that −[Tn(β̂0)]α|γ in (2.11) is a consistent estimator of the asymptotic variance of √n·Sn(β̂0, λ). Hence, we can expect that the decorrelated score statistic in (2.11) is asymptotically N(0, 1).

From a high-level perspective, we can view w(β̂0, λ)^⊤·∂ℓn(β̂0)/∂γ in (2.15) as the projection of ∂ℓn(β̂0)/∂α onto the span of ∂ℓn(β̂0)/∂γ, where w(β̂0, λ) is the projection coefficient. Intuitively, such a projection guarantees that in (2.15), Sn(β̂0, λ) is orthogonal to (uncorrelated with) ∂ℓn(β̂0)/∂γ, i.e., the score function with respect to the nuisance parameter γ. In this way, the projection corrects the effects of the high dimensional nuisance parameter. Following this intuition of decorrelation, we name Sn(β̂0, λ) the decorrelated score function.

Decorrelated Wald Test

Based on the decorrelated score test, we propose the decorrelated Wald test. In detail, recall that Tn(·) is defined in (2.8). Let

α̅(β̂, λ) = α̂ − {[Tn(β̂)]α,α − w(β̂, λ)^⊤·[Tn(β̂)]γ,α}^{-1}·Sn(β̂, λ), (2.18)

where Sn(·, ·) is the decorrelated score function in (2.9), w(·, ·) is defined in (2.10) and β̂ = (α̂, γ̂) is the estimator obtained by Algorithm 1. For testing the null hypothesis H0 : α* = 0, we define the decorrelated Wald statistic as

√n·α̅(β̂, λ)·{−[Tn(β̂)]α|γ}^{1/2}, (2.19)

where [Tn(β̂)]α|γ is defined by replacing β̂0 with β̂ in (2.11). In §4 we will prove that this statistic is asymptotically N(0, 1). Consequently, the decorrelated Wald test with significance level δ ∈ (0, 1) takes the form

ψW(δ) = 𝟙{√n·α̅(β̂, λ)·{−[Tn(β̂)]α|γ}^{1/2} ∉ [−Φ^{-1}(1 − δ/2), Φ^{-1}(1 − δ/2)]},

where Φ−1(·) is the inverse function of the Gaussian cumulative distribution function. If ψW(δ) = 1, we reject the null hypothesis H0 : α* = 0. The associated p-value takes the form

p-value = 2·[1 − Φ(|√n·α̅(β̂, λ)·{−[Tn(β̂)]α|γ}^{1/2}|)].

In more general settings where α* is possibly nonzero, in §4.1 we will prove that √n·[α̅(β̂, λ) − α*]·{−[Tn(β̂)]α|γ}^{1/2} is asymptotically N(0, 1). Hence, we construct the two-sided confidence interval for α* with confidence level 1 − δ as

[α̅(β̂, λ) − Φ^{-1}(1 − δ/2)/√(−n·[Tn(β̂)]α|γ), α̅(β̂, λ) + Φ^{-1}(1 − δ/2)/√(−n·[Tn(β̂)]α|γ)]. (2.20)
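Continuing the sketch started for the score test (again an illustration under our own conventions, not the authors' code), the one-step estimator (2.18), the Wald statistic (2.19) and the confidence interval (2.20) follow from the same ingredients evaluated at β̂; below, w is the Dantzig-selector solution w(β̂, λ) and the coordinate of interest is again placed first.

    import numpy as np
    from scipy.stats import norm

    def decorrelated_wald(beta_hat, grad_Qn, T_n, w, n, delta=0.05):
        # grad_Qn = grad_1 Qn(beta_hat; beta_hat), T_n = Tn(beta_hat).
        S_n = grad_Qn[0] - w @ grad_Qn[1:]                     # decorrelated score at beta_hat
        slope = T_n[0, 0] - w @ T_n[1:, 0]                     # [Tn]_{alpha,alpha} - w^T [Tn]_{gamma,alpha}
        alpha_bar = beta_hat[0] - S_n / slope                  # one Newton-Raphson step, cf. (2.18)
        v = np.concatenate([[1.0], -w])
        T_cond = v @ T_n @ v                                   # [Tn(beta_hat)]_{alpha|gamma}
        stat = np.sqrt(n) * alpha_bar * np.sqrt(-T_cond)       # Wald statistic (2.19)
        half = norm.ppf(1 - delta / 2) / np.sqrt(-n * T_cond)  # half-width of the interval (2.20)
        return alpha_bar, stat, (alpha_bar - half, alpha_bar + half)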

The intuition for the decorrelated Wald statistic defined in (2.19) can be understood as follows. For notational simplicity, we define

S̄n(β, ŵ) = [∇1Qn(β; β)]α − ŵ^⊤·[∇1Qn(β; β)]γ, where ŵ = w(β̂, λ). (2.21)

By the definitions in (2.9) and (2.21), we have S̄n(β̂, ŵ) = Sn(β̂, λ). According to the same intuition for the asymptotic normality of √n·Sn(β̂0, λ) in the decorrelated score test, we can easily establish the asymptotic normality of √n·Sn(β̂, λ) = √n·S̄n(β̂, ŵ). Based on the proof for the classical Wald test (van der Vaart, 2000), we can further establish the asymptotic normality of √n·α̱, where α̱ is defined as the solution to

S¯n[(α,γ^),ŵ]=0. (2.22)

Nevertheless, it is difficult to calculate α̱. Instead, we consider the first order Taylor approximation

S̄n(β̂, ŵ) + (α − α̂)·[∇S̄n(β̂, ŵ)]α = 0, where β̂ = (α̂, γ̂). (2.23)

Here ŵ is defined in (2.21) and the gradient is taken with respect to β in (2.21). According to (2.8) and (2.21), we have that in (2.23),

[∇S̄n(β̂, ŵ)]α = [∇²1,1Qn(β̂; β̂) + ∇²1,2Qn(β̂; β̂)]α,α − ŵ^⊤·[∇²1,1Qn(β̂; β̂) + ∇²1,2Qn(β̂; β̂)]γ,α = [Tn(β̂)]α,α − ŵ^⊤·[Tn(β̂)]γ,α, where ŵ = w(β̂, λ).

Hence, α̅(β̂, λ) defined in (2.18) is the solution to (2.23). Alternatively, we can view α̅(β̂, λ) as the output of one Newton-Raphson iteration applied to α̂. Since (2.23) approximates (2.22), intuitively α̅(β̂, λ) has similar statistical properties to α̱, i.e., the solution to (2.22). Therefore, we can expect that √n·α̅(β̂, λ) is asymptotically normal. Besides, we will prove that {−[Tn(β̂0)]α|γ}^{-1} is a consistent estimator of the asymptotic variance of √n·α̅(β̂, λ). Thus, the decorrelated Wald statistic in (2.19) is asymptotically N(0, 1). It is worth noting that, although the intuition of the decorrelated Wald statistic is the same as the one-step estimator proposed by Bickel (1975), here we don’t require the √n-consistency of the initial estimator α̂ to achieve the asymptotic normality of α̅(β̂, λ).

2.3 Applications to Latent Variable Models

In the sequel, we introduce three latent variable models as examples. To apply the high dimensional EM algorithm in §2.1 and the methods for asymptotic inference in §2.2, we only need to specify the forms of Qn(·; ·) defined in (2.5), Mn(·) in Algorithms 2 and 3, and Tn(·) in (2.8) for each model.

Gaussian Mixture Model

Let y1, …, yn be the n i.i.d. realizations of Y ∈ ℝd and

Y=Z·β*+V. (2.24)

Here Z is a Rademacher random variable, i.e., ℙ(Z = +1) = ℙ(Z = −1) = 1/2, and V ~ N(0, σ²·Id) is independent of Z, where σ is the standard deviation. We suppose σ is known. In high dimensional settings, we assume that β* ∈ ℝd is sparse. To avoid the degenerate case in which the two Gaussians in the mixture are identical, here we suppose that β* ≠ 0. See §A.1 for the detailed forms of Qn(·; ·), Mn(·) and Tn(·).
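For concreteness, the maximization M-step of this symmetric two-component mixture has a well-known closed form, a posterior-weighted average of the observations (see, e.g., Balakrishnan et al., 2014). The sketch below uses that standard form rather than transcribing §A.1, and it can be passed directly as the m_step callable of the earlier loop, e.g. high_dim_em(beta_init, gmm_m_step(Y, sigma), s_hat, T).

    import numpy as np

    def gmm_m_step(Y, sigma):
        # Y is the n-by-d data matrix; returns the exact M-step as a function of the current iterate.
        def m_step(beta):
            # Posterior probability that z_i = +1 given y_i under the current iterate beta.
            p = 1.0 / (1.0 + np.exp(-2.0 * (Y @ beta) / sigma ** 2))
            return ((2.0 * p - 1.0)[:, None] * Y).mean(axis=0)
        return m_step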

Mixture of Regression Model

We assume that Y ∈ ℝ and X ∈ ℝd satisfy

Y=Z·Xβ*+V, (2.25)

where X ~ N(0, Id), V ~ N(0, σ²) and Z is a Rademacher random variable. Here X, V and Z are independent. In the high dimensional regime, we assume β* ∈ ℝd is sparse. To avoid the degenerate case, we suppose β* ≠ 0. In addition, we assume that σ is known. See §A.2 for the detailed forms of Qn(·; ·), Mn(·) and Tn(·).

Regression with Missing Covariates

We assume that Y ∈ ℝ and X ∈ ℝd satisfy

Y=Xβ*+V, (2.26)

where X ~ N(0, Id) and V ~ N(0, σ²) are independent. We assume β* is sparse. Let x1, …, xn be the n realizations of X. We assume that each coordinate of xi is missing (unobserved) independently with probability pm ∈ [0, 1). We treat the missing covariates as the latent variable and suppose that σ is known. See §A.3 for the detailed forms of Qn(·; ·), Mn(·) and Tn(·).

3 Theory of Computation and Estimation

Before we lay out the main results, we introduce three technical conditions, which will significantly simplify our presentation. These conditions will be verified for the two implementations of the high dimensional EM algorithm and three latent variable models.

The first two conditions, proposed by Balakrishnan et al. (2014), characterize the properties of the population version lower bound function Q(·; ·), i.e., the expectation of Qn(·; ·) defined in (2.5). We define the respective population version M-step as follows. For the maximization implementation of the M-step (Algorithm 2), we define

M(β) = argmax_{β′} Q(β′; β). (3.1)

For the gradient ascent implementation of the M-step (Algorithm 3), we define

M(β) = β + η·∇1Q(β; β), (3.2)

where η > 0 is the stepsize in Algorithm 3. Hereafter, we employ ℬ to denote the basin of attraction, i.e., the local region in which the proposed high dimensional EM algorithm enjoys desired statistical and computational guarantees.

Condition 3.1

We define two versions of this condition.

  • Lipschitz-Gradient-1(γ1, ℬ). For the true parameter β* and any β ∈ ℬ, we have
    ‖∇1Q[M(β); β*] − ∇1Q[M(β); β]‖2 ≤ γ1·‖β − β*‖2, (3.3)
    where M(·) is the population version M-step (maximization implementation) defined in (3.1).
  • Lipschitz-Gradient-2(γ2, ℬ). For the true parameter β* and any β ∈ ℬ, we have
    ‖∇1Q(β; β*) − ∇1Q(β; β)‖2 ≤ γ2·‖β − β*‖2. (3.4)

Condition 3.1 defines a variant of Lipschitz continuity for ∇1Q(·; ·). Note that in (3.3) and (3.4), the gradient is taken with respect to the first variable of Q(·; ·). Meanwhile, the Lipschitz continuity is with respect to the second variable of Q(·; ·). Also, the Lipschitz property is defined only between the true parameter β* and an arbitrary β ∈ ℬ, rather than between two arbitrary β’s. In the sequel, we will use (3.3) and (3.4) in the analysis of the two implementations of the M-step respectively.

Condition 3.2

Concavity-Smoothness(μ, ν, ℬ). For any β1, β2 ∈ ℬ, Q(·; β*) is μ-smooth, i.e.,

Q(β1; β*) ≥ Q(β2; β*) + (β1 − β2)^⊤·∇1Q(β2; β*) − μ/2·‖β2 − β1‖2², (3.5)

and ν-strongly concave, i.e.,

Q(β1; β*) ≤ Q(β2; β*) + (β1 − β2)^⊤·∇1Q(β2; β*) − ν/2·‖β2 − β1‖2². (3.6)

This condition indicates that, when the second variable of Q(·; ·) is fixed to be β*, the function is ‘sandwiched’ between two quadratic functions. Conditions 3.1 and 3.2 are essential to establishing the desired optimization results. In detail, Condition 3.2 ensures that standard convex optimization results for strongly convex and smooth objective functions, e.g., in Nesterov (2004), can be applied to −Q(·; β*). Since our analysis will not only involve Q(·; β*), but also Q(·; β) for all β ∈ ℬ, we also need to quantify the difference between Q(·; β*) and Q(·; β) by Condition 3.1. It suggests that, this difference can be controlled in the sense that ∇1Q(·; β) is Lipschitz with respect to β. Consequently, for any β ∈ ℬ, the behavior of Q(·; β) mimics that of Q(·; β*). Hence, standard convex optimization results can be adapted to analyze −Q(·; β) for any β ∈ ℬ.

The third condition characterizes the statistical error between the sample version and population version M-steps, i.e., Mn(·) defined in Algorithms 2 and 3, and M(·) in (3.1) and (3.2). Recall that ‖ · ‖0 denotes the total number of nonzero entries in a vector.

Condition 3.3

Statistical-Error(ε, δ, s, n, ℬ). For any fixed β ∈ ℬ with ‖β‖0 ≤ s, we have that

‖M(β) − Mn(β)‖∞ ≤ ε (3.7)

holds with probability at least 1 − δ. Here ε > 0 possibly depends on δ, sparsity level s, sample size n, dimension d, as well as the basin of attraction ℬ.

In (3.7), the statistical error ε quantifies the ℓ∞-norm of the difference between the population version and sample version M-steps. Particularly, we constrain the input β of M(·) and Mn(·) to be s-sparse. Such a condition is different from the one used by Balakrishnan et al. (2014). In detail, they quantify the statistical error with the ℓ2-norm and don’t constrain the input of M(·) and Mn(·) to be sparse. Consequently, our subsequent statistical analysis differs from theirs. The reason we use the ℓ∞-norm is that it characterizes the more refined entrywise statistical error, which converges at a fast rate of √(log d/n) (possibly with extra factors depending on specific models). In comparison, the ℓ2-norm statistical error converges at a slow rate of √(d/n), which doesn’t decrease to zero as n increases when d ≫ n. Moreover, the fine-grained entrywise statistical error is crucial to quantifying the effects of the truncation step (line 6 of Algorithm 1) on the iterative solution sequence.

3.1 Main Results

Equipped with Conditions 3.1–3.3, we now lay out the computational and statistical results for the high dimensional EM algorithm. To simplify the technical analysis of the algorithm, we focus on its resampling version, which is illustrated in Algorithm 4.

Algorithm 4.

High Dimensional EM Algorithm with Resampling.

1: Parameter: Sparsity Parameter ŝ, Maximum Number of Iterations T
2: Initialization: 𝒮̂init ← supp(βinit, ŝ), β(0) ← trunc(βinit, 𝒮̂init),
    {supp(·, ·) and trunc(·, ·) are defined in (2.6) and (2.7)}
    Split the Dataset into T Subsets of Size n/T
    {Without loss of generality, we assume n/T is an integer}
3: For t = 0 to T − 1
4:   E-step: Evaluate Qn/T (β; β(t)) with the t-th Data Subset
5:   M-step: β(t+0.5) ← Mn/T(β(t))
  {Mn/T(·) is implemented as in Algorithm 2 or 3 with Qn(·; ·) replaced by Qn/T(·; ·)}
6:   T-step: 𝒮̂(t+0.5) ← supp(β(t+0.5), ŝ), β(t+1) ← trunc(β(t+0.5), 𝒮̂(t+0.5))
7: End For
8: Output: β̂ ← β(T)

Theorem 3.4

For Algorithm 4, we define ℬ = {β : ‖β − β*‖2 ≤ R}, where R = κ·‖β*‖2 for some κ ∈ (0, 1). We assume that Condition Concavity-Smoothness(μ, ν, ℬ) holds and ‖βinit − β*‖2 ≤ R/2.

  • For the maximization implementation of the M-step (Algorithm 2), we suppose that Condition Lipschitz-Gradient-1(γ1, ℬ) holds with γ1/ν ∈ (0, 1) and
    ŝ = C·max{16·(ν/γ1 − 1)^{-2}, 4·(1+κ)²/(1−κ)²}·s*, (3.8)
    (√ŝ + C′/(1−κ)·√s*)·ε ≤ min{[1 − √(γ1/ν)]²·R, (1−κ)²/[2·(1+κ)]·‖β*‖2}, (3.9)
    where C ≥ 1 and C′ > 0 are absolute constants. Under Condition Statistical-Error(ε, δ/T, ŝ, n/T, ℬ) we have that, for t = 1, …, T,
    ‖β(t) − β*‖2 ≤ (γ1/ν)^{t/2}·R (optimization error) + (√ŝ + C′/(1−κ)·√s*)·ε/[1 − √(γ1/ν)] (statistical error) (3.10)
    holds with probability at least 1 − δ, where C′ is the same constant as in (3.9).
  • For the gradient ascent implementation of the M-step (Algorithm 3), we suppose that Condition Lipschitz-Gradient-2(γ2, ℬ) holds with 1 − 2 · (ν − γ2)/(ν +μ) ∈ (0, 1) and the stepsize in Algorithm 3 is set to η = 2/(ν + μ). Meanwhile, we assume that
    ŝ = C·max{16·{1/[1 − 2·(ν−γ2)/(ν+μ)] − 1}^{-2}, 4·(1+κ)²/(1−κ)²}·s*, (3.11)
    (√ŝ + C′/(1−κ)·√s*)·ε ≤ min{{1 − √[1 − 2·(ν−γ2)/(ν+μ)]}²·R, (1−κ)²/[2·(1+κ)]·‖β*‖2}, (3.12)
    where C ≥ 1 and C′ > 0 are absolute constants. Under Condition Statistical-Error(ε, δ/T, ŝ, n/T, ℬ) we have that, for t = 1, …, T,
    ‖β(t) − β*‖2 ≤ [1 − 2·(ν−γ2)/(ν+μ)]^{t/2}·R (optimization error) + (√ŝ + C′/(1−κ)·√s*)·ε/{1 − √[1 − 2·(ν−γ2)/(ν+μ)]} (statistical error) (3.13)
    holds with probability at least 1 − δ, where C′ is the same constant as in (3.12).
Proof

See §5.1 for a detailed proof.

The assumptions in (3.8) and (3.11) state that the sparsity parameter ŝ in Algorithm 4 is chosen to be sufficiently large and also of the same order as the true sparsity level s*. These assumptions, which will be used by Lemma 5.1 in the proof of Theorem 3.4, ensure that the error incurred by the truncation step can be upper bounded. In addition, as will be shown for specific models in §3.2, the error term ε in Condition Statistical-Error(ε, δ/T, ŝ, n/T, ℬ) decreases as the sample size n increases. By the assumptions in (3.8) and (3.11), (√ŝ + C′/(1−κ)·√s*) is of the same order as √s*. Therefore, the assumptions in (3.9) and (3.12) suggest that the sample size n is sufficiently large such that √s*·ε is sufficiently small. These assumptions guarantee that the entire iterative solution sequence remains within the basin of attraction ℬ in the presence of statistical error. The assumptions in (3.8), (3.9), (3.11), (3.12) will be more explicit as we specify the values of γ1, γ2, ν, μ and κ for specific models.

Theorem 3.4 illustrates that the upper bound of the overall estimation error can be decomposed into two terms. The first term is the upper bound of the optimization error, which decreases to zero at a geometric rate of convergence, because we have γ1/ν < 1 in (3.10) and 1 − 2·(ν − γ2)/(ν + μ) < 1 in (3.13). Meanwhile, the second term is the upper bound of the statistical error, which doesn’t depend on t. Since (√ŝ + C′/(1−κ)·√s*) is of the same order as √s*, this term is proportional to √s*·ε, where ε is the entrywise statistical error between M(·) and Mn(·). We will prove that, for a variety of models and the two implementations of the M-step, ε is roughly of the order √(log d/n). (There may be extra factors attached to ε depending on each specific model.) Hence, the statistical error term is roughly of the order √(s*·log d/n). Consequently, for a sufficiently large t = T such that the optimization and statistical error terms in (3.10) or (3.13) are of the same order, the final estimator β̂ = β(T) attains a (near-)optimal √(s*·log d/n) (possibly with extra factors) rate of convergence for a broad variety of high dimensional latent variable models.
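As a quick back-of-the-envelope aid (our own illustration with made-up numbers, not a prescription from the paper), one can solve for the smallest T at which the geometric optimization error in (3.10) or (3.13) drops below a target statistical error:

    import numpy as np

    def num_iterations(rho, R, stat_err):
        # Smallest T with rho**(T/2) * R <= stat_err, where rho is the contraction factor
        # (gamma_1/nu in (3.10), or 1 - 2(nu - gamma_2)/(nu + mu) in (3.13)).
        return int(np.ceil(2.0 * np.log(R / stat_err) / np.log(1.0 / rho)))

    # Illustrative numbers only: rho = 0.5, R = 1 and a statistical error of 0.1
    # give num_iterations(0.5, 1.0, 0.1) == 7, i.e. a handful of iterations suffice.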

3.2 Implications for Specific Models

To establish the corresponding results for specific models under the unified framework, we only need to establish Conditions 3.1–3.3 and determine the key quantities R, γ1, γ2, ν, μ, κ and ε. Recall that Conditions 3.1 and 3.2 and the models analyzed in our paper are identical to those in Balakrishnan et al. (2014). Meanwhile, note that Conditions 3.1 and 3.2 only involve the population version lower bound function Q(·; ·) and M-step M(·). Since Balakrishnan et al. (2014) prove that the quantities in Conditions 3.1 and 3.2 are independent of the dimension d and sample size n, their corresponding results can be directly adapted. To establish the final results, it still remains to verify Condition 3.3 for each high dimensional latent variable model.

Gaussian Mixture Model

The following lemma, which is proved by Balakrishnan et al. (2014), verifies Conditions 3.1 and 3.2 for Gaussian mixture model. Recall that σ is the standard deviation of each individual Gaussian distribution within the mixture.

Lemma 3.5

Suppose we have ‖β*‖2/σ ≥ r, where r > 0 is a sufficiently large constant that denotes the minimum signal-to-noise ratio. There exists some absolute constant C > 0 such that Conditions Lipschitz-Gradient-1(γ1, ℬ) and Concavity-Smoothness(μ, ν, ℬ) hold with

γ1 = exp(−C·r²), μ = ν = 1, ℬ = {β : ‖β − β*‖2 ≤ R} with R = κ·‖β*‖2, κ = 1/4. (3.14)
Proof

See the proof of Corollary 1 in Balakrishnan et al. (2014) for details.

Now we verify Condition 3.3 for the maximization implementation of the M-step (Algorithm 2).

Lemma 3.6

For the maximization implementation of the M-step (Algorithm 2), we have that for a sufficiently large n and ℬ specified in (3.14), Condition Statistical-Error(ε, δ, s, n, ℬ) holds with

ε = C·(‖β*‖∞ + σ)·√{[log d + log(2/δ)]/n}. (3.15)
Proof

See §B.4 for a detailed proof.

The next theorem establishes the implication of Theorem 3.4 for Gaussian mixture model.

Theorem 3.7

We consider the maximization implementation of the M-step (Algorithm 2). We assume ‖β*‖2/σ ≥ r holds with a sufficiently large r > 0, and ℬ and R are as defined in (3.14). We suppose the initialization βinit of Algorithm 4 satisfies ‖βinit − β*‖2 ≤ R/2. Let the sparsity parameter ŝ be

ŝ = C′·max{16·[exp(C·r²) − 1]^{-2}, 100/9}·s* (3.16)

with C specified in (3.14) and C′ ≥ 1. Let the total number of iterations T in Algorithm 4 be

T = log{C·R/[ΔGMM(s*)·√(log d/n)]}/(C·r²/2), where ΔGMM(s*) = (√ŝ + C·√s*)·(‖β*‖∞ + σ). (3.17)

Meanwhile, we suppose the dimension d is sufficiently large such that T in (3.17) is upper bounded by d, and the sample size n is sufficiently large such that

C·ΔGMM(s*)·√(log d·T/n) ≤ min{[1 − exp(−C·r²/2)]²·R, 9/40·‖β*‖2}. (3.18)

We have that, with probability at least 1 − 2 · d−1/2, the final estimator β̂ = β(T) satisfies

‖β̂ − β*‖2 ≤ C·ΔGMM(s*)/[1 − exp(−C·r²/2)]·√(log d·T/n). (3.19)
Proof

First we plug the quantities in (3.14) and (3.15) into Theorem 3.4. Recall that Theorem 3.4 requires Condition Statistical-Error(ε, δ/T, ŝ, n/T, ℬ). Thus we need to replace δ and n with δ/T and n/T in the definition of ε in (3.15). Then we set δ = 2·d^{-1/2}. Since T is specified in (3.17) and the dimension d is sufficiently large such that T ≤ d, we have log[2/(δ/T)] ≤ 2·log d in the definition of ε. By (3.16) and (3.18), we can then verify the assumptions in (3.8) and (3.9). Finally, by plugging T in (3.17) into (3.10) and taking t = T, we can verify that the optimization error term in (3.10) is smaller than the statistical error term up to a constant factor. Therefore, we obtain (3.19).

To see the statistical rate of convergence with respect to s*, d and n, for the moment we assume that R, r, ‖β*‖∞, ‖β*‖2 and σ are absolute constants. From (3.16) and (3.17), we obtain ŝ = C·s* and therefore ΔGMM(s*) = C·√s*, which implies T = C·log[C·n/(s*·log d)]. Hence, according to (3.19) we have that, with high probability,

‖β̂ − β*‖2 ≤ C·√(s*·log d·log n/n).

Because the minimax lower bound for estimating an s*-sparse d-dimensional vector is √(s*·log d/n), the rate of convergence in (3.19) is optimal up to a factor of √(log n). Such a logarithmic factor results from the resampling scheme in Algorithm 4, since we only utilize n/T samples within each iteration. We expect that by directly analyzing Algorithm 1 we can eliminate such a logarithmic factor, which however incurs extra technical complexity for the analysis.

Mixture of Regression Model

The next lemma, proved by Balakrishnan et al. (2014), verifies Conditions 3.1 and 3.2 for mixture of regression model. Recall that β* is the regression coefficient and σ is the standard deviation of the Gaussian noise.

Lemma 3.8

Suppose ‖β*‖2/σ ≥ r, where r > 0 is a sufficiently large constant that denotes the required minimum signal-to-noise ratio. Conditions Lipschitz-Gradient-1(γ1, ℬ), Lipschitz-Gradient-2(γ2, ℬ) and Concavity-Smoothness(μ, ν, ℬ) hold with

γ1 ∈ (0, 1/2), γ2 ∈ (0, 1/4), μ = ν = 1,
ℬ = {β : ‖β − β*‖2 ≤ R} with R = κ·‖β*‖2, κ = 1/32. (3.20)
Proof

See the proof of Corollary 3 in Balakrishnan et al. (2014) for details.

The following lemma establishes Condition 3.3 for the two implementations of the M-step.

Lemma 3.9

For ℬ specified in (3.20), we have the following results.

  • For the maximization implementation of the M-step (line 5 of Algorithm 4), we have that Condition Statistical-Error(ε, δ, s, n, ℬ) holds with
    ε = C·[max{√(‖β*‖2² + σ²), 1} + ‖β*‖2]·√{[log d + log(4/δ)]/n} (3.21)
    for sufficiently large sample size n and absolute constant C > 0.
  • For the gradient ascent implementation, Condition Statistical-Error(ε, δ, s, n, ℬ) holds with
    ε = C·η·max{√(‖β*‖2² + σ²), 1, √s·‖β*‖2}·√{[log d + log(4/δ)]/n} (3.22)
    for sufficiently large sample size n and C > 0, where η denotes the stepsize in Algorithm 3.
Proof

See §B.5 for a detailed proof.

The next theorem establishes the implication of Theorem 3.4 for mixture of regression model.

Theorem 3.10

Let ‖β*‖2/σ ≥ r with r > 0 sufficiently large. Assuming that ℬ and R are specified in (3.20) and the initialization βinit satisfies ‖βinit − β*‖2 ≤ R/2, we have the following results.

  • For the maximization implementation of the M-step (Algorithm 2), let ŝ and T be
    ŝ = C·max{16, 132/31}·s*, T = log{C·R/[Δ1MR(s*)·√(log d/n)]}/log 2, where Δ1MR(s*) = (√ŝ + C·√s*)·[max{√(‖β*‖2² + σ²), 1} + ‖β*‖2], and C ≥ 1.
    We suppose d and n are sufficiently large such that T ≤ d and
    C·Δ1MR(s*)·√(log d·T/n) ≤ min{(1 − 1/√2)²·R, 3/8·‖β*‖2}.
    Then with probability at least 1 − 4 · d−1/2, the final estimator β̂ = β(T) satisfies
    ‖β̂ − β*‖2 ≤ C·Δ1MR(s*)·√(log d·T/n). (3.23)
  • For the gradient ascent implementation of the M-step (Algorithm 3) with stepsize set to η = 1, let ŝ and T be
    ŝ = C·max{16/9, 132/31}·s*, T = log{C·R/[Δ2MR(s*)·√(log d/n)]}/log 2, where Δ2MR(s*) = (√ŝ + C·√s*)·max{√(‖β*‖2² + σ²), 1, √ŝ·‖β*‖2}, and C ≥ 1.
    We suppose d and n are sufficiently large such that T ≤ d and
    C·Δ2MR(s*)·√(log d·T/n) ≤ min{R/4, 3/8·‖β*‖2}.
    Then with probability at least 1 − 4 · d−1/2, the final estimator β̂ = β(T) satisfies
    ‖β̂ − β*‖2 ≤ C·Δ2MR(s*)·√(log d·T/n). (3.24)
Proof

The proof is similar to Theorem 3.7.

To understand the intuition of Theorem 3.10, we suppose that ‖β*‖2, σ, R and r are absolute constants, which implies ŝ = C·s* and Δ1MR(s*) = C·√s*, Δ2MR(s*) = C·s*. Therefore, for the maximization and gradient ascent implementations of the M-step, we have T = C′·log[n/(s*·log d)] and T = C″·log{n/[(s*)²·log d]} correspondingly. Hence, by (3.23) in Theorem 3.10 we have that, for the maximization implementation of the M-step,

‖β̂ − β*‖2 ≤ C·√(s*·log d·log n/n) (3.25)

holds with high probability. Meanwhile, from (3.24) in Theorem 3.10 we have that, for the gradient ascent implementation of the M-step,

‖β̂ − β*‖2 ≤ C·s*·√(log d·log n/n) (3.26)

holds with high probability. The statistical rates in (3.25) and (3.26) attain the √(s*·log d/n) minimax lower bound up to factors of √(log n) and √(s*·log n) respectively and are therefore near-optimal. Note that the statistical rate of convergence attained by the gradient ascent implementation of the M-step is slower by a √s* factor than the rate of the maximization implementation. However, our discussion in §A.2 illustrates that, for mixture of regression model, the gradient ascent implementation doesn’t involve estimating the inverse covariance of X in (2.25). Hence, the gradient ascent implementation is more computationally efficient, and is also applicable to the settings in which X has more general covariance structures.

Regression with Missing Covariates

Recall that we consider the linear regression model with covariates missing completely at random, which is defined in §2.3. The next lemma, which is proved by Balakrishnan et al. (2014), verifies Conditions 3.1 and 3.2 for this model. Recall that pm denotes the probability that each covariate is missing and σ is the standard deviation of the Gaussian noise.

Lemma 3.11

Suppose ‖β*‖2/σ ≤ r, where r > 0 is a constant that denotes the required maximum signal-to-noise ratio. Assuming that we have

pm < 1/(1 + 2·b + 2·b²), where b = r²·(1+κ)²,

for some constant κ ∈ (0, 1), then we have that Conditions Lipschitz-Gradient-2(γ2, ℬ) and Concavity-Smoothness(μ, ν, ℬ) hold with

γ2 = [b + pm·(1 + 2·b + 2·b²)]/(1 + b) < 1, μ = ν = 1, (3.27)
ℬ = {β : ‖β − β*‖2 ≤ R} with R = κ·‖β*‖2.
Proof

See the proof of Corollary 6 in Balakrishnan et al. (2014) for details.

The next lemma proves Condition 3.3 for the gradient ascent implementation of the M-step.

Lemma 3.12

Suppose ℬ is defined in (3.27) and ‖β*‖2/σ ≤ r. For the gradient ascent implementation of the M-step (Algorithm 3), Condition Statistical-Error(ε, δ, s, n, ℬ) holds with

ε = C·η·[√s·‖β*‖2·(1+κ)·(1+κ·r)² + max{(1+κ·r)², σ² + ‖β*‖2²}]·√{[log d + log(12/δ)]/n} (3.28)

for sufficiently large sample size n and C > 0. Here η denotes the stepsize in Algorithm 3.

Proof

See §B.6 for a detailed proof.

Next we establish the implication of Theorem 3.4 for regression with missing covariates.

Theorem 3.13

We consider the gradient ascent implementation of the M-step (Algorithm 3) in which η = 1. Under the assumptions of Lemma 3.11, we suppose the sparsity parameter ŝ takes the form of (3.11) and the initialization βinit satisfies ‖βinit − β*‖2 ≤ R/2 with ν, μ, γ2, κ and R specified in (3.27). For r > 0 specified in Lemma 3.11, let the total number of iterations T in Algorithm 4 be

T = log{C·R/[ΔRMC(s*)·√(log d/n)]}/log(1/γ2), where ΔRMC(s*) = (√ŝ + C·√s*/(1−κ))·[√ŝ·‖β*‖2·(1+κ)·(1+κ·r)² + max{(1+κ·r)², σ² + ‖β*‖2²}]. (3.29)

We assume d and n are sufficiently large such that T ≤ d and

C·ΔRMC(s*)·√(log d·T/n) ≤ min{(1 − √γ2)²·R, (1−κ)²/[2·(1+κ)]·‖β*‖2}.

We have that, with probability at least 1 − 12 · d−1/2, the final estimator β̂ = β(T) satisfies

‖β̂ − β*‖2 ≤ C·ΔRMC(s*)/(1 − √γ2)·√(log d·T/n). (3.30)
Proof

The proof is similar to Theorem 3.7.

Assuming r, pm and κ are absolute constants, we have ΔRMC(s*) = C·s* in (3.29) since ŝ = C′·s*. Hence, we obtain T = C″·log{n/[(s*)2·log d]}. By (3.30) we have that, with high probability,

‖β̂ − β*‖2 ≤ C·s*·√(log d·log n/n),

which is near-optimal with respect to the √(s*·log d/n) minimax lower bound. It is worth noting that the assumption ‖β*‖2/σ ≤ r in Lemma 3.12 requires the signal-to-noise ratio to be upper bounded rather than lower bounded, which differs from the assumptions for the previous two models. Such a counter-intuitive phenomenon is explained by Loh and Wainwright (2012).

4 Theory of Inference

To simplify the presentation of the unified framework, we first lay out several technical conditions, which will later be verified for each model. Some of the notations used in this section are introduced in §2.2. The first condition characterizes the statistical rate of convergence of the estimator attained by the high dimensional EM algorithm (Algorithm 4).

Condition 4.1 Parameter-Estimation(ζEM)

It holds that

‖β̂ − β*‖1 = O(ζEM),

where ζEM scales with s*, d and n.

Since both β̂ and β* are sparse, we can verify this condition for each model based on the ℓ2-norm recovery results in Theorems 3.7, 3.10 and 3.13. The second condition quantifies the statistical error between the gradients of Qn(β*; β*) and Q(β*; β*).

Condition 4.2 Gradient-Statistical-Error(ζG)

For the true parameter β*, it holds that

‖∇1Qn(β*; β*) − ∇1Q(β*; β*)‖∞ = O(ζG),

where ζG scales with s*, d and n.

Note that for the gradient ascent implementation of the M-step (Algorithm 3) and its population version defined in (3.2), we have

‖Mn(β*) − M(β*)‖∞ = η·‖∇1Qn(β*; β*) − ∇1Q(β*; β*)‖∞.

Thus, we can verify Condition 4.2 for each model following the proof of Lemmas 3.6, 3.9 and 3.12. Recall that Tn(·) is defined in (2.8). The following condition quantifies the difference between Tn(β*) and its expectation. Recall that ‖·‖∞,∞ denotes the maximum magnitude of all entries in a matrix.

Condition 4.3 Tn(·)-Concentration(ζT)

For the true parameter β*, it holds that

‖Tn(β*) − 𝔼β*[Tn(β*)]‖∞,∞ = O(ζT),

where ζT scales with d and n.

By Lemma 2.1 we have 𝔼β* [Tn(β*)] = −I(β*). Hence, Condition 4.3 characterizes the statistical rate of convergence of Tn(β*) for estimating the Fisher information. Since β* is unknown in practice, we use Tn(β̂) or Tn(β̂0) to approximate Tn(β*), where β̂0 is defined in (2.11). The next condition quantifies the accuracy of this approximation.

Condition 4.4 Tn(·)-Lipschitz(ζL)

For the true parameter β* and any β, we have

‖Tn(β) − Tn(β*)‖∞,∞ = O(ζL)·‖β − β*‖1,

where ζL scales with d and n.

Condition 4.4 characterizes the Lipschitz continuity of Tn(·). We consider β = β̂. Since Condition 4.1 ensures ‖β̂β*‖1 converges to zero at the rate of ζEM, Condition 4.4 implies Tn(β̂) converges to Tn(β*) entrywise at the rate of ζEM·ζL. In other words, Tn(β̂) is a good approximation of Tn(β*).

In the following, we lay out an assumption on several population quantities and the sample size n. Recall that β* = [α*, (γ*)], where α* ∈ ℝ is the entry of interest, while γ* ∈ ℝd−1 is the nuisance parameter. By the notations introduced in §2.2, [I(β*)]γ,γ ∈ ℝ(d−1)×(d−1) and [I(β*)]γ,α ∈ ℝ(d−1)×1 denote the submatrices of the Fisher information matrix I(β*) ∈ ℝd×d. We define w*, sw* and 𝒮w* as

w* = {[I(β*)]γ,γ}^{-1}·[I(β*)]γ,α ∈ ℝd−1, sw* = ‖w*‖0, and 𝒮w* = supp(w*). (4.1)

We define λ1 [I(β*)] and λd [I(β*)] as the largest and smallest eigenvalues of I(β*), and

[I(β*)]α|γ = [I(β*)]α,α − {[I(β*)]γ,α}^⊤·{[I(β*)]γ,γ}^{-1}·[I(β*)]γ,α. (4.2)

According to (4.1) and (4.2), we can easily verify that

[I(β*)]α|γ = [1, −(w*)^⊤]·I(β*)·[1, −(w*)^⊤]^⊤. (4.3)

The following assumption ensures that λd [I(β*)] >0. Hence, [I(β*)]γ,γ in (4.1) is invertible. Also, according to (4.3) and the fact that λd[I(β*)] >0, we have [I(β*)]α|γ >0.

Assumption 4.5

We impose the following assumptions.

  • For positive absolute constants ρmax and ρmin, we assume
    ρmax ≥ λ1[I(β*)] ≥ λd[I(β*)] ≥ ρmin > 0, [I(β*)]α|γ = O(1), {[I(β*)]α|γ}^{-1} = O(1). (4.4)
  • The tuning parameter λ of the Dantzig selector in (2.10) is set to
    λ = C·(ζT + ζL·ζEM)·(1 + ‖w*‖1), (4.5)
    where C ≥ 1 is a sufficiently large absolute constant. We suppose the sample size n is sufficiently large such that
    max{‖w*‖1, 1}·sw*·λ = o(1), ζEM = o(1), sw*·λ·ζG = o(1/√n), (4.6)
    λ·ζEM = o(1/√n), max{1, ‖w*‖1}·ζL·(ζEM)² = o(1/√n).

The assumption on λd[I(β*)] guarantees that the Fisher information matrix is positive definite. The other assumptions in (4.4) guarantee the existence of the asymptotic variance of √n·Sn(β̂0, λ) in the score statistic defined in (2.11) and that of √n·α̅(β̂, λ) in the Wald statistic defined in (2.19). Similar assumptions are standard in existing asymptotic inference results. For example, for mixture of regression model, Khalili and Chen (2007) impose variants of these assumptions.

For specific models, we will show that ζEM, ζG, ζT and λ all decrease with n, while ζL increases with n at a slow rate. Therefore, the assumptions in (4.6) ensure that the sample size n is sufficiently large. We will make these assumptions more explicit after we specify ζEM, ζG, ζT and ζL for each model. Note the assumptions in (4.6) imply that sw* = ‖w*‖0 needs to be small. For instance, for λ specified in (4.5), max{‖w*‖1, 1}·sw*·λ = o(1) in (4.6) implies sw*·ζT = o(1). In the following, we will prove that ζT is of the order √(log d/n). Hence, we require that sw* = o(√(n/log d)), which is much smaller than d − 1, i.e., w* ∈ ℝd−1 is sparse. We can understand this sparsity assumption as follows.

  • According to the definition of w* in (4.1), we have [I(β*)]γ,γ·w* = [I(β*)]γ,α. Therefore, such a sparsity assumption suggests that [I(β*)]γ,α lies within the span of a few columns of [I(β*)]γ,γ.

  • By the block matrix inversion formula we have {[I(β*)]^{-1}}γ,α = δ·w*, where δ ∈ ℝ. Hence, it can also be understood as a sparsity assumption on the (d − 1) × 1 submatrix of [I(β*)]^{-1}.
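Concretely, the scalar δ in the bullet above can be made explicit. Applying the standard block matrix inversion identity to I(β*) partitioned according to (α, γ), and using the definitions in (4.1) and (4.2), gives

    {[I(β*)]^{-1}}γ,α = −{[I(β*)]γ,γ}^{-1}·[I(β*)]γ,α·{[I(β*)]α|γ}^{-1} = δ·w*, with δ = −{[I(β*)]α|γ}^{-1},

so δ is minus the reciprocal of the partial Fisher information [I(β*)]α|γ, which is nonzero under Assumption 4.5.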

Such a sparsity assumption on w* is necessary, because otherwise it is difficult to accurately estimate w* in high dimensional regimes. In the context of high dimensional generalized linear models, Zhang and Zhang (2014); van de Geer et al. (2014) impose similar sparsity assumptions.
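To make the role of w* concrete, the following is a minimal sketch of how a Dantzig-type estimator of w* can be computed as a linear program. It assumes the program in (2.10) takes the form min ‖w‖1 subject to ‖A·w − b‖∞ ≤ λ, with A and b the (γ, γ) and (γ, α) blocks of an estimate of the Fisher information (e.g., −Tn(β̂0)); the block layout, the function name dantzig_w, and the SciPy-based solver are illustrative assumptions rather than the implementation used in the paper.

import numpy as np
from scipy.optimize import linprog

def dantzig_w(I_hat, lam):
    # I_hat: d x d estimate of the Fisher information, e.g. -T_n(beta0_hat);
    # the coordinate of interest alpha is assumed to sit in position 0.
    A = I_hat[1:, 1:]                       # (gamma, gamma) block
    b = I_hat[1:, 0]                        # (gamma, alpha) block
    p = A.shape[0]
    # Write w = u - v with u, v >= 0, so that ||w||_1 = sum(u) + sum(v),
    # and impose |A (u - v) - b| <= lam entrywise.
    c = np.ones(2 * p)
    A_ub = np.vstack([np.hstack([A, -A]),    #  A(u - v) - b <= lam
                      np.hstack([-A, A])])   # -A(u - v) + b <= lam
    b_ub = np.concatenate([lam + b, lam - b])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    u, v = res.x[:p], res.x[p:]
    return u - v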

4.1 Main Results

Equipped with Conditions 4.1–4.4 and Assumption 4.5, we are now ready to establish our theoretical results to justify the inferential methods proposed in §2.2. We first cover the decorrelated score test and then the decorrelated Wald test. Finally, we establish the optimality of our proposed methods.

Decorrelated Score Test

The next theorem establishes the asymptotic normality of the decorrelated score statistic defined in (2.11).

Theorem 4.6

We consider β* = [α*, (γ*)⊤]⊤ with α* = 0. Under Assumption 4.5 and Conditions 4.1–4.4, we have that for n → ∞,

√n·Sn(β̂0, λ)/{−[Tn(β̂0)]α|γ}^{1/2} →D N(0, 1), (4.7)

where β̂0 and [Tn(β̂0)]α|γ ∈ ℝ are defined in (2.11).

Proof

See §5.2 for a detailed proof.
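To illustrate how the statistic in (4.7) is assembled, the following sketch combines the decorrelated score with its variance estimate, assuming Sn(β̂0, λ) takes the form [∇1Qn(β̂0; β̂0)]α − w(β̂0, λ)⊤·[∇1Qn(β̂0; β̂0)]γ as in (5.19) and that [Tn(β̂0)]α|γ is the quadratic form [1, −ŵ0⊤]·Tn(β̂0)·[1, −ŵ0⊤]⊤; the function signature is an illustrative assumption.

import numpy as np
from scipy.stats import norm

def decorrelated_score_test(grad_Q, T_hat, w_hat, n):
    # grad_Q: gradient of Q_n at beta0_hat (alpha is coordinate 0); T_hat: T_n(beta0_hat);
    # w_hat: Dantzig-type estimate of w*; n: sample size.
    S_n = grad_Q[0] - w_hat @ grad_Q[1:]         # decorrelated score, cf. (5.19)
    v = np.concatenate(([1.0], -w_hat))
    T_cond = v @ T_hat @ v                       # [T_n(beta0_hat)]_{alpha|gamma}
    stat = np.sqrt(n) * S_n / np.sqrt(-T_cond)   # -T_cond estimates [I(beta*)]_{alpha|gamma}
    pval = 2.0 * norm.sf(abs(stat))              # two-sided p-value against N(0, 1)
    return stat, pval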

Decorrelated Wald Test

The next theorem provides the asymptotic normality of √n·[α̅(β̂, λ) − α*]·{−[Tn(β̂)]α|γ}^{1/2}, which implies the confidence interval for α* in (2.20) is valid. In particular, for testing the null hypothesis H0: α* = 0, the next theorem implies the asymptotic normality of the decorrelated Wald statistic defined in (2.19).

Theorem 4.7

Under Assumption 4.5 and Conditions 4.1–4.4, we have that for n → ∞,

√n·[α̅(β̂, λ) − α*]·{−[Tn(β̂)]α|γ}^{1/2} →D N(0, 1), (4.8)

where [Tn(β̂)]α|γ ∈ ℝ is defined by replacing β̂0 with β̂ in (2.11).

Proof

See §5.2 for a detailed proof.
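In practice, (4.8) is used through the confidence interval in (2.20). The sketch below forms that interval from the decorrelated estimator α̅(β̂, λ) of (2.18) and the quantity [Tn(β̂)]α|γ; treating −[Tn(β̂)]α|γ as the variance estimate is the only ingredient taken from the theorem, and the rest is standard normal-quantile arithmetic.

import numpy as np
from scipy.stats import norm

def wald_confidence_interval(alpha_bar, T_cond, n, level=0.95):
    # alpha_bar: decorrelated estimator of alpha*; T_cond: [T_n(beta_hat)]_{alpha|gamma}.
    se = 1.0 / np.sqrt(n * (-T_cond))   # -T_cond is the estimate of [I(beta*)]_{alpha|gamma}
    z = norm.ppf(0.5 + level / 2.0)     # two-sided normal quantile
    return alpha_bar - z * se, alpha_bar + z * se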

Optimality

In Lemma 5.3 in the proof of Theorem 4.6 we show that the limiting variance of the decorrelated score function √n·Sn(β̂0, λ) is [I(β*)]α|γ, which is defined in (4.2). In fact, van der Vaart (2000) shows that, for inferring α* in the presence of the nuisance parameter γ*, [I(β*)]α|γ is the semiparametric efficient information, i.e., the minimum limiting variance of the (rescaled) score function. Our proposed decorrelated score function achieves this information lower bound and is hence optimal. Since the decorrelated Wald test and the corresponding confidence interval are built on the decorrelated score function, the limiting variance of √n·[α̅(β̂, λ) − α*] and the length of the confidence interval are also optimal in terms of the semiparametric information lower bound.

4.2 Implications for Specific Models

To establish the high dimensional inference results for each model, we only need to verify Conditions 4.1–4.4 and determine the key quantities ζEM, ζG, ζT and ζL. In the following, we focus on Gaussian mixture and mixture of regression models.

Gaussian Mixture Model

The following lemma verifies Conditions 4.1 and 4.2.

Lemma 4.8

We have that Conditions 4.1 and 4.2 hold with

ζEM = [√ŝ·ΔGMM(s*)]/[1 − exp(−C·r²/2)]·√(T·log d/n), and ζG = (‖β*‖∞ + σ)·√(log d/n),

where ŝ, ΔGMM(s*), r and T are as defined in Theorem 3.7.

Proof

See §C.6 for a detailed proof.

By our discussion that follows Theorem 3.7, ŝ and ΔGMM(s*) are of the same order as s* and √s* respectively, and T is roughly of the order log n. Therefore, ζEM is roughly of the order s*·√(log d·log n/n). The following lemma verifies Condition 4.3 for Gaussian mixture model.

Lemma 4.9

We have that Condition 4.3 holds with

ζT = (‖β*‖∞² + σ²)/σ²·√(log d/n).
Proof

See §C.7 for a detailed proof.

The following lemma establishes Condition 4.4 for Gaussian mixture model.

Lemma 4.10

We have that Condition 4.4 holds with

ζL = (‖β*‖∞² + σ²)^{3/2}/σ⁴·(log d + log n)^{3/2}.
Proof

See §C.8 for a detailed proof.

Equipped with Lemmas 4.8–4.10, we establish the inference results for Gaussian mixture model.

Theorem 4.11

Under Assumption 4.5, we have that for n→∞, (4.7) and (4.8) hold for Gaussian mixture model. Also, the proposed confidence interval for α* in (2.20) is valid.

In fact, for Gaussian mixture model we can make (4.6) in Assumption 4.5 more transparent by plugging in ζEM, ζG, ζT and ζL specified above. Particularly, for simplicity we assume all quantities except sw*, s*, d and n are absolute constants. Then we can verify that (4.6) holds if

max{sw*,s*}2·(s*)2·(log d)5=o[n/(log n)2]. (4.9)

According to the discussion following Theorem 3.7, we require s*·log d = o(n/log n) for the estimator β̂ to be consistent. In comparison, (4.9) illustrates that high dimensional inference requires a higher sample complexity than parameter estimation. In the context of high dimensional generalized linear models, Zhang and Zhang (2014); van de Geer et al. (2014) also observe the same phenomenon.

Mixture of Regression Model

The following lemma verifies Conditions 4.1 and 4.2. Recall that ŝ, T, Δ1MR(s*) and Δ2MR(s*) are defined in Theorem 3.10, while σ denotes the standard deviation of the Gaussian noise in mixture of regression model.

Lemma 4.12

We have that Conditions 4.1 and 4.2 hold with

ζEM = √ŝ·ΔMR(s*)·√(T·log d/n), and ζG = max{‖β*‖2² + σ², 1, √s*·‖β*‖2}·√(log d/n),

where we have ΔMR(s*)=Δ1MR(s*) for the maximization implementation of the M-step (Algorithm 2), and ΔMR(s*)=Δ2MR(s*) for the gradient ascent implementation of the M-step (Algorithm 3).

Proof

See §C.9 for a detailed proof.

By our discussion that follows Theorem 3.10, ŝ is of the same order as s*. For the maximization implementation of the M-step (Algorithm 2), ΔMR(s*) = Δ1MR(s*) is of the same order as √s*. Meanwhile, for the gradient ascent implementation in Algorithm 3, ΔMR(s*) = Δ2MR(s*) is of the same order as s*. Hence, ζEM is of the order s*·√(log d·log n/n) or (s*)^{3/2}·√(log d·log n/n) correspondingly, since T is roughly of the order log n. The next lemma establishes Condition 4.3 for mixture of regression model.

Lemma 4.13

We have that Condition 4.3 holds with

ζT = (log n + log d)·[(log n + log d)·‖β*‖1² + σ²]/σ²·√(log d/n).
Proof

See §C.10 for a detailed proof.

The following lemma establishes Condition 4.4 for mixture of regression model.

Lemma 4.14

We have that Condition 4.4 holds with

ζL = (‖β*‖1 + σ)³·(log n + log d)³/σ⁴.
Proof

See §C.11 for a detailed proof.

Equipped with Lemmas 4.12–4.14, we are now ready to establish the high dimensional inference results for mixture of regression model.

Theorem 4.15

For mixture of regression model, under Assumption 4.5, both (4.7) and (4.8) hold as n → ∞. Also, the proposed confidence interval for α* in (2.20) is valid.

Similar to the discussion that follows Theorem 4.11, we can make (4.6) in Assumption 4.5 more explicit by plugging in ζEM, ζG, ζT and ζL specified in Lemmas 4.12–4.14. Assuming all quantities except sw*, s*, d and n are absolute constants, we have that (4.6) holds if

max{sw*,s*}2·(s*)4·(log d)8=o[n/(log n)2].

In contrast, for high dimensional estimation, we only require (s*)2 · log d = o(n/log n) to ensure the consistency of β̂ by our discussion following Theorem 3.10.

5 Proof of Main Results

We lay out a proof sketch of the main theory. First we prove the results in Theorem 3.4 for parameter estimation and computation. Then we establish the results in Theorems 4.6 and 4.7 for inference.

5.1 Proof of Results for Computation and Estimation

Proof of Theorem 3.4

First we introduce some notations. Recall that the trunc(·, ·) function is defined in (2.7). We define β̅(t+0.5), β̅(t+1) ∈ ℝd as

β¯(t+0.5)=M(β(t)),β¯(t+1)=trunc(β¯(t+0.5),𝒮^(t+0.5)). (5.1)

As defined in (3.1) or (3.2), M(·) is the population version M-step with the maximization or gradient ascent implementation. Here 𝒮̂(t+0.5) denotes the set of indices j with the top ŝ largest values of |βj(t+0.5)|. It is worth noting that 𝒮̂(t+0.5) is calculated based on β(t+0.5) in the truncation step (line 6 of Algorithm 4), rather than based on β̅(t+0.5) defined in (5.1).

Our goal is to characterize the relationship between ‖β(t+1)β*‖2 and ‖β(t)β*‖2. According to the definition of the truncation step (line 6 of Algorithm 4) and triangle inequality, we have

‖β(t+1) − β*‖2 = ‖trunc(β(t+0.5), 𝒮̂(t+0.5)) − β*‖2 ≤ ‖trunc(β(t+0.5), 𝒮̂(t+0.5)) − trunc(β̅(t+0.5), 𝒮̂(t+0.5))‖2 + ‖trunc(β̅(t+0.5), 𝒮̂(t+0.5)) − β*‖2 = ‖trunc(β(t+0.5), 𝒮̂(t+0.5)) − trunc(β̅(t+0.5), 𝒮̂(t+0.5))‖2 (i) + ‖β̅(t+1) − β*‖2 (ii), (5.2)

where the last equality is obtained from (5.1). According to the definition of the trunc(·, ·) function in (2.7), the two terms within term (i) both have support 𝒮̂(t+0.5) with cardinality ŝ. Thus, we have

‖trunc(β(t+0.5), 𝒮̂(t+0.5)) − trunc(β̅(t+0.5), 𝒮̂(t+0.5))‖2 = ‖(β(t+0.5) − β̅(t+0.5))𝒮̂(t+0.5)‖2 ≤ √ŝ·‖(β(t+0.5) − β̅(t+0.5))𝒮̂(t+0.5)‖∞ ≤ √ŝ·‖β(t+0.5) − β̅(t+0.5)‖∞. (5.3)

Since we have β(t+0.5) = Mn(β(t)) and β̅(t+0.5) = M(β(t)), we can further establish an upper bound for the right-hand side by invoking Condition 3.3.

Our subsequent proof will establish an upper bound for term (ii) in (5.2) in two steps. We first characterize the relationship between ‖β̅(t+1)β*‖2 and ‖β̅(t+0.5)β*‖2 and then the relationship between ‖β̅(t+0.5)β*‖2 and ‖β(t)β*‖2. The next lemma accomplishes the first step. Recall that ŝ is the sparsity parameter in Algorithm 4, while s* is the sparsity level of the true parameter β*.

Lemma 5.1

Suppose that we have

‖β̅(t+0.5) − β*‖2 ≤ κ·‖β*‖2 (5.4)

for some κ ∈ (0, 1). Assuming that we have

ŝ ≥ 4·(1+κ)²/(1−κ)²·s*, and √ŝ·‖β(t+0.5) − β̅(t+0.5)‖∞ ≤ (1−κ)²/[2·(1+κ)]·‖β*‖2, (5.5)

then it holds that

‖β̅(t+1) − β*‖2 ≤ C·√s*/(1−κ)·‖β(t+0.5) − β̅(t+0.5)‖∞ + [1 + 4·√(s*/ŝ)]^{1/2}·‖β̅(t+0.5) − β*‖2. (5.6)
Proof

The proof is based on a fine-grained analysis of the relationship between 𝒮̂(t+0.5) and the true support 𝒮*. In particular, we focus on three index sets, namely, ℐ1 = 𝒮*\𝒮̂(t+0.5), ℐ2 = 𝒮* ∩ 𝒮̂(t+0.5) and ℐ3 = 𝒮̂(t+0.5)\𝒮*. Among them, ℐ2 characterizes the similarity between 𝒮̂(t+0.5) and 𝒮*, while ℐ1 and ℐ3 characterize their difference. The key proof strategy is to represent the three distances in (5.6) with the ℓ2-norms of the restrictions of β̅(t+0.5) and β* on the three index sets. In particular, we focus on ‖β̅ℐ1(t+0.5)‖2 and ‖βℐ1*‖2. In order to quantify these ℓ2-norms, we establish a fine-grained characterization for the absolute values of β̅(t+0.5)'s entries that are selected and eliminated within the truncation operation β̅(t+1) ← trunc(β̅(t+0.5), 𝒮̂(t+0.5)). See §B.1 for a detailed proof.

Lemma 5.1 is central to the proof of Theorem 3.4. In detail, the assumption in (5.4) guarantees β̅(t+0.5) is within the basin of attraction. In (5.5), the first assumption ensures the sparsity parameter ŝ in Algorithm 4 is set to be sufficiently large, while the second ensures the statistical error is sufficiently small. These assumptions will be verified in the proof of Theorem 3.4. The intuition behind (5.6) is as follows. Let 𝒮̅(t+0.5) = supp(β̅(t+0.5), ŝ), where supp(·, ·) is defined in (2.6). By triangle inequality, the left-hand side of (5.6) satisfies

‖β̅(t+1) − β*‖2 ≤ ‖β̅(t+1) − trunc(β̅(t+0.5), 𝒮̅(t+0.5))‖2 (i) + ‖trunc(β̅(t+0.5), 𝒮̅(t+0.5)) − β*‖2 (ii). (5.7)

Intuitively, the two terms on right-hand side of (5.6) reflect terms (i) and (ii) in (5.7) correspondingly. In detail, for term (i) in (5.7), recall that according to (5.1) and line 6 of Algorithm 4 we have

β¯(t+1)=trunc(β¯(t+0.5),𝒮^(t+0.5)), where 𝒮^(t+0.5)=supp(β(t+0.5),ŝ).

As the sample size n is sufficiently large, β̅(t+0.5) and β(t+0.5) are close, since they are attained by the population version and sample version M-steps correspondingly. Hence, 𝒮̅(t+0.5) = supp(β̅(t+0.5), ŝ) and 𝒮̂(t+0.5) = supp(β(t+0.5), ŝ) should be similar. Thus, in term (i), β̅(t+1) = trunc(β̅(t+0.5), 𝒮̂(t+0.5)) should be close to trunc(β̅(t+0.5), 𝒮̅(t+0.5)) up to some statistical error, which is reflected by the first term on the right-hand side of (5.6).

Also, we turn to quantify the relationship between ‖β̅(t+0.5)β*‖2 in (5.6) and term (ii) in (5.7). The truncation in term (ii) preserves the top ŝ coordinates of β̅(t+0.5) with the largest magnitudes while setting others to zero. Intuitively speaking, the truncation incurs additional error to β̅(t+0.5)’s distance to β*. Meanwhile, note that when β̅(t+0.5) is close to β*, 𝒮̅(t+0.5) is similar to 𝒮*. Therefore, the incurred error can be controlled, because the truncation doesn’t eliminate many relevant entries. In particular, as shown in the second term on the right-hand side of (5.6), such incurred error decays as ŝ increases, since in this case 𝒮̂(t+0.5) includes more entries. According to the discussion for term (i) in (5.7), 𝒮̅(t+0.5) is similar to 𝒮̂(t+0.5), which implies that 𝒮̅(t+0.5) should also cover more entries. Thus, fewer relevant entries are wrongly eliminated by the truncation and hence the incurred error is smaller. The extreme case is that, when ŝ → ∞, term (ii) in (5.7) becomes ‖β̅(t+0.5)β*‖2, which indicates that no additional error is incurred by the truncation. Correspondingly, the second term on the right-hand side of (5.6) also becomes ‖β̅(t+0.5)β*‖2.

Next, we turn to characterize the relationship between ‖β̅(t+0.5)β*‖2 and ‖β(t)β*‖2. Recall β̅(t+0.5) = M(β(t)) is defined in (5.1). The next lemma, which is adapted from Theorems 1 and 3 in Balakrishnan et al. (2014), characterizes the contraction property of the population version M-step defined in (3.1) or (3.2).

Lemma 5.2

Under the assumptions of Theorem 3.4, the following results hold for β(t) ∈ ℬ.

  • For the maximization implementation of the M-step (Algorithm 2), we have
    ‖β̅(t+0.5) − β*‖2 ≤ (γ1/ν)·‖β(t) − β*‖2. (5.8)
  • For the gradient ascent implementation of the M-step (Algorithm 3), we have
    ‖β̅(t+0.5) − β*‖2 ≤ [1 − 2·(ν − γ2)/(ν + μ)]·‖β(t) − β*‖2. (5.9)

Here γ1, γ2, μ and ν are defined in Conditions 3.1 and 3.2.

Proof

The proof strategy is to first characterize the M-step using Q(·; β*). According to Condition Concavity-Smoothness(μ, ν, ℬ), −Q(·; β*) is ν-strongly convex and μ-smooth, and thus enjoys desired optimization guarantees. Then Condition Lipschitz-Gradient-1(γ1, ℬ) or Lipschitz-Gradient-2(γ2, ℬ) is invoked to characterize the difference between Q(·; β*) and Q(·; β(t)). We provide the proof in §B.2 for the sake of completeness.

Equipped with Lemmas 5.1 and 5.2, we are now ready to prove Theorem 3.4.

Proof

To unify the subsequent proof for the maximization and gradient implementations of the M-step, we employ ρ ∈ (0, 1) to denote γ1/ν in (5.8) or 1 – 2·(ν − γ2)/(ν + μ) in (5.9). By the definitions of β̅(t+0.5) and β(t+0.5) in (5.1) and Algorithm 4, Condition Statistical-Error(ε, δ/T, ŝ, n/T, ℬ) implies

‖β(t+0.5) − β̅(t+0.5)‖∞ = ‖Mn/T(β(t)) − M(β(t))‖∞ ≤ ε

holds with probability at least 1 − δ/T. Then by taking union bound we have that, the event

ℰ = {‖β(t+0.5) − β̅(t+0.5)‖∞ ≤ ε, for all t ∈ {0, …, T − 1}} (5.10)

occurs with probability at least 1 − δ. Conditioning on ℰ, in the following we prove that

‖β(t) − β*‖2 ≤ (√ŝ + C/(1−κ)·√s*)·ε/(1 − √ρ) + ρ^{t/2}·‖β(0) − β*‖2, for all t ∈ {1, …, T} (5.11)

by mathematical induction.

Before we lay out the proof, we first prove β(0) ∈ ℬ. Recall βinit is the initialization of Algorithm 4 and R is the radius of the basin of attraction ℬ. By the assumption of Theorem 3.4, we have

‖βinit − β*‖2 ≤ R/2. (5.12)

Therefore, (5.12) implies ‖βinitβ*‖2 < κ·‖β*‖2 since R = κ·‖β*‖2. Invoking the auxiliary result in Lemma B.1, we obtain

‖β(0) − β*‖2 ≤ [1 + 4·√(s*/ŝ)]^{1/2}·‖βinit − β*‖2 ≤ [1 + 4·√(1/4)]^{1/2}·R/2 < R. (5.13)

Here the second inequality is from (5.12) as well as the assumption in (3.8) or (3.11), which implies s*/ŝ ≤ (1 − κ)2/[4·(1+κ)2] ≤ 1/4. Thus, (5.13) implies β(0) ∈ ℬ. In the sequel, we prove that (5.11) holds for t = 1. By invoking Lemma 5.2 and setting t = 0 in (5.8) or (5.9), we obtain

‖β̅(0.5) − β*‖2 ≤ ρ·‖β(0) − β*‖2 < ρ·R < R = κ·‖β*‖2,

where the second inequality is from (5.13). Hence, the assumption in (5.4) of Lemma 5.1 holds for β̅(0.5). Furthermore, by the assumptions in (3.8), (3.9), (3.11) and (3.12) of Theorem 3.4, we can also verify that the assumptions in (5.5) of Lemma 5.1 hold conditioning on the event ℰ defined in (5.10). By invoking Lemma 5.1 we have that (5.6) holds for t = 0. Further plugging ‖β(t+0.5)β̅(t+0.5) ≤ ε in (5.10) into (5.6) with t = 0, we obtain

‖β̅(1) − β*‖2 ≤ C·√s*/(1−κ)·ε + [1 + 4·√(s*/ŝ)]^{1/2}·‖β̅(0.5) − β*‖2. (5.14)

Setting t = 0 in (5.8) or (5.9) of Lemma 5.2 and then plugging (5.8) or (5.9) into (5.14), we obtain

‖β̅(1) − β*‖2 ≤ C·√s*/(1−κ)·ε + [1 + 4·√(s*/ŝ)]^{1/2}·ρ·‖β(0) − β*‖2. (5.15)

For t = 0, plugging (5.3) into term (i) in (5.2), and (5.15) into term (ii) in (5.2), and then applying ‖β(t+0.5)β̅(t+0.5) ≤ ε with t = 0 in (5.10), we obtain

‖β(1) − β*‖2 ≤ √ŝ·‖β(0.5) − β̅(0.5)‖∞ + C·√s*/(1−κ)·ε + [1 + 4·√(s*/ŝ)]^{1/2}·ρ·‖β(0) − β*‖2 ≤ (√ŝ + C/(1−κ)·√s*)·ε + [1 + 4·√(s*/ŝ)]^{1/2}·ρ·‖β(0) − β*‖2. (5.16)

By our assumption that ŝ ≥ 16·(1/ρ − 1)⁻²·s* in (3.8) or (3.11), we have [1 + 4·√(s*/ŝ)]^{1/2} ≤ 1/√ρ in (5.16). Hence, from (5.16) we obtain

‖β(1) − β*‖2 ≤ (√ŝ + C/(1−κ)·√s*)·ε + √ρ·‖β(0) − β*‖2, (5.17)

which implies that (5.11) holds for t = 1, since we have 1 − √ρ < 1 in (5.11).

Suppose we have that (5.11) holds for some t ≥ 1. By (5.11) we have

‖β(t) − β*‖2 ≤ (√ŝ + C/(1−κ)·√s*)·ε/(1 − √ρ) + ρ^{t/2}·‖β(0) − β*‖2 ≤ (1 − √ρ)·R + √ρ·R = R, (5.18)

where the second inequality is from (5.13) and our assumption (√ŝ + C/(1−κ)·√s*)·ε ≤ (1 − √ρ)²·R in (3.9) or (3.12). Therefore, by (5.18) we have β(t) ∈ ℬ. Then (5.8) or (5.9) in Lemma 5.2 implies

‖β̅(t+0.5) − β*‖2 ≤ ρ·‖β(t) − β*‖2 ≤ ρ·R < R = κ·‖β*‖2,

where the third inequality is from ρ ∈ (0, 1). Following the same proof for (5.17), we obtain

‖β(t+1) − β*‖2 ≤ (√ŝ + C/(1−κ)·√s*)·ε + √ρ·‖β(t) − β*‖2 ≤ [1 + √ρ/(1 − √ρ)]·(√ŝ + C/(1−κ)·√s*)·ε + √ρ·ρ^{t/2}·‖β(0) − β*‖2 = (√ŝ + C/(1−κ)·√s*)·ε/(1 − √ρ) + ρ^{(t+1)/2}·‖β(0) − β*‖2.

Here the second inequality is obtained by plugging in (5.11) for t. Hence we have that (5.11) holds for t + 1. By induction, we conclude that (5.11) holds conditioning on the event ℰ defined in (5.10), which occurs with probability at least 1 − δ. By plugging the specific definitions of ρ into (5.11), and applying ‖β(0)β*‖2R in (5.13) to the right-hand side of (5.11), we obtain (3.10) and (3.13).

5.2 Proof of Results for Inference

Proof of Theorem 4.6

We establish the asymptotic normality of the decorrelated score statistic defined in (2.11) in two steps. We first prove the asymptotic normality of √n·Sn(β̂0, λ), where β̂0 is defined in (2.11) and Sn(·, ·) is defined in (2.9). Then we prove that −[Tn(β̂0)]α|γ defined in (2.11) is a consistent estimator of the asymptotic variance of √n·Sn(β̂0, λ). The next lemma accomplishes the first step. Recall that I(β*) = −𝔼β*[∇²ℓn(β*)]/n is the Fisher information for ℓn(β*) defined in (2.2).

Lemma 5.3

Under the assumptions of Theorem 4.6, we have that for n → ∞,

√n·Sn(β̂0, λ) →D N(0, [I(β*)]α|γ),

where [I(β*)]α|γ is defined in (4.2).

Proof

Our proof consists of two steps. Note that by the definition in (2.9) we have

√n·Sn(β̂0, λ) = √n·[∇1Qn(β̂0; β̂0)]α − √n·w(β̂0, λ)⊤·[∇1Qn(β̂0; β̂0)]γ. (5.19)

Recall that w*=[I(β*)]γ,γ1·[I(β*)]γ,α is defined in (4.1). At the first step, we prove

√n·Sn(β̂0, λ) = √n·[∇1Qn(β*; β*)]α − √n·(w*)⊤·[∇1Qn(β*; β*)]γ + o(1). (5.20)

In other words, replacing β̂0 and w(β̂0, λ) in (5.19) with the corresponding population quantities β* and w* only introduces an o(1) error term. Meanwhile, by Lemma 2.1 we have ∇1Qn(β*; β*) = ∇ℓn(β*)/n. Recall that ℓn(·) is the log-likelihood defined in (2.2), which implies that in (5.20)

√n·[∇1Qn(β*; β*)]α − √n·(w*)⊤·[∇1Qn(β*; β*)]γ = √n·[1, −(w*)⊤]·∇ℓn(β*)/n

is a (rescaled) average of n i.i.d. random variables. At the second step, we calculate the mean and variance of each term within this average and invoke the central limit theorem. Finally we combine these two steps by invoking Slutsky’s theorem. See §C.3 for a detailed proof.

The next lemma establishes the consistency of −[Tn(β̂0)]α|γ for estimating [I(β*)]α|γ. Recall that [Tn(β̂0)]α|γ ∈ ℝ and [I(β*)]α|γ ∈ ℝ are defined in (2.11) and (4.2) respectively.

Lemma 5.4

Under the assumptions of Theorem 4.6, we have

[Tn(β^0)]α|γ+[I(β*)]α|γ=o(1). (5.21)
Proof

For notational simplicity, we abbreviate w(β̂0, λ) in the definition of [Tn(β̂0)]α|γ as ŵ0. By (2.11) and (4.3), we have

[Tn(β̂0)]α|γ = [1, −ŵ0⊤]·Tn(β̂0)·[1, −ŵ0⊤]⊤,  [I(β*)]α|γ = [1, −(w*)⊤]·I(β*)·[1, −(w*)⊤]⊤.

First, we establish the relationship between ŵ0 and w* by analyzing the Dantzig selector in (2.10). Meanwhile, by Lemma 2.1 we have 𝔼β* [Tn(β*)] = −I(β*). Then by triangle inequality we have

‖Tn(β̂0) + I(β*)‖∞,∞ ≤ ‖Tn(β̂0) − Tn(β*)‖∞,∞ (i) + ‖Tn(β*) − 𝔼β*[Tn(β*)]‖∞,∞ (ii).

We prove term (i) is o(1) by quantifying the Lipschitz continuity of Tn(·) using Condition 4.4. We then prove term (ii) is o(1) by concentration analysis. Together with the result on the relationship between ŵ0 and w* we establish (5.21). See §C.4 for a detailed proof.

Combining Lemmas 5.3 and 5.4 using Slutsky’s theorem, we obtain Theorem 4.6.

Proof of Theorem 4.7

Similar to the proof of Theorem 4.6, we first prove that √n·[α̅(β̂, λ) − α*] is asymptotically normal, where α̅(β̂, λ) is defined in (2.18), while β̂ is the estimator attained by the high dimensional EM algorithm (Algorithm 4). By Lemma 5.4, we then show that −{[Tn(β̂)]α|γ}⁻¹ is a consistent estimator of the asymptotic variance of √n·[α̅(β̂, λ) − α*]. Here recall that [Tn(β̂)]α|γ is defined by replacing β̂0 with β̂ in (2.11). The following lemma accomplishes the first step.

Lemma 5.5

Under the assumptions of Theorem 4.7, we have that for n → ∞,

√n·[α̅(β̂, λ) − α*] →D N(0, [I(β*)]α|γ⁻¹), (5.22)

where [I(β*)]α|γ > 0 is defined in (4.2).

Proof

Our proof consists of two steps. Similar to the proof of Lemma 5.3, we first prove

√n·[α̅(β̂, λ) − α*] = √n·[I(β*)]α|γ⁻¹·{[∇1Qn(β*; β*)]α − (w*)⊤·[∇1Qn(β*; β*)]γ} + o(1).

Then we prove that, as n → ∞,

√n·[I(β*)]α|γ⁻¹·{[∇1Qn(β*; β*)]α − (w*)⊤·[∇1Qn(β*; β*)]γ} →D N(0, [I(β*)]α|γ⁻¹).

Combining the two steps by Slutsky’s theorem, we obtain (5.22). See §C.5 for a detailed proof.

Now we accomplish the second step for proving Theorem 4.7. Note that by replacing β̂0 with β̂ in the proof of Lemma 5.4, we have

[Tn(β^)]α|γ+[I(β*)]α|γ=o(1). (5.23)

Meanwhile, by the definitions of w* and [I(β*)]α|γ in (4.1) and (4.2), we have

[I(β*)]α|γ = [I(β*)]α,α − [I(β*)]γ,α⊤·[I(β*)]γ,γ⁻¹·[I(β*)]γ,α = [I(β*)]α,α − 2·(w*)⊤·[I(β*)]γ,α + (w*)⊤·[I(β*)]γ,γ·w* = [1, −(w*)⊤]·I(β*)·[1, −(w*)⊤]⊤. (5.24)

Note that (4.4) in Assumption 4.5 implies that I(β*) is positive definite, which yields [I(β*)]α|γ > 0 by (5.24). Also, (4.4) gives [I(β*)]α|γ = O(1) and [I(β*)]α|γ1=O(1). Hence, by (5.23) we have

−{[Tn(β̂)]α|γ}⁻¹ = [I(β*)]α|γ⁻¹ + o(1). (5.25)

Combining (5.22) in Lemma 5.5 and (5.25) by invoking Slutsky’s theorem, we obtain Theorem 4.7.

6 Numerical Results

In this section, we lay out numerical results illustrating the computational and statistical efficiency of the methods proposed in §2. In §6.1 and §6.2, we present the results for parameter estimation and asymptotic inference respectively. Throughout §6, we focus on the high dimensional EM algorithm without resampling, which is illustrated in Algorithm 1.

6.1 Computation and Estimation

We empirically study two latent variable models: Gaussian mixture model and mixture of regression model. The first synthetic dataset is generated according to the Gaussian mixture model defined in (2.24), and the second according to the mixture of regression model defined in (2.25). We set d = 256 and s* = 5. In particular, the first 5 entries of β* are set to {4, 4, 4, 6, 6}, while the remaining 251 entries are set to zero. For Gaussian mixture model, the standard deviation of V in (2.24) is set to σ = 1. For mixture of regression model, the standard deviation of the Gaussian noise V in (2.25) is set to σ = 0.1. For both datasets, we generate n = 100 data points.
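The following is a minimal sketch of the data-generating processes just described, assuming (2.24) is the symmetric two-component mixture Y = Z·β* + V with Z uniform on {−1, +1} and V ~ N(0, σ²·Id), and (2.25) is Y = Z·⟨β*, X⟩ + V with X ~ N(0, Id); this is the structure implied by the E-steps (A.1) and (A.3), and the helper names are illustrative.

import numpy as np

def make_beta(d=256, s=5):
    beta = np.zeros(d)
    beta[:s] = [4.0, 4.0, 4.0, 6.0, 6.0]     # first s* entries; the rest stay zero
    return beta

def sample_gmm(n, beta, sigma=1.0, rng=None):
    # Gaussian mixture model: y = z * beta + v, z uniform on {-1, +1}, v ~ N(0, sigma^2 I_d)
    rng = rng or np.random.default_rng(0)
    z = rng.choice([-1.0, 1.0], size=(n, 1))
    return z * beta + sigma * rng.standard_normal((n, beta.size))

def sample_mr(n, beta, sigma=0.1, rng=None):
    # Mixture of regression: y = z * <beta, x> + v with x ~ N(0, I_d)
    rng = rng or np.random.default_rng(0)
    X = rng.standard_normal((n, beta.size))
    z = rng.choice([-1.0, 1.0], size=n)
    y = z * (X @ beta) + sigma * rng.standard_normal(n)
    return X, y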

We apply the proposed high dimensional EM algorithm to both datasets. In particular, we apply the maximization implementation of the M-step (Algorithm 2) to Gaussian mixture model, and the gradient ascent implementation of the M-step (Algorithm 3) to mixture of regression model. We set the stepsize in Algorithm 3 to η = 1. In Figure 1, we illustrate the evolution of the optimization error ‖β(t) − β̂‖2 and the overall estimation error ‖β(t) − β*‖2 with respect to the iteration index t. Here β̂ is the final estimator and β* is the true parameter.

Figure 1.


Evolution of the optimization error ‖β(t) − β̂‖2 and overall estimation error ‖β(t) − β*‖2 (in logarithmic scale). (a) The high dimensional EM algorithm for Gaussian mixture model (with the maximization implementation of the M-step). (b) The high dimensional EM algorithm for mixture of regression model (with the gradient ascent implementation of the M-step). Here note that in (a) the optimization error for t = 5, …, 10 is truncated due to arithmetic underflow.

Figure 1 illustrates the geometric decay of the optimization error, as predicted by Theorem 3.4. As the optimization error decreases to zero, β(t) converges to the final estimator β̂ and the overall estimation error ‖β(t) − β*‖2 converges to the statistical error ‖β̂ − β*‖2. In Figure 2, we illustrate the statistical rate of convergence of ‖β̂ − β*‖2. In particular, we plot ‖β̂ − β*‖2 against √(s*·log d/n) with varying s* and n. Figure 2 illustrates that the statistical error exhibits a linear dependency on √(s*·log d/n). Hence, the proposed high dimensional EM algorithm empirically achieves an estimator with the optimal √(s*·log d/n) statistical rate of convergence.

Figure 2.


Statistical error ‖β̂ − β*‖2 plotted against √(s*·log d/n) with fixed d = 128 and varying s* and n. (a) The high dimensional EM algorithm for Gaussian mixture model (with the maximization implementation of the M-step). (b) The high dimensional EM algorithm for mixture of regression model (with the gradient ascent implementation of the M-step).

6.2 Asymptotic Inference

To examine the validity of the proposed hypothesis testing procedures, we consider the same setting as in §6.1. Recall that β* = [4, 4, 4, 6, 6, 0, …, 0] ∈ ℝ256. We consider the null hypothesis H0: β*10 = 0. We construct the decorrelated score and Wald statistics based on the estimator obtained in the previous section, and fix the significance level at δ = 0.05. We repeat the experiment 500 times and report the rejection rate as the empirical type I error. Table 1 shows that the type I errors achieved by both procedures are close to the nominal significance level, which supports the validity of the proposed tests.
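The Monte Carlo loop behind Table 1 can be sketched as follows. Here simulate, fit and test are user-supplied callables standing in for the data-generating process under H0, the high dimensional EM algorithm (Algorithm 1), and the decorrelated score (or Wald) test; the snippet is a template under these assumptions rather than the exact experimental code.

import numpy as np

def type_one_error(simulate, fit, test, n_rep=500, level=0.05, seed=0):
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_rep):
        data = simulate(rng)          # fresh dataset generated under the null H0
        beta_hat = fit(data)          # high dimensional EM estimate
        pval = test(beta_hat, data)   # p-value for H0: alpha* = 0
        rejections += (pval < level)
    return rejections / n_rep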

Table 1.

Type I errors of the decorrelated Score and Wald tests

                          Gaussian Mixture Model   Mixture of Regression
Decorrelated Score Test          0.052                    0.050
Decorrelated Wald Test           0.049                    0.049

7 Conclusion

We propose a novel high dimensional EM algorithm which naturally incorporates sparsity structure. Our theory shows that, with a suitable initialization, the proposed algorithm converges at a geometric rate and achieves an estimator with the (near-)optimal statistical rate of convergence. Beyond point estimation, we further propose the decorrelated score and Wald statistics for testing hypotheses and constructing confidence intervals for low dimensional components of high dimensional parameters. We apply the proposed algorithmic framework to a broad family of high dimensional latent variable models. For these models, our framework establishes the first computationally feasible approach for optimal parameter estimation and asymptotic inference under high dimensional settings. Thorough numerical simulations are provided to back up our theory.

Acknowledgments

The authors are grateful for the support of NSF CAREER Award DMS1454377, NSF IIS1408910, NSF IIS1332109, NIH R01MH102339, NIH R01GM083084, and NIH R01HG06841.

Appendix

A Applications to Latent Variable Models

We provide the specific forms of Qn(·; ·), Mn(·) and Tn(·) for the models defined in §2.3. Recall that Qn(·; ·) is defined in (2.5), Mn(·) is defined in Algorithms 2 and 3, and Tn(·) is defined in (2.8).

A.1 Gaussian Mixture Model

Let y1, …, yn be the n realizations of Y. For the E-step (line 4 of Algorithm 1), we have

Qn(β′; β) = −1/(2n)·∑i=1n {ωβ(yi)·‖yi − β′‖2² + [1 − ωβ(yi)]·‖yi + β′‖2²}, where ωβ(y) = 1/[1 + exp(−⟨β, y⟩/σ²)]. (A.1)

The maximization implementation (Algorithm 2) of the M-step takes the form

Mn(β) = 2/n·∑i=1n ωβ(yi)·yi − 1/n·∑i=1n yi. (A.2)

Meanwhile, for the gradient ascent implementation (Algorithm 3) of the M-step, we have

Mn(β) = β + η·∇1Qn(β; β), where ∇1Qn(β; β) = 1/n·∑i=1n [2·ωβ(yi) − 1]·yi − β.

Here η > 0 is the stepsize. For asymptotic inference, Tn(·) in (2.8) takes the form

Tn(β) = 1/n·∑i=1n νβ(yi)·yi·yi⊤ − Id, where νβ(y) = (4/σ²)/{[1 + exp(−2·⟨β, y⟩/σ²)]·[1 + exp(2·⟨β, y⟩/σ²)]}.
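For concreteness, one iteration of the high dimensional EM update for this model can be sketched as follows, combining the weights in (A.1), the maximization M-step (A.2), and the truncation step of Algorithm 4; the NumPy layout (rows of Y are the observations yi) and the function name are illustrative assumptions.

import numpy as np

def em_step_gmm(beta, Y, sigma, s_hat):
    # E-step weights from (A.1): omega_beta(y) = 1 / (1 + exp(-<beta, y> / sigma^2))
    omega = 1.0 / (1.0 + np.exp(-(Y @ beta) / sigma**2))
    # Maximization M-step (A.2): M_n(beta) = (2/n) sum_i omega_i y_i - (1/n) sum_i y_i
    beta_half = 2.0 * (omega[:, None] * Y).mean(axis=0) - Y.mean(axis=0)
    # Truncation step (Algorithm 4, line 6): keep the top s_hat entries in magnitude
    S = np.argsort(-np.abs(beta_half))[:s_hat]
    beta_new = np.zeros_like(beta_half)
    beta_new[S] = beta_half[S]
    return beta_new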

A.2 Mixture of Regression Model

Let y1, …, yn and x1, …, xn be the n realizations of Y and X. For the E-step (line 4 of Algorithm 1), we have

Qn(β′; β) = −1/(2n)·∑i=1n {ωβ(xi, yi)·(yi − ⟨xi, β′⟩)² + [1 − ωβ(xi, yi)]·(yi + ⟨xi, β′⟩)²}, where ωβ(x, y) = 1/[1 + exp(−y·⟨β, x⟩/σ²)]. (A.3)

For the maximization implementation (Algorithm 2) of the M-step (line 5 of Algorithm 1), we have that Mn(β) = argmaxβ Qn(β′; β) satisfies

Σ̂·Mn(β) = 1/n·∑i=1n [2·ωβ(xi, yi) − 1]·yi·xi, where Σ̂ = 1/n·∑i=1n xi·xi⊤. (A.4)

However, in high dimensional regimes, the sample covariance matrix Σ̂ is not invertible. To estimate the inverse covariance matrix of X, we use the CLIME estimator proposed by Cai et al. (2011), i.e.,

Θ̂ = argminΘ∈ℝd×d ‖Θ‖1,1, subject to ‖Σ̂·Θ − Id‖∞,∞ ≤ λCLIME, (A.5)

where ‖·‖1,1 and ‖·‖∞,∞ are the sum and maximum of the absolute values of all entries respectively, and λCLIME > 0 is a tuning parameter. Based on (A.4), we modify the maximization implementation of the M-step to be

Mn(β) = Θ̂·1/n·∑i=1n [2·ωβ(xi, yi) − 1]·yi·xi. (A.6)

For the gradient ascent implementation (Algorithm 3) of the M-step, we have

Mn(β) = β + η·∇1Qn(β; β), where ∇1Qn(β; β) = 1/n·∑i=1n [2·ωβ(xi, yi)·yi·xi − xi·xi⊤·β]. (A.7)

Here η > 0 is the stepsize. For asymptotic inference, Tn(·) in (2.8) takes the form

Tn(β) = 1/n·∑i=1n νβ(xi, yi)·xi·xi⊤·yi² − 1/n·∑i=1n xi·xi⊤, where νβ(x, y) = (4/σ²)/{[1 + exp(−2·y·⟨β, x⟩/σ²)]·[1 + exp(2·y·⟨β, x⟩/σ²)]}.

It is worth noting that, for the maximization implementation of the M-step, the CLIME estimator in (A.5) requires that Σ−1 is sparse, where Σ is the population covariance of X. Since we assume X ~ N (0, Id), this requirement is satisfied. Nevertheless, for more general settings where Σ doesn’t possess such a structure, the gradient ascent implementation of the M-step is a better choice, since it doesn’t require inverse covariance estimation and is also more efficient in computation.
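As a sketch, the gradient ascent update (A.7) together with the truncation step of Algorithm 4 can be written as follows; the function name and the NumPy layout (rows of X are the covariate vectors xi) are illustrative assumptions rather than the paper's implementation.

import numpy as np

def em_step_mr_grad(beta, X, y, sigma, s_hat, eta):
    n = len(y)
    # E-step weights from (A.3): omega_beta(x, y) = 1 / (1 + exp(-y <beta, x> / sigma^2))
    omega = 1.0 / (1.0 + np.exp(-y * (X @ beta) / sigma**2))
    # Gradient from (A.7): (1/n) sum_i [2 omega_i y_i x_i - x_i x_i^T beta]
    grad = ((2.0 * omega * y)[:, None] * X).mean(axis=0) - X.T @ (X @ beta) / n
    beta_half = beta + eta * grad                 # gradient ascent M-step (Algorithm 3)
    S = np.argsort(-np.abs(beta_half))[:s_hat]    # truncation (Algorithm 4, line 6)
    beta_new = np.zeros_like(beta_half)
    beta_new[S] = beta_half[S]
    return beta_new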

A.3 Regression with Missing Covariates

For notational simplicity, we define zi ∈ ℝd (i = 1, …, n) as the vector with zi,j = 1 if xi,j is observed and zi,j = 0 if xi,j is missing. Let xiobs be the subvector corresponding to the observed component of xi. For the E-step (line 4 of Algorithm 1), we have

Qn(β′; β) = 1/n·∑i=1n {yi·(β′)⊤·mβ(xiobs, yi) − 1/2·(β′)⊤·Kβ(xiobs, yi)·β′}. (A.8)

Here mβ(·, ·) ∈ ℝd and Kβ(·, ·) ∈ ℝd×d are defined as

mβ(xiobs, yi) = zi ⊙ xi + [yi − ⟨β, zi ⊙ xi⟩]/[σ² + ‖(1 − zi) ⊙ β‖2²]·(1 − zi) ⊙ β, (A.9)
Kβ(xiobs, yi) = diag(1 − zi) + mβ(xiobs, yi)·[mβ(xiobs, yi)]⊤ − [(1 − zi) ⊙ mβ(xiobs, yi)]·[(1 − zi) ⊙ mβ(xiobs, yi)]⊤, (A.10)

where ⊙ denotes the Hadamard product and diag(1 − zi) is defined as the d×d diagonal matrix with the entries of 1 − zi ∈ ℝd on its diagonal. Note that maximizing Qn(β′; β) over β′ requires inverting Kβ(xiobs, yi), which may not be invertible in high dimensions. Thus, we consider the gradient ascent implementation (Algorithm 3) of the M-step (line 5 of Algorithm 1), in which we have

Mn(β) = β + η·∇1Qn(β; β), where ∇1Qn(β; β) = 1/n·∑i=1n [yi·mβ(xiobs, yi) − Kβ(xiobs, yi)·β]. (A.11)

Here η > 0 is the stepsize. For asymptotic inference, we can similarly calculate Tn(·) according to its definition in (2.8). However, here we omit the detailed form of Tn(·) since it is overly complicated.
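Although we omit Tn(·), the E-step quantities (A.9)-(A.10) and the gradient in (A.11) are straightforward to evaluate. The sketch below assumes missing entries of each xi are stored as zeros and zi is the corresponding 0/1 observation indicator; the loop-based NumPy implementation is illustrative.

import numpy as np

def grad_Q_missing(beta, X, Z, y, sigma):
    # X: n x d covariates with missing entries set to 0; Z: n x d indicators (1 = observed)
    grad = np.zeros(X.shape[1])
    for x_i, z_i, y_i in zip(X, Z, y):
        zx = z_i * x_i                                   # z_i ⊙ x_i
        miss_beta = (1.0 - z_i) * beta                   # (1 - z_i) ⊙ beta
        scale = (y_i - beta @ zx) / (sigma**2 + miss_beta @ miss_beta)
        m = zx + scale * miss_beta                       # m_beta, cf. (A.9)
        K = (np.diag(1.0 - z_i) + np.outer(m, m)
             - np.outer((1.0 - z_i) * m, (1.0 - z_i) * m))   # K_beta, cf. (A.10)
        grad += y_i * m - K @ beta                       # summand of grad_1 Q_n in (A.11)
    return grad / len(y)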

B Proof of Results for Computation and Estimation

We provide the detailed proof of the main results in §3 for computation and parameter estimation. We first lay out the proof for the general framework, and then the proof for specific models.

B.1 Proof of Lemma 5.1

Proof

Recall β̅(t+0.5) and β̅(t+1) are defined in (5.1). Note that in (5.4) of Lemma 5.1 we assume

β¯(t+0.5)β*2κ·β*2, (B.1)

which implies

(1κ)·β*2β¯(t+0.5)2(1+κ)·β*2. (B.2)

For notational simplicity, we define

θ¯=β¯(t+0.5)/β¯(t+0.5)2,θ=β(t+0.5)/β¯(t+0.5)2, and θ*=β*/β*2. (B.3)

Note that θ̅ and θ* are unit vectors, while θ is not, since it is obtained by normalizing β(t+0.5) with ‖β̅(t+0.5)2. Recall that the supp(·, ·) function is defined in (2.6). Hence we have

supp(θ*)=supp(β*)=𝒮*, and supp(θ,ŝ)=supp(β(t+0.5),ŝ)=𝒮^(t+0.5), (B.4)

where the last equality follows from line 6 of Algorithm 4. To ease the notation, we define

1=𝒮*\𝒮^(t+0.5),2=𝒮*𝒮^(t+0.5), and 3=𝒮^(t+0.5)\𝒮*. (B.5)

Let s1 = |ℐ1|, s2 = |ℐ2| and s3 = |ℐ3| correspondingly. Also, we define Δ = 〈θ̅, θ*〉. Note that

Δ=θ¯,θ*=j𝒮*θ¯j·θj*=j1θ¯j·θj*+j2θ¯j·θj*θ¯12·θ1*2+θ¯22·θ2*2. (B.6)

Here the first equality is from supp(θ*) = 𝒮*, the second equality is from (B.5) and the last inequality is from Cauchy-Schwarz inequality. Furthermore, from (B.6) we have

Δ2(θ¯12·θ1*2+θ¯22·θ2*2)2θ¯122·(θ1*22+θ2*22)+θ¯222·(θ1*22+θ2*22)=θ¯122+θ¯2221θ¯322. (B.7)

To obtain the second inequality, we expand the square and apply 2aba2 + b2. In the equality and the last inequality of (B.7), we use the fact that θ* and θ̅ are both unit vectors.

By (2.6) and (B.4), 𝒮̂(t+0.5) contains the index j’s with the top ŝ largest |βj(t+0.5)|s. Therefore, we have

β3(t+0.5)22s3=j3(βj(t+0.5))2s3j1(βj(t+0.5))2s1=β1(t+0.5)22s1, (B.8)

because from (B.5) we have ℐ3 ⊆ 𝒮̂(t+0.5) and ℐ1 ∩ 𝒮̂(t+0.5) = ∅. Taking square roots of both sides of (B.8) and then dividing them by ‖β̅(t+0.5)2 (which is nonzero according to (B.2)), by the definition of θ in (B.3) we obtain

θ32s3θ12s1. (B.9)

Equipped with (B.9), we now quantify the relationship between ‖θ̅32 and ‖θ̅12. For notational simplicity, let

ε˜=2·θ¯θ=2·β¯(t+0.5)β(t+0.5)/β¯(t+0.5)2. (B.10)

Note that we have

max{θ3θ¯32s3,θ1θ¯12s1}max{θ3θ¯3,θ1θ¯1}θ¯θ=ε˜/2,

which implies

θ¯32s3θ32s3θ3θ¯32s3θ12s1θ3θ¯32s3θ¯12s1θ¯1θ12s1θ3θ¯32s3θ¯12s1ε˜, (B.11)

where second inequality is obtained from (B.9), while the first and third are from triangle inequality. Plugging (B.11) into (B.7), we obtain

Δ21θ¯3221(s3/s1·θ¯12s3·ε˜)2.

Since by definition we have Δ = 〈θ̅, θ*〉 ∈ [−1, 1], solving for ‖θ̅12 in the above inequality yields

θ¯12s1/s3·1Δ2+s1·ε˜s*/ŝ·1Δ2+s*·ε˜. (B.12)

Here we employ the fact that s1s* and s1/s3 ≤ (s1 + s2)/(s3 + s2) = s*/ŝ, which follows from (B.5) and our assumption in (5.5) that s*/ŝ ≤ (1 − κ)2/[4·(1 + κ)2] < 1.

In the following, we prove that the right-hand side of (B.12) is upper bounded by Δ, i.e.,

s*/ŝ·1Δ2+s*·ε˜Δ. (B.13)

We can verify that a sufficient condition for (B.13) to hold is that

Δs*·ε˜+[s*·ε˜2(s*/ŝ+1)·(s*·ε˜2s*/ŝ)]1/2s*/ŝ+1=s*·ε˜+[(s*·ε˜)2/ŝ+(s*/ŝ+1)·(s*/ŝ)]1/2s*/ŝ+1, (B.14)

which is obtained by solving for Δ in (B.13). When we are solving for Δ in (B.13), we use the fact that s*·ε˜Δ, which holds because

s*·ε˜ŝ·ε˜=2·ŝ·β¯(t+0.5)β(t+0.5)β¯(t+0.5)21κ1+κΔ. (B.15)

The first inequality is from our assumption in (5.5) that s*/ŝ ≤ (1 − κ)2/[4·(1 + κ)2] < 1. The equality is from the definition of ε̃ in (B.10). The second inequality follows from our assumption in (5.5) that

ŝ·β(t+0.5)β¯(t+0.5)(1κ)22·(1+κ)·β*2

and the first inequality in (B.2). To prove the last inequality in (B.15), we note that (B.1) implies

β¯(t+0.5)22+β*222·β¯(t+0.5),β*=β¯(t+0.5)β*22κ2·β*22.

This together with (B.3) implies

Δ=θ¯,θ*=β¯(t+0.5),β*β¯(t+0.5)2·β*2β¯(t+0.5)22+β*22κ2·β*222·β¯(t+0.5)2·β*2(1κ)2+1κ22·(1+κ)=1κ1+κ, (B.16)

where in the second inequality we use both sides of (B.2). In summary, we have that (B.15) holds. Now we verify that (B.14) holds. By (B.15) we have

ŝ·ε˜1κ1+κ<1<(s*+ŝ)/ŝ,

which implies ε˜s*+ŝ/ŝ. For the right-hand side of (B.14) we have

s*·ε˜+[(s*·ε˜)2/ŝ+(s*/ŝ+1)·(s*/ŝ)]1/2s*/ŝ+1s*·ε˜+[(s*/ŝ+1)·(s*/ŝ)]1/2s*/ŝ+12·s*/(s*+ŝ), (B.17)

where the last inequality is obtained by plugging in ε˜s*+ŝ/ŝ. Meanwhile, note that we have

2·s*/(s*+ŝ)2·1/[1+4·(1+κ)2/(1κ)2](1κ)/(1+κ)Δ, (B.18)

where the first inequality is from our assumption in (5.5) that s*/ŝ ≤ (1 − κ)2/[4·(1 + κ)2], while the last inequality is from (B.16). Combining (B.17) and (B.18), we then obtain (B.14). By (B.14) we further establish (B.13), i.e., the right-hand side of (B.12) is upper bounded by Δ, which implies

θ¯12Δ. (B.19)

Furthermore, according to (B.6) we have

Δθ¯12·θ1*2+θ¯22·θ2*2θ¯12·θ1*2+1θ¯122·1θ1*22, (B.20)

where in the last inequality we use the fact θ* and θ̅ are unit vectors. Now we solve for θ1*2 in (B.20). According to (B.19) and the fact that θ1*2θ*2=1, on the right-hand side of (B.20) we have θ¯12·θ1*2θ¯12Δ. Thus, we have

(Δθ¯12·θ1*2)2(1θ¯122)·(1θ1*22).

Further by solving for θ1*2 in the above inequality, we obtain

θ1*2θ¯12·Δ+1θ¯122·1Δ2θ¯12+1Δ2(1+s*/ŝ)·1Δ2+s*·ε˜, (B.21)

where in the second inequality we use the fact that Δ ≤ 1, which follows from its definition, while in the last inequality we plug in (B.12). Then combining (B.12) and (B.21), we obtain

θ¯12·θ1*2[s*/ŝ·1Δ2+s*·ε˜]·[(1+s*/ŝ)·1Δ2+s*·ε˜]. (B.22)

Note that by (5.1) and the definition of θ̅ in (B.3), we have

β¯(t+1)=trunc(β¯(t+0.5),𝒮^(t+0.5))=trunc(θ¯,𝒮^(t+0.5))·β¯(t+0.5)2.

Therefore, we have

β¯(t+1)/β¯(t+0.5)2,β*/β*2=trunc(θ¯,𝒮^(t+0.5)),θ*=θ¯2,θ2*=θ¯,θ*θ¯1,θ1*θ¯,θ*θ¯12·θ1*2,

where the second and third equalities follow from (B.5). Let χ̅ = ‖β̅(t+0.5)2 · ‖β*‖2. Plugging (B.22) into the right-hand side of the above inequality and then multiplying χ̅ on both sides, we obtain

β¯(t+1),β*β¯(t+0.5),β*[s*/ŝ·χ¯·(1Δ2)+s*·χ¯·ε˜]·[(1+s*/ŝ)·χ¯·(1Δ2)+s*·χ¯·ε˜]=β¯(t+0.5),β*(s*/ŝ+s*/ŝ)·χ¯·(1Δ2)(1+2·s*/ŝ)·χ¯·(1Δ2)(i)·s*·χ¯·ε˜(ii)(s*·χ¯·ε˜)2. (B.23)

For term (i) in (B.23), note that 1Δ22·(1Δ). By (B.3) and the definition that Δ = 〈θ̅, θ*〉, for term (i) we have

χ¯·(1Δ2)2·χ¯·(1Δ)2·β¯(t+0.5)2·β*22·β¯(t+0.5),β*β¯(t+0.5)22+β*222·β¯(t+0.5),β*=β¯(t+0.5)β*2. (B.24)

For term (ii) in (B.23), by the definition of ε̃ in (B.10) we have

χ¯·ε˜=β¯(t+0.5)2·β*2·2·β¯(t+0.5)β(t+0.5)/β¯(t+0.5)2=2·β¯(t+0.5)β(t+0.5)·β*2/β¯(t+0.5)221κ·β¯(t+0.5)β(t+0.5), (B.25)

where the last inequality is obtained from (B.2). Plugging (B.24) and (B.25) into (B.23), we obtain

β¯(t+1),β*β¯(t+0.5),β*(s*/ŝ+s*/ŝ)·β¯(t+0.5)β*22(1+2·s*/ŝ)·β¯(t+0.5)β*2·2·s*1κ·β¯(t+0.5)β(t+0.5)4·s*1κ·β¯(t+0.5)β(t+0.5)2. (B.26)

Meanwhile, according to (5.1) we have that β̅(t+1) is obtained by truncating β̅(t+0.5), which implies

β¯(t+1)22+β*22β¯(t+0.5)22+β*22. (B.27)

Subtracting two times both sides of (B.26) from (B.27), we obtain

β¯(t+1)β*22(1+2·s*/ŝ+2·s*/ŝ)·β¯(t+0.5)β*22+(1+2·s*/ŝ)·4·s*1κ·β¯(t+0.5)β(t+0.5)·β¯(t+0.5)β*2+8·s*1κ·β¯(t+0.5)β(t+0.5)2.

We can easily verify that the above inequality implies

β¯(t+1)β*22(1+2·s*/ŝ+2·s*/ŝ)·[β¯(t+0.5)β*2+2·s*1κ·β¯(t+0.5)β(t+0.5)]2+8·s*1κ·β¯(t+0.5)β(t+0.5)2.

Taking square roots of both sides and utilizing the fact that a2+b2a+b(a,b>0), we obtain

β¯(t+1)β*2(1+4·s*/ŝ)1/2·β¯(t+0.5)β*2+C·s*1κ·β¯(t+0.5)β(t+0.5), (B.28)

where C > 0 is an absolute constant. Here we utilize the fact that s*/ŝs*/ŝ and

1+2·s*/ŝ+2·s*/ŝ5,

both of which follow from our assumption that s*/ŝ ≤ (1 − κ)2/[4·(1 + κ)2] < 1 in (5.5). By (B.28) we conclude the proof of Lemma 5.1.

B.2 Proof of Lemma 5.2

In the following, we prove (5.8) and (5.9) for the maximization and gradient ascent implementation of the M-step correspondingly.

Proof of (5.8)

To prove (5.8) for the maximization implementation of the M-step (Algorithm 2), note that by the self-consistency property (McLachlan and Krishnan, 2007) we have

β*=argmaxβQ(β;β*). (B.29)

Hence, β* satisfies the following first-order optimality condition

ββ*,1Q(β*;β*)0, for all β,

where ∇1Q(·, ·) denotes the gradient taken with respect to the first variable. In particular, it implies

β¯(t+0.5)β*,1Q(β*;β*)0. (B.30)

Meanwhile, by (5.1) and the definition of M(·) in (3.1), we have

β¯(t+0.5)=M(β(t))=argmax βQ(β;β(t)).

Hence we have the following first-order optimality condition

ββ¯(t+0.5),1Q(β¯(t+0.5);β(t))0, for all β,

which implies

β*β¯(t+0.5),1Q(β¯(t+0.5);β(t))0. (B.31)

Combining (B.30) and (B.31), we then obtain

β*β¯(t+0.5),1Q(β*;β*)β*β¯(t+0.5),1Q(β¯(t+0.5);β(t)),

which implies

β*β¯(t+0.5),1Q(β¯(t+0.5);β*)1Q(β*;β*)β*β¯(t+0.5),1Q(β¯(t+0.5);β*)1Q(β¯(t+0.5);β(t)). (B.32)

In the following, we establish upper and lower bounds for both sides of (B.32) correspondingly. By applying Condition Lipschitz-Gradient-1(γ1, ℬ), for the right-hand side of (B.32) we have

β*β¯(t+0.5),1Q(β¯(t+0.5);β*)1Q(β¯(t+0.5);β(t))β*β¯(t+0.5)2·1Q(β¯(t+0.5);β*)1Q(β¯(t+0.5);β(t))2γ1·β*β¯(t+0.5)2·β*β(t)2, (B.33)

where the last inequality is from (3.3). Meanwhile, for the left-hand side of (B.32), we have

Q(β¯(t+0.5);β*)Q(β*;β*)+1Q(β*;β*),β¯(t+0.5)β*ν/2·β¯(t+0.5)β*22, (B.34)
Q(β*;β*)Q(β¯(t+0.5);β*)+1Q(β¯(t+0.5);β*),β*β¯(t+0.5)ν/2·β¯(t+0.5)β*22 (B.35)

by (3.6) in Condition Concavity-Smoothness(μ, ν, ℬ). By adding (B.34) and (B.35), we obtain

ν·β¯(t+0.5)β*22β*β¯(t+0.5),1Q(β¯(t+0.5);β*)1Q(β*;β*). (B.36)

Plugging (B.33) and (B.36) into (B.32), we obtain

ν·β¯(t+0.5)β*22γ1·β*β¯(t+0.5)2·β*β(t)2,

which implies (5.8) in Lemma 5.2.

Proof of (5.9)

We turn to prove (5.9). The self-consistency property in (B.29) implies that β* is the maximizer of Q(·; β*). Furthermore, (3.5) and (3.6) in Condition Concavity-Smoothness(μ, ν, ℬ) ensure that −Q(·; β*) is μ-smooth and ν-strongly convex. By invoking standard optimization results for minimizing strongly convex and smooth objective functions, e.g., in Nesterov (2004), for stepsize η = 2/(ν + μ), we have

β(t)+η·1Q(β(t);β*)β*2(μνμ+ν)·β(t)β*2, (B.37)

i.e., the gradient ascent step decreases the distance to β* by a multiplicative factor. Hence, for the gradient ascent implementation of the M-step, i.e., M(·) defined in (3.2), we have

β¯(t+0.5)β*2=M(β(t))β*2=β(t)+η·1Q(β(t);β(t))β*2β(t)+η·1Q(β(t);β*)β*2+η·1Q(β(t);β*)1Q(β(t);β(t))2(μνμ+ν)·β(t)β*2+η·γ2·β(t)β*2, (B.38)

where the last inequality is from (B.37) and (3.4) in Condition Lipschitz-Gradient-2(γ2, ℬ). Plugging η = 2/(ν + μ) into (B.38), we obtain

β¯(t+0.5)β*2(μν+2·γ2μ+ν)·β(t)β*2,

which implies (5.9). Thus, we conclude the proof of Lemma 5.2.

B.3 Auxiliary Lemma for Proving Theorem 3.4

The following lemma characterizes the initialization step in line 2 of Algorithm 4.

Lemma B.1

Suppose that we have ‖βinit − β*‖2 ≤ κ·‖β*‖2 for some κ ∈ (0, 1). Assuming that ŝ ≥ 4·(1 + κ)²/(1 − κ)²·s*, we have ‖β(0) − β*‖2 ≤ [1 + 4·√(s*/ŝ)]^{1/2}·‖βinit − β*‖2.

Proof

Following the same proof of Lemma 5.1 with both β̅(t+0.5) and β(t+0.5) replaced with βinit, β̅(t+1) replaced with β(0) and 𝒮̂(t+0.5) replaced with 𝒮̂init, we reach the conclusion.

B.4 Proof of Lemma 3.6

Proof

Recall that Q(·; ·) is the expectation of Qn(·; ·). According to (A.1) and (3.1), we have

M(β)=𝔼[2·ωβ(Y)·YY]

with ωβ(·) being the weight function defined in (A.1), which together with (A.2) implies

Mn(β)M(β)=1ni=1n[2·ωβ(yi)1]·yi𝔼{[2·ωβ(Y)1]·Y}. (B.39)

Recall yi is the i-th realization of Y, which follows the mixture distribution. For any u > 0, we have

𝔼{exp[u·Mn(β)M(β)]}=𝔼{maxj{1,,d} exp[u·|[Mn(β)M(β)]j|]}j=1d𝔼{exp [u·|[Mn(β)M(β)]j|]}. (B.40)

Based on (B.39), we apply the symmetrization result in Lemma D.4 to the right-hand side of (B.40). Then we have

𝔼{exp[u·Mn(β)M(β)]}j=1d𝔼{exp[u·|1ni=1nξi·[2·ωβ(yi)1]·yi,j|]}, (B.41)

where ξ1, …, ξn are i.i.d. Rademacher random variables that are independent of y1, …, yn. Then we invoke the contraction result in Lemma D.5 by setting

f(yi,j)=yi,j,={f},ψi(υ)=[2·ωβ(yi)1]·υ, and ϕ(υ)=exp(u·υ),

where u is the variable of the moment generating function in (B.40). From the definition of ωβ(·) in (A.1) we have |2 · ωβ(yi) − 1| ≤ 1, which implies

|ψi(υ)ψi(υ)||[2·ωβ(yi)1]·(υυ)||υυ|, for all υ,υ.

Therefore, by Lemma D.5 we obtain

𝔼{exp[u·|1ni=1nξi·[2·ωβ(yi)1]·yi,j|]}𝔼{exp[2·u·|1ni=1nξi·yi,j|]} (B.42)

for the right-hand side of (B.41), where j ∈ {1, …, d}. Here note that in Gaussian mixture model we have yi,j=zi·βj*+υi,j, where zi is a Rademacher random variable and υi,j ~ N(0, σ2). Therefore, according to Example 5.8 in Vershynin (2010) we have zi·βj*ψ2|βj*| and ‖υi,jψ2C · σ. Hence by Lemma D.1 we have

yi,jψ2=zi·βj*+υi,jψ2C·|βj*|2+C·σ2C·β*2+C·σ2.

Since |ξi · yi,j| = |yi,j|, ξi · yi,j and yi,j have the same ψ2-norm. Because ξi is a Rademacher random variable independent of yi,j, we have 𝔼(ξi · yi,j) = 0. By Lemma 5.5 in Vershynin (2010), we obtain

𝔼[exp(u·ξi·yi,j)]exp[(u)2·C·(β*2+C·σ2)], for all u. (B.43)

Hence, for the right-hand side of (B.42) we have

𝔼{exp[2·u·|1ni=1nξi·yi,j|]}𝔼(max{exp[2·u·1ni=1nξi·yi,j],exp[2·u·1ni=1nξi·yi,j]})𝔼{exp[2·u·1ni=1nξi·yi,j]}+𝔼{exp[2·u·1ni=1nξi·yi,j]}2·exp[C·u2·(β*2+C·σ2)/n]. (B.44)

Here the last inequality is obtained by plugging (B.43) with u′ = 2 · u/n and u′ = −2 · u/n respectively into the two terms. Plugging (B.44) into (B.42) and then into (B.41), we obtain

𝔼{exp[u·Mn(β)M(β)]}2·d·exp[C·u2·(β*2+C·σ2)/n].

By Chernoff bound we have that, for all u > 0 and υ > 0,

[Mn(β)M(β)>υ]𝔼{exp[u·Mn(β)M(β)]}/exp(u·υ)2·exp[C·u2·(β*2+C·σ2)/nu·υ+log d].

Minimizing the right-hand side over u we obtain

[Mn(β)M(β)>υ]2·exp{n·υ2/[4·C·(β*2+C·σ2)]+log d}.

Setting the right-hand side to be δ, we have that

υ=C·(β*2+C·σ2)1/2·log d+log(2/δ)nC·(β*+σ)·log d+log(2/δ)n

holds for some absolute constants C, C′ and C″, which completes the proof of Lemma 3.6.

B.5 Proof of Lemma 3.9

In the sequel, we first establish the result for the maximization implementation of the M-step and then for the gradient ascent implementation.

Maximization Implementation

For the maximization implementation we need to estimate the inverse covariance matrix Θ* = Σ−1 with the CLIME estimator Θ̂ defined in (A.5). The following lemma from Cai et al. (2011) quantifies the statistical rate of convergence of Θ̂. Recall that ‖·‖1,∞ is defined as the maximum of the row ℓ1-norms of a matrix.

Lemma B.2

For Σ = Id and λCLIME = C·√(log d/n) in (A.5), we have that

‖Θ̂ − Θ*‖1,∞ ≤ C′·√[(log d + log(1/δ))/n]

holds with probability at least 1 − δ, where C and C′ are positive absolute constants.

Proof

See the proof of Theorem 6 in Cai et al. (2011) for details.

Now we are ready to prove (3.21) of Lemma 3.9.

Proof

Recall that Q(·; ·) is the expectation of Qn(·; ·). According to (A.3) and (3.1), we have

M(β)=𝔼{[2·ωβ(X,Y)1]·Y·X} (B.45)

with ωβ(·, ·) being the weight function defined in (A.3), which together with (A.6) implies

Mn(β)M(β)=Θ^·1ni=1n[2·ωβ(xi,yi)1]·yi·xi𝔼{[2·ωβ(X,Y)1]·Y·X}.

Here Θ̂ is the CLIME estimator defined in (A.5). For notational simplicity, we denote

ω¯i=2·ωβ(xi,yi)1, and ω¯=2·ωβ(X,Y)1. (B.46)

It is worth noting that both ω̅i and ω̅ depend on β. Note that we have

Mn(β)M(β)Θ^·[1ni=1nω¯i·yi·xi𝔼(ω¯·Y·X)]+(Θ^Id)·𝔼(ω¯·Y·X)Θ^1,(i)·1ni=1nω¯i·yi·xi𝔼(ω¯·Y·X)(ii)+Θ^Id1,(iii)·𝔼(ω¯·Y·X)(iv). (B.47)

Analysis of Term (i)

For term (i) in (B.47), recall that by our model assumption we have Σ = Id, which implies Θ* = Σ−1 = Id. By Lemma B.2, for a sufficiently large sample size n, we have that

Θ^1,Θ^Id1,+Id1,1/2+1=3/2 (B.48)

holds with probability at least 1 − δ/4.

Analysis of Term (ii)

For term (ii) in (B.47), we have that for u > 0,

𝔼{exp[u·1ni=1nω¯i·yi·xi𝔼(ω¯·Y·X)]}=𝔼{maxj{1,,d} exp[u·|1ni=1nω¯i·yi·xi,j𝔼(ω¯·Y·Xj)|]}j=1d𝔼{exp[u·|1ni=1nω¯i·yi·xi,j𝔼(ω¯·Y·Xj)|]}j=1d𝔼{exp[u·|1ni=1nξi·ω¯i·yi·xi,j|]}, (B.49)

where ξ1,…, ξn are i.i.d. Rademacher random variables. The last inequality follows from Lemma D.4. Furthermore, for the right-hand side of (B.49), we invoke the contraction result in Lemma D.5 by setting

f(yi·xi,j)=yi·xi,j,={f},ψi(υ)=ω¯i·υ, and ϕ(υ)=exp(u·υ),

where u is the variable of the moment generating function in (B.49). From the definitions in (A.3) and (B.46) we have |ω̅i| = |2 · ωβ(xi, yi) − 1 | ≤ 1, which implies

|ψi(υ)ψi(υ)||[2·ωβ(xi,yi)1]·(υυ)||υυ|, for all υ,υ.

By Lemma D.5, we obtain

𝔼{exp[u·|1ni=1nξi·ω¯i·yi·xi,j|]}𝔼{exp[2·u·|1ni=1nξi·yi·xi,j|]} (B.50)

for j ∈ {1,…, d} on the right-hand side of (B.49). Recall that in mixture of regression model we have yi = zi · 〈β*, xi〉 + υi, where zi is a Rademacher random variable, υi ~ N(0, σ2), and xi ~ N(0, Id). Then by Example 5.8 in Vershynin (2010) we have ‖zi · 〈β*, xi〉‖ψ2 = ‖〈β*, xi〉‖ψ2C · ‖β*‖2 and ‖υi,jψ2C′ · σ. By Lemma D.1 we further have

yiψ2=zi·β*,xi+υiψ2C·β*22+C·σ2.

Note that we have ‖xi,jψ2C″ since xi,j ~ N(0, 1). Therefore, by Lemma D.2 we have

ξi·yi·xi,jψ1=yi·xi,jψ1max {C·β*22+C·σ2,C}C·max {β*22+σ2,1}.

Since ξi is a Rademacher random variable independent of yi · xi,j, we have 𝔼(ξi ·yi ·xi,j) = 0. Hence, by Lemma 5.15 in Vershynin (2010), we obtain

𝔼[exp(u·ξi·yi·xi,j)]exp[C·(u)2·max {β*22+σ2,1}2] (B.51)

for all |u′| ≤ C′/ max {β*22+σ2, 1}. Hence we have

𝔼{exp[2·u·|1ni=1nξi·yi·xi,j|]}𝔼(max{exp[2·u·1ni=1nξi·yi·xi,j],exp[2·u·1ni=1nξi·yi·xi,j]})𝔼{exp[2·u·1ni=1nξi·yi·xi,j]}+𝔼{exp[2·u·1ni=1nξi·yi·xi,j]}2·exp[C·u2·max {β*22+σ2,1}2/n]. (B.52)

The last inequality is obtained by plugging (B.51) with u′ = 2·u/n and u′ = −2·u/n correspondingly into the two terms. Here |u|C·n/max{β*22+σ2,1}. Plugging (B.52) into (B.50) and further into (B.49), we obtain

𝔼{exp[u·1ni=1nω¯i·yi·xi𝔼(ω¯·Y·X)]}2·d·exp[C·u2·max {β*22+σ2,1}2/n].

By Chernoff bound we have that, for all υ > 0 and |u|C·n/max{β*22+σ2,1},

[1ni=1nω¯i·yi·xi𝔼(ω¯·Y·X)>υ]𝔼{exp[u·1ni=1nω¯i·yi·xi𝔼(ω¯·Y·X)]}/exp(u·υ)2·exp[C·u2·max {β*22+σ2,1}2/nu·υ+log d].

Minimizing over u on its right-hand side we have that, for 0<υC·max{β*22+σ2,1},

[1ni=1nω¯i·yi·xi𝔼(ω¯·Y·X)>υ]2·exp{n·υ2/[4·C·max {β*22+σ2,1}2]+log d}.

Setting the right-hand side of the above inequality to be δ/2, we have that

1ni=1nω¯i·yi·xi𝔼(ω¯·Y·X)υ=C·max {β*22+σ2,1}·log d+log(4/δ)n (B.53)

holds with probability at least 1 − δ/2 for a sufficiently large n.

Analysis of Term (iii)

For term (iii) in (B.47), by Lemma B.2 we have

Θ^Id1,C·log d+log(4/δ)n (B.54)

with probability at least 1 − δ/4 for a sufficiently large n.

Analysis of Term (iv)

For term (iv) in (B.47), recall that by (B.45) and (B.46) we have

M(β)=𝔼{[2·ωβ(X,Y)1]·Y·X}=𝔼(ω¯·Y·X),

which implies

𝔼(ω¯·Y·X)=M(β)M(β)β*2+β*2ββ*2+β*2(1+1/32)·β*2, (B.55)

where the first inequality follows from triangle inequality and ‖·‖∞ ≤ ‖·‖2, the second inequality is from the proof of (5.8) in Lemma 5.2 with β̅(t+0.5) replaced with β and the fact that γ1/ν < 1 in (5.8), and the third inequality holds since in Condition Statistical-Error(ε, δ, s, n, ℬ) we suppose that β ∈ ℬ, and for mixture of regression model ℬ is specified in (3.20).

Plugging (B.48), (B.53), (B.54) and (B.55) into (B.47), by union bound we have that

Mn(β)M(β)C·max {β*22+σ2,1}·log d+log(4/δ)n+C·β*2·log d+log(4/δ)nC·[max {β*22+σ2,1}+β*2]·log d+log(4/δ)n

holds with probability at least 1 − δ. Therefore, we conclude the proof of (3.21) in Lemma 3.9.

Gradient Ascent Implementation

In the following, we prove (3.22) in Lemma 3.9.

Proof

Recall that Q(·; ·) is the expectation of Qn(·; ·). According to (A.3) and (3.2), we have

M(β)=β+η·𝔼[2·ωβ(X,Y)·Y·Xβ]

with ωβ(·, ·) being the weight function defined in (A.3), which together with (A.7) implies

Mn(β)M(β)=η·1ni=1n[2·ωβ(xi,yi)·yi·xixi·xi·β]η·𝔼[2·ωβ(X,Y)·Y·Xβ]η·1ni=1n[2·ωβ(xi,yi)·yi·xi]𝔼[2·ωβ(X,Y)·Y·X](i)+η·1ni=1nxi·xi·ββ(ii). (B.56)

Here η > 0 denotes the stepsize in Algorithm 3.

Analysis of Term (i)

For term (i) in (B.56), we redefine ω̅i and ω̅ in (B.46) as

ω¯i=2·ωβ(xi,yi), and ω¯=2·ωβ(X,Y). (B.57)

Note that |ω̅i| = |2 · ωβ(xi, yi)| ≤ 2. Following the same way we establish the upper bound of term (ii) in (B.47), under the new definitions of ω̅i and ω̅ in (B.57) we have that

1ni=1n[2·ωβ(xi,yi)·yi·xi]𝔼[2·ωβ(X,Y)·Y·X]C·max {β*22+σ2,1}·log d+log(4/δ)n

holds with probability at least 1 − δ/2.

Analysis of Term (ii)

For term (ii) in (B.56), we have

1ni=1nxi·xi·ββ1ni=1nxi·xiId,(ii).a·β1(ii).b.

For term (ii).a, recall by our model assumption we have 𝔼(X·X) = Id and xi’s are the independent realizations of X. Hence we have

1ni=1nxi·xiId,maxj{1,,d}maxk{1,,d}|1ni=1nxi,j·xi,k𝔼(Xj·Xk)|.

Since Xj, Xk ~ N(0, 1), according to Example 5.8 in Vershynin (2010) we have ‖Xjψ2 = ‖Xkψ2C. By Lemma D.2, Xj ·Xk is a sub-exponential random variable with ‖Xj ·Xkψ1C′. Moreover, we have ‖Xj ·Xk − 𝔼(Xj · Xk)‖ψ1C″ by Lemma D.3. Then by Bernstein’s inequality (Proposition 5.16 in Vershynin (2010)) and union bound, we have

[1ni=1nxi·xiId,>υ]2·d2·exp(C·n·υ2)

for 0 < υ ≤ C′ and a sufficiently large sample size n. Setting its right-hand side to be δ/2, we have

1ni=1nxi·xiId,C·2·log d+log(4/δ)n

holds with probability at least 1 − δ/2. For term (ii).b we have ‖β‖1 ≤ √s·‖β‖2, since in Condition Statistical-Error(ε, δ, s, n, ℬ) we assume ‖β‖0 ≤ s. Furthermore, we have ‖β‖2 ≤ ‖β*‖2 + ‖β* − β‖2 ≤ (1 + 1/32)·‖β*‖2, because in Condition Statistical-Error(ε, δ, s, n, ℬ) we assume that β ∈ ℬ, and for mixture of regression model ℬ is specified in (3.20).

Plugging the above results into the right-hand side of (B.56), by union bound we have that

Mn(β)M(β)η·C·max {β*22+σ2,1}·log d+log(4/δ)n+η·C·2·log d+log(4/δ)n·s·β*2η·C·max {β*22+σ2,1,s·β*2}·log d+log(4/δ)n

holds with probability at least 1 − δ. Therefore, we conclude the proof of (3.22) in Lemma 3.9.

B.6 Proof of Lemma 3.12

Proof

Recall that Q(·; ·) is the expectation of Qn(·; ·). Let Xobs be the subvector corresponding to the observed component of X in (2.26). By (A.8) and (3.2), we have

M(β)=β+η·𝔼[Kβ(Xobs,Y)·β+mβ(Xobs,Y)·Y]

with η > 0 being the stepsize in Algorithm 3, which together with (A.11) implies

Mn(β)M(β)=η·{1ni=1nKβ(xiobs,yi)𝔼[Kβ(Xobs,Y)]}·β+1ni=1nmβ(xiobs,yi)·yi𝔼[mβ(Xobs,Y)·Y]η·1ni=1nKβ(xiobs,yi)𝔼[Kβ(Xobs,Y)],(i)·β1(ii)+η·1ni=1nmβ(xiobs,yi)·yi𝔼[mβ(Xobs,Y)·Y](iii). (B.58)

Here Kβ(·, ·) ∈ ℝd×d and mβ(·, ·) ∈ ℝd are defined in (A.9) and (A.10). To ease the notation, let

Kβ(xiobs,yi)=K¯i,Kβ(Xobs,Y)=K¯,mβ(xiobs,yi)=m¯i,mβ(Xobs,Y)=m¯. (B.59)

Let the entries of Z∈ℝd be i.i.d. Bernoulli random variables, each of which is zero with probability pm, and z1,…, zn be the n independent realizations of Z. If zi,j = 1, then xi,j is observed, otherwise xi,j is missing (unobserved).

Analysis of Term (i)

For term (i) in (B.58), recall that we have

K¯=diag(1Z)+m¯·m¯[(1Z)m¯]·[(1Z)m¯],

where diag(1Z) denotes the d × d diagonal matrix with the entries of 1Z ∈ ℝd on its diagonal, and ⊙ denotes the Hadamard product. Therefore, by union bound we have

[1ni=1nK¯i𝔼K¯,>υ]j=1dk=1d{[|1ni=1n[diag(1zi)]j,k𝔼[diag(1z)]j,k|>υ/3]+[|1ni=1nm¯ji·m¯ki𝔼(m¯j·m¯k)|>υ/3]}+[|1ni=1n(1zi,j)·m¯ji·(1zi,k)·m¯ki𝔼[(1Zj)·m¯j·(1Zk)·m¯k]|>υ/3]}. (B.60)

According to Lemma B.3 we have, for all j ∈ {1,…, d}, j is a zero-mean sub-Gaussian random variable with ‖jψ2C · (1 + κ · r). Then by Lemmas D.2 and D.3 we have

m¯j·m¯k𝔼(m¯j·m¯k)ψ12·m¯j·m¯kψ1C·max{m¯jψ22,m¯kψ22}C·(1+κ·r)2.

Meanwhile, since |1 − Zj | ≤ 1, it holds that ‖(1 − Zj) · jψ2 ≤ ‖jψ2C · (1 + κ · r). Similarly, by Lemmas D.2 and D.3 we have

(1Zj)·m¯j·(1Zk)·m¯k𝔼[(1Zj)·m¯j·(1Zk)·m¯k]ψ1C·(1+κ·r)2.

In addition, for the first term on the right-hand side of (B.60) we have

[diag(1Z)]j,k𝔼[diag(1Z)]j,kψ22·[diag(1Z)]j,kψ22,

where the first inequality is from Lemma D.3 and the second inequality follows from Example 5.8 in Vershynin (2010) since [diag(1Z)]j,k ∈ [0, 1]. Applying Hoeffding’s inequality to the first term on the right-hand side of (B.60) and Bernstein’s inequality to the second and third terms, we obtain

[1ni=1nK¯i𝔼K¯,>υ]d2·2·exp(C·n·υ2)+d2·4·exp[C·n·υ2/(1+κ·r)4].

Setting the two terms on the right-hand side of the above inequality to be δ/6 and δ/3 respectively, we have that

‖1/n·∑i=1n K̄i − 𝔼K̄‖∞,∞ ≤ υ = C″·max{(1 + κ·r)², 1}·√[(log d + log(12/δ))/n] = C″·(1 + κ·r)²·√[(log d + log(12/δ))/n]

holds with probability at least 1 − δ/2, for sufficiently large constant C″ and sample size n.

Analysis of Term (ii)

For term (ii) in (B.58) we have

β1s·β2s·β*2+s·β*β2s·(1+κ)·β*2,

where the first inequality holds because in Condition Statistical-Error(ε, δ, s, n, ℬ) we assume ‖β0s, while the last inequality holds since in Condition Statistical-Error(ε, δ, s, n, ℬ) we assume that β ∈ ℬ, and for regression with missing covariates ℬ is specified in (3.27).

Analysis of Term (iii)

For term (iii) in (B.58), by union bound we have

[1ni=1nm¯i·yi𝔼(m¯·Y)>υ]j=1d[|1ni=1nm¯ji·yi𝔼(m¯j·Y)|>υ]d·2·exp[C·n·υ2/max{(1+κ·r)4,(σ2+β*22)2}] (B.61)

for a sufficiently large n. Here the second inequality is from Bernstein’s inequality, since we have

m¯j·Y𝔼(m¯j·Y)ψ1C·max{m¯jψ22,Yψ22}C·max{(1+κ·r)2,σ2+β*22}.

Here the first inequality follows from Lemmas D.2 and D.3 and the second inequality is from Lemma B.3 and the fact that Y=X,β*+V~N(0,β*22+σ2), because X ~ N(0, Id) and V ~ N(0, σ2) are independent. Setting the right-hand side of (B.61) to be δ/2, we have that

1ni=1nm¯i·yi𝔼(m¯·Y)υ=C·max{(1+κ·r),σ2+β*22}·log d+log(4/δ)n

holds with probability at least 1 − δ/2 for sufficiently large constant C and sample size n.

Finally, plugging in the upper bounds for terms (i)–(iii) in (B.58) we have that

Mn(β)M(β)C·η·[s·β*2·(1+κ)·(1+κ·r)2+max{(1+κ·r)2,σ2+β*22}]·log d+log(12/δ)n

holds with probability at least 1 − δ. Therefore, we conclude the proof of Lemma 3.12.

The following auxiliary lemma used in the proof of Lemma 3.12 characterizes the sub-Gaussian property of defined in (B.59).

Lemma B.3

Under the assumptions of Lemma 3.12, the random vector ∈ ℝd defined in (B.59) is sub-Gaussian with mean zero and ‖jψ2C · (1 + κ · r) for all j ∈ {1,…, d}, where C > 0 is a sufficiently large absolute constant.

Proof

In the proof of Lemma 3.12, Z’s entries are i.i.d. Bernoulli random variables, which satisfy ℙ(Zj = 0) = pm for all j ∈ {1,…, d}. Meanwhile, from the definitions in (A.9) and (B.59), we have

m¯=ZX+Yβ,ZXσ2+(1Z)β22·(1Z)β.

Since we have Y = 〈β*, X〉 + V and − 〈β, ZX〉 = − 〈β, X〉 + 〈β, (1 − Z) ⊙ X〉, it holds that

m¯j=Zj·Xj(i)+V·(1Zj)·βjσ2+(1Z)β22(ii)+β*β,X·(1Zj)·βjσ2+(1Z)β22(iii)+β,(1Z)X·(1Zj)·βjσ2+(1Z)β22(iv). (B.62)

Since X ~ N(0, Id) is independent of Z, we can verify that 𝔼(j) = 0. In the following we provide upper bounds for the ψ2-norms of terms (i)–(iv) in (B.62).

Analysis of Terms (i) and (ii)

For term (i), since Xj ~ N(0, 1), we have ‖Zj · Xjψ2 ≤ ‖Xjψ2C, where the last inequality follows from Example 5.8 in Vershynin (2010). Meanwhile, for term (ii) in (B.62), we have that for any u > 0,

[|V·(1Zj)·βjσ2+(1Z)β22|>u][|V|·(1Z)β2σ2+(1Z)β22>u]=𝔼Z{[|V|·(1Z)β2σ2+(1Z)β22>u|Z]}𝔼Z{2·exp(C·u2·[σ2+(1Z)β22]2/[σ2·(1Z)β22])}𝔼Z{2·exp(C·u2)}=2·exp(C·u2), (B.63)

where the second last inequality is from the fact that ‖Vψ2C″ · σ, while the last inequality holds because we have

[σ2+(1Z)β22]2σ2·(1Z)β22. (B.64)

According to Lemma 5.5 in Vershynin (2010), by (B.63) we conclude that the ψ2-norm of term (ii) in (B.62) is upper bounded by some absolute constant C > 0.

Analysis of Term (iii)

For term (iii), by the same conditioning argument in (B.63), we have

[|β*β,X·(1Zj)·βjσ2+(1Z)β22|>u]𝔼Z{2·exp(C·u2·[σ2+(1Z)β22]2/[β*β22·(1Z)β22])}, (B.65)

where we utilize the fact that ‖〈β* − β, X〉‖ψ2C′· ‖β* − β2, since β*β,X~N(0,β*β22). Note that in Condition Statistical-Error(ε, δ, s, n, ℬ) we assume β ∈ ℬ, and for regression with missing covariates ℬ is specified in (3.27). Hence we have

β*β22κ2·β*22κ2·r2·σ2,

where the first inequality follows from (3.27), and the second inequality follows from our assumption that ‖β*‖2/σ ≤ r on the maximum signal-to-noise ratio. By invoking (B.64), from (B.65) we have

[|β*β,X·(1Zj)·βjσ2+(1Z)β22|>u]2·exp[C·u2/(κ2·r2)],

which further implies that the ψ2-norm of term (iii) in (B.62) is upper bounded by C″ · κ · r.

Analysis of Term (iv)

For term (iv), note that 〈β, (1−Z) ⊙ X〉 = 〈(1−Z) ⊙ β, X〉. By invoking the same conditioning argument in (B.63), we have

[|β,(1Z)X·(1Zj)·βjσ2+(1Z)β22|>u]𝔼Z{2·exp(C·u2·[σ2+(1Z)β22]2/(1Z)β24)}𝔼Z[2·exp(C·u2)]=2·exp(C·u2), (B.66)

where we utilize the fact that ‖〈(1 − Z) ⊙ β, X〉‖ψ2C′· ‖(1 − Z) ⊙ β2 when conditioning on Z. By (B.66) we have that the ψ2-norm of term (iv) in (B.62) is upper bounded by C″ > 0.

Combining the above upper bounds for the ψ2-norms of terms (i)–(iv) in (B.62), by Lemma D.1 we have that

m¯jψ2C+C·κ2·r2C·(1+κ·r)

holds for a sufficiently large absolute constant C″>0. Therefore, we establish Lemma B.3.

C Proof of Results for Inference

In the following, we provide the detailed proof of the theoretical results for asymptotic inference in §4. We first present the proof of the general results, and then the proof for specific models.

C.1 Proof of Lemma 2.1

Proof

In the sequel we establish the two equations in (2.12) respectively.

Proof of the First Equation

According to the definition of the lower bound function Qn(·; ·) in (2.5), we have

Qn(β;β)=1ni=1n𝒵kβ(z|yi)·log fβ(yi,z)dz. (C.1)

Here kβ(z | yi) is the density of the latent variable Z conditioning on the observed variable Y = yi under the model with parameter β. Hence we obtain

1Qn(β;β)=1ni=1n𝒵kβ(z|yi)·fβ(yi,z)/βfβ(yi,z)dz=1ni=1n𝒵fβ(yi,z)/βhβ(yi)dz, (C.2)

where hβ(yi) is the marginal density function of Y evaluated at yi, and the second equality follows from the fact that

kβ(z|yi)=fβ(yi,z)/hβ(yi), (C.3)

since kβ(z | yi) is the conditional density. According to the definition in (2.2), we have

n(β)=i=1n log hβ(yi)β=i=1nhβ(yi)/βhβ(yi)=i=1n𝒵fβ(yi,z)/βhβ(yi)dz, (C.4)

where the last equality is from (2.1). Comparing (C.2) and (C.4), we obtain ∇1Qn(β; β)=∇ℓn(β)/n.

Proof of the Second Equation

For the second equation in (2.12), by (C.1) and (C.3) we have

Qn(β;β)=1ni=1n𝒵fβ(yi,z)hβ(yi)·log fβ(yi,z)dz.

By calculation we obtain

1,22Qn(β;β)=1ni=1n𝒵fβ(yi,z)/βfβ(yi,z){fβ(yi,z)/β·hβ(yi)[hβ(yi)]2fβ(yi,z)·hβ(yi)/β[hβ(yi)]2}dz. (C.5)

Here ⊗ denotes the vector outer product. Note that in (C.5) we have

𝒵fβ(yi,z)/βfβ(yi,z)fβ(yi,z)/βhβ(yi)dz=𝒵[fβ(yi,z)/βfβ(yi,z)]2·fβ(yi,z)hβ(yi)dz=𝒵[fβ(yi,z)/βfβ(yi,z)]2·kβ(z|yi)dz=𝔼β[S˜β(Y,Z)2|Y=yi], (C.6)

where v⊗2 denotes vv. Here β(·, ·) is defined as

S˜β(y,z)= log fβ(y,z)β=fβ(y,z)/βfβ(y,z)d, (C.7)

i.e., the score function for the complete likelihood, which involves both the observed variable Y and the latent variable Z. Meanwhile, in (C.5) we have

𝒵fβ(yi,z)/βfβ(yi,z)fβ(yi,z)·hβ(yi)/β[hβ(yi)]2dz=[𝒵fβ(yi,z)/βhβ(yi)dz]hβ(yi)/βhβ(yi)=[𝒵fβ(yi,z)/βhβ(yi)dz]2, (C.8)

where in the last equality we utilize the fact that

𝒵fβ(yi,z)dz=hβ(yi), (C.9)

because hβ(·) is the marginal density function of Y. By (C.3) and (C.7), for the right-hand side of (C.8) we have

𝔼β[S˜β(Y,Z)|Y=yi]=𝒵fβ(yi,z)/βfβ(yi,z)·kβ(z|yi)dz=𝒵fβ(yi,z)/βhβ(yi)dz. (C.10)

Plugging (C.10) into (C.8) and then plugging (C.6) and (C.8) into (C.5) we obtain

1,22Qn(β;β)=1ni=1n(𝔼β[S˜β(Y,Z)2|Y=yi]{𝔼β[S˜β(Y,Z)|Y=yi]}2).

Setting β = β* in the above equality, we obtain

𝔼β*[1,22Qn(β*;β*)]=𝔼β*{Covβ*[S˜β*(Y,Z)|Y]}. (C.11)

Meanwhile, for β = β*, by the property of Fisher information we have

I(β*)=Covβ*[n(β*)]=Covβ*[ log hβ*(Y)β]=Covβ*{𝔼β*[S˜β*(Y,Z)|Y]}. (C.12)

Here the last equality is obtained by taking β = β* in

 log hβ(Y)β=hβ(Y)β·1hβ(Y)=𝒵fβ(Y,z)/βhβ(Y)dz=𝒵fβ(Y,z)/βfβ(Y,z)·kβ(z|Y)dz=𝒵S˜β(Y,z)·kβ(z|Y)dz=𝔼β[S˜β(Y,Z)|Y],

where the second equality follows from (C.9), the third equality follows from (C.3), while the second last equality follows from (C.7). Combining (C.11) and (C.12), by the law of total variance we have

I(β*)+𝔼β*[1,22Qn(β*;β*)]=Covβ*{𝔼β*[S˜β*(Y,Z)|Y]}+𝔼β*{Covβ*[S˜β*(Y,Z)|Y]}=Covβ*[S˜β*(Y,Z)]. (C.13)

In the following, we prove

𝔼β*[1,12Qn(β*;β*)]=Covβ*[S˜β*(Y,Z)]. (C.14)

According to (C.1) we have

1,12Qn(β;β)=1ni=1n𝒵kβ(z|yi)·2logfβ(yi,z)2βdz=1ni=1n𝔼β[2logfβ(Y,Z)2β|Y=yi]. (C.15)

Let ℓ̃(β)=log fβ(Y, Z) be the complete log-likelihood, which involves both the observed variable Y and the latent variable Z, and Ĩ (β) be the corresponding Fisher information. By setting β = β* in (C.15) and taking expectation, we obtain

𝔼β*[1,12Qn(β*;β*)]=𝔼β*{𝔼β*[2logfβ*(Y,Z)2β|Y]}=𝔼β*[2˜(β*)]=I˜(β*). (C.16)

Since β(Y, Z) defined in (C.7) is the score function for the complete log-likelihood ℓ̃(β), according to the relationship between the score function and Fisher information, we have

I˜(β*)=Covβ*[S˜β*(Y,Z)],

which together with (C.16) implies (C.14). By further plugging (C.14) into (C.13), we obtain

𝔼β*[1,12Qn(β*;β*)+1,22Qn(β*;β*)]=I(β*),

which establishes the first equality of the second equation in (2.12). In addition, the second equality of the second equation in (2.12) follows from the property of Fisher information. Thus, we conclude the proof of Lemma 2.1.

C.2 Auxiliary Lemmas for Proving Theorems 4.6 and 4.7

In this section, we lay out several lemmas on the Dantzig selector defined in (2.10). The first lemma, which is from Bickel et al. (2009), characterizes the cone condition for the Dantzig selector.

Lemma C.1

Any feasible solution w in (2.10) satisfies

[w(β,λ)w]𝒮w¯1[w(β,λ)w]𝒮w1,

where w(β, λ) is the minimizer of (2.10), 𝒮w is the support of w and 𝒮w¯ is its complement.

Proof

See Lemma B.3 in Bickel et al. (2009) for a detailed proof.

In the sequel, we focus on analyzing w(β̂, λ). The results for w(β̂0, λ) can be obtained similarly. The next lemma characterizes the restricted eigenvalue of Tn(β̂), which is defined as

ρ^min=infv𝒞v·[Tn(β^)]·vv22, where 𝒞={v:v𝒮w*¯1v𝒮w*1,v0}. (C.17)

Here 𝒮w* is the support of w* defined in (4.1).

Lemma C.2

Under Assumption 4.5 and Conditions 4.1, 4.3 and 4.4, for a sufficiently large sample size n, we have ρ̂min ≥ ρmin/2 > 0 with high probability, where ρmin is specified in (4.4).

Proof

By triangle inequality we have

ρ^mininfv𝒞v·[Tn(β^)]·vv22infv𝒞v·I(β*)·v|v·[I(β*)+Tn(β^)]·v|v22infv𝒞v·I(β*)·vv22(i)supv𝒞|v·[I(β*)+Tn(β^)]·v|v22(ii), (C.18)

where 𝒞 is defined in (C.17).

Analysis of Term (i)

For term (i) in (C.18), by (4.4) in Assumption 4.5 we have

infv𝒞v·I(β*)·vv22infv0v·I(β*)·vv22=λd[I(β*)]ρmin. (C.19)
Analysis of Term (ii)

For term (ii) in (C.18) we have

supv𝒞|v·[I(β*)+Tn(β^)]·v|v22supv𝒞v12·I(β*)+Tn(β^),v22. (C.20)

By the definition of 𝒞 in (C.17), for any v ∈ 𝒞 we have

v1=v𝒮w*1+v𝒮w*¯12·v𝒮w*12·sw*·v𝒮w*22·sw*·v2.

Therefore, the right-hand side of (C.20) is upper bounded by

4·sw*·I(β*)+Tn(β^),4·sw*·I(β*)+Tn(β*),(ii).a+4·sw*·T(β^)Tn(β*),(ii).b.

For term (ii).a, by Lemma 2.1 and Condition 4.3 we have

4·sw*·I(β*)+Tn(β*),=4·sw*·Tn(β*)𝔼β*[Tn(β*)],=O(sw*·ζT)=o(1),

where the last equality is from (4.6) in Assumption 4.5, since for λ specified in (4.5) we have

sw*·ζTsw*·λmax{w*1,1}·sw*·λ=o(1).

For term (ii).b, by Conditions 4.1 and 4.4 we have

4·sw*·T(β^)Tn(β*),=4·sw*·O(ζL)·β^β*1=O(sw*·ζL·ζEM)=o(1),

where the last equality is also from (4.6) in Assumption 4.5, since for λ specified in (4.5) we have

sw*·ζL·ζEMsw*·λmax{w*1,1}·sw*·λ=o(1).

Hence, term (ii) in (C.18) is o(1). Since ρmin is an absolute constant, for a sufficiently large n we have that term (ii) is upper bounded by ρmin/2 with high probability. Further by plugging this and (C.19) into (C.18), we conclude that ρ̂min ≥ ρmin/2 holds with high probability.

The next lemma quantifies the statistical accuracy of w(β̂, λ), where w(·, ·) is defined in (2.10).

Lemma C.3

Under Assumption 4.5 and Conditions 4.1–4.4, for λ specified in (4.5) we have that

max{w(β^,λ)w*1,w(β^0,λ)w*1}16·sw*·λρmin

holds with high probability. Here ρmin is specified in (4.4), while w* and sw* are defined (4.1).

Proof

For λ specified in (4.5), we verify that w* is a feasible solution in (2.10) with high probability. For notational simplicity, we define the following event

={[Tn(β^)]γ,α[Tn(β^)]γ,γ·w*λ}. (C.21)

By the definition of w* in (4.1), we have [I(β*)]γ − [I(β*)]γ,γ · w* = 0. Hence, we have

[Tn(β^)]γ,α[Tn(β^)]γ,γ·w*=[Tn(β^)+I(β*)]γ,α[Tn(β^)+I(β*)]γ,γ·w*Tn(β^)+I(β*),+Tn(β^)+I(β*),·w*1, (C.22)

where the last inequality is from triangle inequality and Hölder’s inequality. Note that we have

Tn(β^)+I(β*),Tn(β*)+I(β*),+T(β^)Tn(β*),. (C.23)

On the right-hand side, by Lemma 2.1 and Condition 4.3 we have

Tn(β*)+I(β*),=Tn(β*)𝔼β*[Tn(β*)],=O𝕡(ζT),

while by Conditions 4.1 and 4.4 we have

T(β^)Tn(β*),=O𝕡(ζL)·β^β*1=O𝕡(ζL·ζEM).

Plugging the above equations into (C.23) and further plugging (C.23) into (C.22), by (4.5) we have

[Tn(β^)]γ,α[Tn(β^)]γ,γ·w*C·(ζT+ζL·ζEM)·(1+w*1)=λ

holds with high probability for a sufficiently large absolute constant C ≥ 1. In other words, ℰ occurs with high probability. The subsequent proof will be conditioning on ℰ and the following event

={ρ^minρmin/2>0}, (C.24)

which also occurs with high probability according to Lemma C.2. Here ρ̂min is defined in (C.17).

For notational simplicity, we denote w(β̂, λ)= ŵ. By triangle inequality we have

[Tn(β^)]γ,γ·(ŵw*)[Tn(β^)]γ,α[Tn(β^)]γ,γ·w*+[Tn(β^)]γ,γ·ŵ[Tn(β^)]γ,α2·λ, (C.25)

where the last inequality follows from (2.10) and (C.21). Moreover, by (C.17) and (C.24) we have

(ŵw*)·[Tn(β^)]γ,γ·(ŵw*)ρ^min·ŵw*22ρmin/2·ŵw*22. (C.26)

Meanwhile, by Lemma C.1 we have

ŵw*1=(ŵw*)𝒮w*1+(ŵw*)𝒮w*¯12·(ŵw*)𝒮w*12·sw*·ŵw*2.

Plugging the above inequality into (C.26), we obtain

(ŵw*)·[Tn(β^)]γ,γ·(ŵw*)ρmin/(8·sw*)·ŵw*12. (C.27)

Note that by (C.25), the left-hand side of (C.27) is upper bounded by

ŵw*1·[Tn(β^)]γ,γ·(ŵw*)ŵw*1·2·λ. (C.28)

By (C.27) and (C.28), we then obtain ŵw*116·sw*·λ/ρmin conditioning on ℰ and ℰ′, both of which hold with high probability. Note that the proof for w(β̂0, λ) follows similarly. Therefore, we conclude the proof of Lemma C.3.

C.3 Proof of Lemma 5.3

Proof

Our proof strategy is as follows. First we prove that

n·Sn(β^0,λ)=n·[1Qn(β*;β*)]αn·(w*)·[1Qn(β*;β*)]γ+o𝕡(1), (C.29)

where β* is the true parameter and w* is defined in (4.1). We then prove

n·[1Qn(β*;β*)]αn·(w*)·[1Qn(β*;β*)]γDN(0,[I(β*)]α|γ), (C.30)

where [I(β*)]α|γ is defined in (4.2). Throughout the proof, we abbreviate w(β̂0, λ) as ŵ0. Also, it is worth noting that our analysis is under the null hypothesis where β* = [α*, (γ*)] with α* = 0.

Proof of (C.29)

For (C.29), by the definition of the decorrelated score function in (2.9) we have

Sn(β^0,λ)=[1Qn(β^0;β^0)]αŵ0·[1Qn(β^0;β^0)]γ.

By the mean-value theorem, we obtain

Sn(β^0,λ)=[1Qn(β*;β*)]αŵ0·[1Qn(β*;β*)]γ(i)+[Tn(β)]γ,α·(β^0β*)ŵ0·[Tn(β)]γ,γ·(β^0β*)(ii), (C.31)

where we have Tn(β)=1,12Qn(β;β)+1,22Qn(β;β) as defined in (2.8), and β is an intermediate value between β* and β̂0.

Analysis of Term (i)

For term (i) in (C.31), we have

[1Qn(β*;β*)]αŵ0·[1Qn(β*;β*)]γ=[1Qn(β*;β*)]α(w*)·[1Qn(β*;β*)]γ+(w*ŵ0)·[1Qn(β*;β*)]γ. (C.32)

For the right-hand side of (C.32), we have

(w*ŵ0)·[1Qn(β*;β*)]γw*ŵ01·[1Qn(β*;β*)]γ. (C.33)

By Lemma C.3, we have w*ŵ01=O(sw*·λ), where λ is specified in (4.5). Meanwhile, we have

[1Qn(β*;β*)]γ1Qn(β*;β*)=1Qn(β*;β*)1Q(β*;β*)=O𝕡(ζG),

where the first equality follows from the self-consistency property (McLachlan and Krishnan, 2007) that β* = argmaxβ Q(β; β*), which gives ∇1Q(β*; β*) = 0. Here the last equality is from Condition 4.2. Therefore, (C.33) implies

(wŵ0)·[1Qn(β*;β*)]γ=O𝕡(sw*·λ·ζG)=o𝕡(1/n),

where the second equality is from sw*·λ·ζG=o(1/n) in (4.6) of Assumption 4.5. Thus, by (C.32) we conclude that term (i) in (C.31) equals

[1Qn(β*;β*)]α(w*)·[1Qn(β*;β*)]γ+o𝕡(1/n).
Analysis of Term (ii)

By triangle inequality, term (ii) in (C.31) is upper bounded by

|[Tn(β^0)]γ,α·(β^0β*)ŵ0·[Tn(β^0)]γ,γ·(β^0β*)|(ii).a+|[Tn(β)]γ,α·(β^0β*)[Tn(β^0)]γ,α·(β^0β*)|(ii).b+|ŵ0·[Tn(β^0)]γ,γ·(β^0β*)ŵ0·[Tn(β)]γ,γ·(β^0β*)|(ii).c. (C.34)

By Hölder’s inequality, term (ii).a in (C.34) is upper bounded by

β^0β*1·[Tn(β^0)]γ,αŵ0·[Tn(β^0)]γ,γ=β^0β*1·λO(ζEM)·λ=o(1/n), (C.35)

where the first inequality holds because ŵ0 = w(β̂0, λ) is a feasible solution in (2.10). Meanwhile, Condition 4.1 gives ‖β̂β*‖1 = OEM). Also note that by definition we have (β̂0)α = (β*)α = 0, which implies ‖β̂0β*‖1 ≤ ‖β̂β*‖1. Hence, we have

β^0β*1=O(ζEM), (C.36)

which implies the first equality in (C.35). The last equality in (C.35) follows from ζEM·λ=o(1/n) in (4.6) of Assumption 4.5. Note that term (ii).b in (C.34) is upper bounded by

[Tn(β)]γ,α[Tn(β^0)]γ,α·β^0β*1Tn(β)Tn(β^0),·β^0β*1. (C.37)

For the first term on the right-hand side of (C.37), by triangle inequality we have

Tn(β)Tn(β^0),Tn(β)Tn(β*),+Tn(β^0)Tn(β*),.

By Condition 4.4, we have

Tn(β^0)Tn(β*),=O(ζL)·β^0β*1, (C.38)

and

Tn(β)Tn(β*),=O(ζL)·ββ*1O(ζL)·β^0β*1, (C.39)

where the last inequality in (C.39) holds because β is defined as an intermediate value between β* and β̂0. Further by plugging (C.36) into (C.38), (C.39) as well as the second term on the right-hand side of (C.37), we have that term (ii).b in (C.34) is OL · (ζEM)2]. Moreover, by our assumption in (4.6) of Assumption 4.5 we have

ζL·(ζEM)2max{1,w*1}·ζL·(ζEM)2=o(1/n).

Thus, we conclude that term (ii).b is o(1/n). Similarly, term (ii).c in (C.34) is upper bounded by

ŵ01·Tn(β)Tn(β^0),·β^0β*1. (C.40)

By triangle inequality and Lemma C.3, the first term in (C.40) is upper bounded by

w*1+ŵ0w*1=w*1+O𝕡(sw*·λ).

Meanwhile, for the second and third terms in (C.40), by the same analysis for term (ii).b in (C.34) we have

Tn(β)Tn(β^0),·β^0β*1=O𝕡[ζL·(ζEM)2].

By (4.6) in Assumption 4.5, since sw*·λ=o(1), we have

(w*1+sw*·λ)·ζL·(ζEM)2[max{1,w*1}+o(1)]·ζL·(ζEM)2=o(1/n).

Therefore, term (ii).c in (C.34) is o(1/n). Hence, by (C.34) we conclude that term (ii) in (C.31) is o(1/n). Combining the analysis for terms (i) and (ii) in (C.31), we then obtain (C.29). In the sequel, we turn to prove the second part on asymptotic normality.

Proof of (C.30)

Note that by Lemma 2.1, we have

n·[1Qn(β*;β*)]αn·(w*)·[1Qn(β*;β*)]γ=n·[1,(w*)]·1Qn(β*;β*)=n·[1,(w*)]·n(β*)/n. (C.41)

Recall that ℓn(β*) is the log-likelihood function defined in (2.2). Hence, [1, −(w*)] · ∇ℓn(β*)/n is the average of n independent random variables. Meanwhile, the score function has mean zero at β*, i.e., 𝔼[∇ℓn(β*)] = 0. For the variance of the rescaled average in (C.41), we have

Var{n·[1,(w*)]·n(β*)/n}=[1,(w*)]·Cov[n(β*)/n]·[1,(w*)]=[1,(w*)]·I(β*)·[1,(w*)].

Here the second equality is from the fact that the covariance of the score function equals the Fisher information (up to renormalization). Hence, the variance of each item in the average in (C.41) is

[1,(w*)]·I(β*)·[1,(w*)]=[I(β*)]α,α2·(w*)·[I(β*)]γ,α+(w*)·[I(β*)]γ,γ·w*=[I(β*)]α,α[I(β*)]γ,α·[I(β*)]γ,γ1·[I(β*)]γ,α=[I(β*)]α|γ,

where the second and third equalities are from (4.1) and (4.2). Hence, by the central limit theorem we obtain (C.30). Finally, combining (C.29) and (C.30) by invoking Slutsky’s theorem, we obtain

n·Sn(β^0,λ)DN(0,[I(β*)]α|γ),

which concludes the proof of Lemma 5.3.

C.4 Proof of Lemma 5.4

Proof

Throughout the proof, we abbreviate w(β̂0, λ) as ŵ0. Our proof is under the null hypothesis where β* = [α*, (γ*)] with α* = 0. Recall that w* is defined in (4.1). Then by the definitions of [Tn(β̂0)]α|γand [I(β*)]α|γ in (2.11) and (4.2), we have

[Tn(β^0)]α|γ=(1,ŵ0)·Tn(β^0)·(1,ŵ0)=[Tn(β^0)]α,α2·ŵ0·[Tn(β^0)]γ,α+ŵ0·[Tn(β^0)]γ,γ·ŵ0,
[I(β*)]α|γ=[I(β*)]α,α[I(β*)]γ,α·[I(β*)]γ,γ1·[I(β*)]γ,α=[I(β*)]α,α2·(w*)·[I(β*)]γ,α+(w*)·[I(β*)]γ,γ·w*.

By triangle inequality, we have

|[Tn(β^0)]α|γ+[I(β*)]α|γ||[Tn(β^0)]α,α+[I(β*)]α,α|(i)+2·|ŵ0·[Tn(β^0)]γ,α+(w*)·[I(β*)]γ,α|(ii)+|ŵ0·[Tn(β^0)]γ,γ·ŵ0+(w*)·[I(β*)]γ,γ·w*|(iii). (C.42)
Analysis of Term (i)

For term (i) in (C.42), by Lemma 2.1 and triangle inequality we have

|[Tn(β^0)]α,α+[I(β*)]α,α||[Tn(β^0)]α,α[Tn(β*)]α,α|(i).a+|[Tn(β*)]α,α{𝔼β*[Tn(β*)]}α,α|(i).b. (C.43)

For term (i).a in (C.43), by Condition 4.4 we have

|[Tn(β^0)]α,α[Tn(β*)]α,α|Tn(β^0)Tn(β*),=O(ζL)·β^0β*1. (C.44)

Note that we have (β̂0)α = (β*)α = 0 by definition, which implies ‖β̂0β*‖1 ≤ ‖β̂β*‖1. Hence, by Condition 4.1 we have

β^0β*1=O(ζEM). (C.45)

Moreover, combining (C.44) and (C.45), by (4.6) in Assumption 4.5 we have

ζL·ζEMmax{w*1,1}·sw*·λ=o(1)

for λ specified in (4.5). Hence we obtain

|[Tn(β^0)]α,α[Tn(β*)]α,α|Tn(β^0)Tn(β*),=O(ζL·ζEM)=o(1). (C.46)

Meanwhile, for term (i).b in (C.43) we have

|[Tn(β*)]α,α{𝔼β*[Tn(β*)]}α,α|Tn(β*)𝔼β*[Tn(β*)],=O(ζT)=o(1), (C.47)

where the second last equality follows from Condition 4.3, while the last equality holds because our assumption in (4.6) of Assumption 4.5 implies

ζTmax{w*1,1}·sw*·λ=o(1)

for λ specified in (4.5).

Analysis of Term (ii)

For term (ii) in (C.42), by Lemma 2.1 and triangle inequality, we have

|ŵ0·[Tn(β^0)]γ,α+(w*)·[I(β*)]γ,α||(ŵ0w*)·{Tn(β^0)𝔼β*[Tn(β*)]}γ,α|(ii).a+|(ŵ0w*)·{𝔼β*[Tn(β*)]}γ,α|(ii).b+|(w*)·{Tn(β^0)𝔼β*[Tn(β*)]}γ,α|(ii).c. (C.48)

By Hölder’s inequality, term (ii).a in (C.48) is upper bounded by

ŵ0w*1·{Tn(β^0)𝔼β*[Tn(β)*]}γ,αŵ0w*1·Tn(β^0)𝔼β*[Tn(β*)],.

By Lemma C.3, we have ŵ0w*1=O(sw*·λ). Meanwhile, we have

Tn(β^0)𝔼β*[Tn(β*)],Tn(β^0)Tn(β*),+Tn(β*)𝔼β*[Tn(β*)],=o(1).

where the second equality follows from (C.46) and (C.47). Therefore, term (ii).a is o(1), since (4.6) in Assumption 4.5 implies sw*·λ=o(1). Meanwhile, by Hölder’s inequality, term (ii).b in (C.48) is upper bounded by

ŵ0w*1·{𝔼β*[Tn(β*)]}γ,αŵ0w*1·𝔼β*[Tn(β*)],. (C.49)

By Lemma C.3, we have ŵ0w*1=O(sw*·λ). Meanwhile, we have 𝔼β* [Tn(β*)] = −I(β*) by Lemma 2.1. Furthermore, (4.4) in Assumption 4.5 implies

I(β*),I(β*)2C, (C.50)

where C>0 is some absolute constant. Therefore, from (C.49) we have that term (ii).b in (C.48) is O(sw*·λ). By (4.6) in Assumption 4.5, we have sw*·λ=o(1). Thus, we conclude that term (ii).b is o(1). For term (ii).c, we have

|(w*)·{Tn(β^0)𝔼β*[Tn(β*)]}γ,α|w*1·Tn(β^0)𝔼β*[Tn(β*)],w*1·Tn(β^0)Tn(β*),+w*1·Tn(β*)𝔼β*[Tn(β*)],=O(w*1·ζL·ζEM)+O(w*1·ζT)=o(1).

Here the first and second inequalities are from Hölder’s inequality and triangle inequality, the first equality follows from (C.46) and (C.47), and the second equality holds because (4.6) in Assumption 4.5 implies

w*1·(ζL·ζEM+ζT)max{w*1,1}·sw*·λ=o(1)

for λ specified in (4.5).

Analysis of Term (iii)

For term (iii) in (C.42), by (2.12) in Lemma 2.1 we have

|ŵ0·[Tn(β^0)]γ,γ·ŵ0+(w*)·[I(β*)]γ,γ·w*||ŵ0·{Tn(β^0)𝔼β*[Tn(β*)]}γ,γ·ŵ0|(iii).a+|ŵ0·[I(β*)]γ,γ·ŵ0(w*)·[I(β*)]γ,γ·w*|(iii).b. (C.51)

For term (iii).a in (C.51), we have

|ŵ0·{Tn(β^0)𝔼β*[Tn(β*)]}γ,γ·ŵ0|ŵ012·{Tn(β^0)𝔼β*[Tn(β*)]}γ,γ,ŵ012·Tn(β^0)𝔼β*[Tn(β*)],. (C.52)

For ‖ŵ01 we have ŵ012(w*1+ŵ0w*1)2=[w*1+O(sw*·λ)]2, where the equality holds because by Lemma C.3 we have ŵ0w*1=O(sw*·λ). Meanwhile, on the right-hand side of (C.52) we have

Tn(β^0)𝔼β*[Tn(β*)],Tn(β^0)Tn(β*),+Tn(β*)𝔼β*[Tn(β*)],=O(ζL·ζEM+ζT).

Here the last equality is from (C.46) and (C.47). Hence, term (iii).a in (C.51) is O[(w*1+sw*·λ)2·(ζL·ζEM+ζT)]. Note that

(w*1+sw*·λ)2·(ζL·ζEM+ζT)=w*12·(ζL·ζEM+ζT)(i)+2·sw*·λ(ii)·w*1·(ζL·ζEM+ζT)(iii)+(sw*·λ)2(ii)·(ζL·ζEM+ζT)(iv).

From (4.6) in Assumption 4.5 we have, for λ specified in (4.5), terms (i)–(iv) are all upper bounded by max{w*1,1}·sw*·λ=o(1). Hence, we conclude term (iii).a in (C.51) is o(1). Also, for term (iii).b in (C.51), we have

|ŵ0·[I(β*)]γ,γ·ŵ0(w*)·[I(β*)]γ,γ·w*||(ŵ0w*)·[I(β*)]γ,γ·(ŵ0w*)|+2·|w*·[I(β*)]γ,γ·(ŵ0w*)|ŵ0w*12·I(β*),+2·ŵ0w*1·w*1·I(β)*,=O[(sw*·λ)2+w*1·sw*·λ],

where the last equality follows from Lemma C.3 and (C.50). Moreover, by (4.6) in Assumption 4.5 we have max{sw*·λ,w*1·sw*·λ}max{w*1,1}·sw*·λ=o(1). Therefore, we conclude that term (iii).b in (C.51) is o(1). Combining the above analysis for terms (i)–(iii) in (C.42), we obtain

|[Tn(β^0)]α|γ+[I(β*)]α|γ|=o(1).

Thus we conclude the proof of Lemma 5.4.

C.5 Proof of Lemma 5.5

Proof

Our proof strategy is as follows. Recall that w* is defined in (4.1). First we prove

n·[α¯(β^,λ)α*]=n·[I(β*)]α|γ1·{[1Qn(β*;β*)]α(w*)·[1Qn(β*;β*)]γ}+o(1). (C.53)

Here note that [I(β*)]α|γ is defined in (4.2) and β̂ = [α̂, γ̂] is the estimator attained by the high dimensional EM algorithm (Algorithm 4). Then we prove

n·[I(β*)]α|γ1·{[1Qn(β*;β*)]α(w*)·[1Qn(β*;β*)]γ}DN(0,[I(β*)]α|γ1). (C.54)
Proof of (C.53)

For notational simplicity, we define

S¯n(β,ŵ)=[1Qn(β;β)]αŵ·[1Qn(β;β)]γ, where ŵ=ω(β^,λ). (C.55)

By the definition of Sn(·,·) in (2.9), we have n (β̂, ŵ) = Sn (β̂, λ). Let β̃ = (α*, γ̂). The Taylor expansion of n (β̂, ŵ) takes the form

S¯n(β^,w^)=S¯n(β˜,ŵ)+(α^α*)·[S¯n(β,ŵ)]α, (C.56)

where β is an intermediate value between β̂ and β̃. By (C.55) and the definition of Tn(·) in (2.8), we have

[S¯n(β^,w^)]α=[1,1Qn(β^;β^)+1,2Qn(β^;β^)]α,αŵ·[1,1Qn(β^;β^)+1,2Qn(β^;β^)]γ,α=[T(β^)]α,αŵ·[T(β^)]γ,α. (C.57)

By (C.57) and the definition in (2.18), we have

α¯(β^,λ)=α^{[T(β^)]α,αŵ·[T(β^)]γ,α}1·S¯n(β^,ŵ)=α^[S¯n(β^,ŵ)]α1·S¯n(β^,ŵ).

Further by (C.56) we obtain

n·[α¯(β^,λ)α*]=n·(α^α*)n·[S¯n(β^,ŵ)]α1·S¯n(β^,ŵ)=n·[S¯n(β^,ŵ)]α1·S¯n(β˜,ŵ)(i)+n·(α^α*)·{1[S¯n(β^,ŵ)]α1·[S¯n(β,ŵ)]α}(ii). (C.58)
Analysis of Term (i)

For term (i) in (C.58), in the sequel we first prove

[S¯n(β^,ŵ)]α=[I(β*)]α|γ+oP(1). (C.59)

By the definition of [I(β*)]α|γ in (4.1) and the definition of w* in (C.53), we have

[I(β*)]α|γ=[I(β*)]α,α[I(β*)]γ,α·[I(β*)]γ,γ1·[I(β*)]γ,α=[I(β*)]α,α(w*)·[I(β*)]γ,α.

Together with (C.57), by triangle inequality we obtain

|[S¯n(β^,ŵ)]α+[I(β*)]α|γ||[Tn(β^)]α,α+[I(β*)]α,α|(i).a+|ŵ·[Tn(β^)]γ,α+(w*)·[I(β*)]γ,α|(i).b. (C.60)

Note that term (i).a in (C.60) is upper bounded by

Tn(β^)+I(β*),Tn(β^)Tn(β*),+Tn(β*)𝔼β*[Tn(β*)],. (C.61)

Here the inequality is from Lemma 2.1 and triangle inequality. For the first term on the right-hand side of (C.61), by Conditions 4.1 and 4.4 we have

Tn(β^)Tn(β*),=O(ζL)·β^β*1=O(ζL·ζEM). (C.62)

For the second term on the right-hand side of (C.61), by Condition 4.3 we have

Tn(β*)𝔼β*[Tn(β*)],=O(ζT). (C.63)

Plugging (C.62) and (C.63) into (C.61), we obtain

Tn(β^)+I(β*),=O(ζL·ζEM+ζT). (C.64)

By (4.6) in Assumption 4.5, we have

ζL·ζEM+ζTmax{w*1,1}·sw*·λ=o(1)

for λ specified in (4.5). Hence, we conclude that term (i).a in (C.60) is o𝕡(1). Meanwhile, for term (i).b in (C.60), by triangle inequality we have

|w^·[Tn(β^)]γ,α+(w*)·[I(β*)]γ,α||(w^w*)·[Tn(β^)+I(β*)]γ,α|+|(w*w^)·[I(β*)]γ,α|+|(w*)·[Tn(β^)+I(β*)]γ,α|. (C.65)

For the first term on the right hand-side of (C.65), we have

|(w^w*)·[Tn(β^)+I(β*)]γ,α|w^w*1·Tn(β^)+I(β*),=O[sw*·λ·(ζT+ζL·ζEM)]=o(1).

The inequality is from Hölder’s inequality, the second last equality is from Lemma C.3 and (C.64), and the last equality holds because (4.6) in Assumption 4.5 implies

max{sw*·λ,ζT+ζL·ζEM}max{w*1,1}·sw*·λ=o(1)

for λ specified in (4.5). For the second term on the right hand-side of (C.65), we have

|(w*w^)·[I(β*)]γ,α|w^w*1·I(β*),O(sw*·λ)·I(β*)2=o(1),

where the second inequality is from Lemma C.3, while the last equality follows from (4.4) and (4.6) in Assumption 4.5. For the third term on the right hand-side of (C.65), we have

|(w*)·[Tn(β^)+I(β*)]γ,α|w*1·Tn(β^)+I(β*),=w*1·O(ζL·ζEM+ζT)=o(1).

Here the inequality is from Hölder’s inequality, the first equality is from (C.64) and the last equality holds because (4.6) in Assumption 4.5 implies

w*1·(ζL·ζEM+ζT)max{w*1,1}·sw*·λ=o(1)

for λ specified in (4.5). By (C.65), we conclude that term (i).b in (C.60) is o𝕡(1). Hence, we obtain (C.59). Furthermore, for term (i) in (C.58), we then replace β̂0 = (0, γ̂) with β̃ = (α*, γ̂) and ŵ0 = w (β̂0, λ) with ŵ = w (β̂, λ) in the proof of (C.29) in §C.3. We obtain

n·S¯n(β˜,w^)=n·[1Qn(β*;β*)]αn·(w*)·[1Qn(β*;β*)]γ+o(1),

which further implies

n·[S¯n(β^,ŵ)]α1·S¯n(β˜,ŵ)=n·[I(β*)]α|γ1·{[1Qn(β*;β*)]α(w*)·[1Qn(β*;β*)]γ}+o(1)

by (C.59) and Slutsky’s theorem.

Analysis of Term (ii)

Now we prove that term (ii) in (C.58) is o𝕡(1). We have

n·(α^α*)·{1[S¯n(β^,ŵ)]α1·[S¯n(β,ŵ)]α}n·|α^α*|(ii).a·|1[S¯n(β^,ŵ)]α1·[S¯n(β,ŵ)]α|(ii).b. (C.66)

For term (ii).a in (C.66), by Condition 4.1 and (4.6) in Assumption 4.5 we have

|α^α*|β^β*1=O(ζEM)=o(1). (C.67)

Meanwhile, by replacing β̂ with β̃ = (α*, γ̂) in the proof of (C.59) we obtain

[S¯n(β˜,ŵ)]α=[I(β*)]α|γ+o(1). (C.68)

Combining (C.59) and (C.68), for term (ii).b in (C.66) we obtain

|1[S¯n(β^,ŵ)]α1·[S¯n(β,ŵ)]α|=o(1),

which together with (C.67) implies that term (ii) in (C.58) is o𝕡(1). Plugging the above results into terms (i) and (ii) in (C.58), we obtain (C.53).

Proof of (C.54)

By (C.30) in §C.3, we have

n·[1Qn(β*;β*)]αn·(w*)·[1Qn(β*;β*)]γDN(0,[I(β*)]α|γ),

which implies (C.54). Finally, combining (C.53) and (C.54) with Slutsky’s theorem, we obtain

n·[α¯(β^,λ)α*]DN(0,[I(β*)]α|γ1).

Thus we conclude the proof of Lemma 5.5.

C.6 Proof of Lemma 4.8

Proof

According to Algorithm 4, the final estimator β̂=β(T) has ŝ nonzero entries. Meanwhile, we have ‖β*‖0 = s* ≤ ŝ. Hence, we have β^β*12·ŝ·β^β*2. Invoking (3.19) in Theorem 3.7, we obtain ζEM.

For Gaussian mixture model, the maximization implementation of the M-step takes the form

Mn(β)=2ni=1nωβ(yi)·yi1ni=1nyi, and M(β)=𝔼[2·ωβ(Y)·YY],

where ωβ(·) is defined in (A.1). Meanwhile, we have

1Qn(β;β)=1ni=1n[2·ωβ(yi)1]·yiβ, and 1Q(β;β)=𝔼[2·ωβ(Y)Y]β.

Hence, we have ‖Mn(β) − M(β)‖ = ‖∇1Qn(β; β)−∇1Q(β; β)‖. By setting δ = 2/d in Lemma 3.6, we obtain ζG.

C.7 Proof of Lemma 4.9

Proof

Recall that for Gaussian mixture model we have

Qn(β;β)=12ni=1n{ωβ(yi)·yiβ22+[1ωβ(yi)]·yi+β22},

where ωβ (·) is defined in (A.1). Hence, by calculation we have

1Qn(β;β)=1ni=1n[2·ωβ(yi)1]·yiβ,1,12Qn(β;β)=Id, (C.69)
1,22Qn(β;β)=4ni=1nyi·yiσ2·[1+exp(2·β,yi/σ2)]·[1+exp(2·β,yi/σ2)]. (C.70)

For notational simplicity we define

νβ(y)=4σ2·[1+exp(2·β,y/σ2)]·[1+exp(2·β,y/σ2)]. (C.71)

Then by the definition of Tn(·) in (2.8), from (C.69) and (C.70) we have

{Tn(β*)𝔼β*[Tn(β*)]}j,k=1ni=1nνβ*(yi)·yi,j·yi,k𝔼β*[νβ*(Y)·Yj·Yk].

Applying the symmetrization result in Lemma D.4 to the right-hand side, we have that for u > 0,

𝔼β*{exp[u·|{Tn(β*)𝔼β*[Tn(β*)]}j,k|]}𝔼β*{exp[u·|1ni=1nξi·νβ*(yi)·yi,j·yi,k|]}, (C.72)

where ξ1, …, ξn are i.i.d. Rademacher random variables that are independent of y1, …, yn. Then we invoke the contraction result in Lemma D.5 by setting

f(yi,j·yi,k)=yi,j·yi,k,={f},ψi(υ)=νβ*(yi)·υ, and ϕ(υ)=exp(u·υ),

where u is the variable of the moment generating function in (C.72). By the definition in (C.71) we have |νβ*(yi)| ≤ 4/σ2, which implies

|ψi(υ)ψi(υ)||νβ*(yi)·(υυ)|4/σ2·|υυ|,  for all υ,υ.

Therefore, applying the contraction result in Lemma D.5 to the right-hand side of (C.72), we obtain

𝔼β*{exp[u·|{Tn(β*)𝔼β*[Tn(β*)]}j,k|]}𝔼β*{exp[u·4/σ2·|1ni=1nξi·yi,j·yi,k|]}. (C.73)

Note that 𝔼β*i · yi,j · yi,k) = 0, since ξi is a Rademacher random variable independent of yi,j · yi,k. Recall that in Gaussian mixture model we have yi,j=zi·βj*+υi,j, where zi is a Rademacher random variable and υi,j ~ N(0, σ2). Hence, by Example 5.8 in Vershynin (2010), we have zi·βj*ψ2|βj*| and ‖ υi,jψ2C · σ. Therefore, by Lemma D.1 we have

yi,jψ2=zi·βj*+υi,jψ2C·|βj*|2+C·σ2C·β*2+C·σ2. (C.74)

Since |ξi · yi,j · yi,k|=|yi,j · yi,k|, by definition ξi · yi,j · yi,k and yi,j · yi,k have the same ψ1-norm. By Lemma D.2 we have

ξi·yi,j·yi,kψ1C·max{yi,jψ22,yi,kψ22}C·(β*2+C·σ2).

Then by Lemma 5.15 in Vershynin (2010), we obtain

𝔼β*[exp(u·ξi·yi,j·yi,k)]exp[(u)2·C·(β*2+C·σ2)] (C.75)

for all |u|C/(β*2+C·σ2). Note that on the right-hand side of (C.73), we have

𝔼β*{exp[u·4/σ2·|1ni=1nξi·yi,j·yi,k|]}𝔼β*(max{exp[u·4/σ2·1ni=1nξi·yi,j·yi,k], exp[u·4/σ2·1ni=1nξi·yi,j·yi,k]})𝔼β*{exp[u·4/σ2·1ni=1nξi·yi,j·yi,k]}+𝔼β*{exp[u·4/σ2·1ni=1nξi·yi,j·yi,k]}. (C.76)

By plugging (C.75) into the right-hand side of (C.76) with u′ = u · 4/(σ2 · n) and u′ = −u · 4/(σ2 · n), from (C.73) we have that for any j, k ∈ {1, …, d},

𝔼β*{exp[u·|{Tn(β*)𝔼β*[Tn(β*)]}j,k|]}2·exp[C·u2/n·(β*2+C·σ2)2/σ4]. (C.77)

Therefore, by Chernoff bound we have that, for all υ > 0 and |u|C/(β*2+C·σ2),

[Tn(β*)𝔼β*[Tn(β*)],>υ]𝔼β*{exp[u·Tn(β*)𝔼β*[Tn(β*)],]}/exp(u·υ)j=1dk=1d𝔼β*{exp[u·|{Tn(β*)𝔼β*[Tn(β*)]}j,k|]}/exp(u·υ)2·exp[C·u2/n·(β*2+C·σ2)2/σ4u·υ+2·log d],

where the last inequality is from (C.77). By minimizing its right-hand side over u, we conclude that for 0<υC·(β*2+C·σ2),

[Tn(β*)𝔼β*[Tn(β*)],>υ]2·exp{n·υ2/[C·(β*2+C·σ2)2/σ4]+2·log d}.

Setting the right-hand side to be δ, we have that

Tn(β*)𝔼β*[Tn(β*)],υ=C·(β*2+C·σ2)/σ2·log(2/δ)+2·log dn

holds with probability at least 1 − δ. By setting δ = 2/d, we conclude the proof of Lemma 4.9.

C.8 Proof of Lemma 4.10

Proof

For any j, k ∈ {1, …, d}, by the mean-value theorem we have

Tn(β)Tn(β*),=maxj,k{1,,d}|[Tn(β)]j,k[Tn(β*)]j,k|=maxj,k{1,,d}|(ββ*)·[Tn(β)]j,k|ββ*1·maxj,k{1,,d}[Tn(β)]j,k, (C.78)

where β is an intermediate value between β and β*. According to (C.69), (C.70) and the definition of Tn(·) in (2.8), by calculation we have

[Tn(β)]j,k=1ni=1nν¯β(yi)·yi,j·yi,k·yi,

where

ν¯β(y)=8/σ4[1+exp(2·β,y/σ2)]·[1+exp(2·β,y/σ2)]28/σ4[1+exp(2·β,y/σ2)]2·[1+exp(2·β,y/σ2)]. (C.79)

For notational simplicity, we define the following event

={yiτ, for all i=1,,n},

where τ > 0 will be specified later. By maximal inequality we have

{[Tn(β)]j,k>υ}d·({|[Tn(β)]j,k|}l>υ)=d·[|1ni=1nν¯β(yi)·yi,j·yi,k·yi,l|>υ]. (C.80)

Let ℰ̅ be the complement of ℰ. On the right-hand side of (C.80) we have

[|1ni=1nν¯β(yi)·yi,j·yi,k·yi,l|>υ]=[|1ni=1nν¯β(yi)·yi,j·yi,k·yi,l|>υ,](i)+(¯)(ii). (C.81)
Analysis of Term (i)

For term (i) in (C.81), we have

[|1ni=1nν¯β(yi)·yi,j·yi,k·yi,l|>υ,]=[|1ni=1nν¯β(yi)·yi,j·yi,k·yi,l·𝟙{yiτ}|>υ,][|1ni=1nν¯β(yi)·yi,j·yi,k·yi,l·𝟙{yiτ}|>υ]i=1n[|ν¯β(yi)·yi,j·yi,k·yi,l·𝟙{yiτ}|>υ],

where the last inequality is from union bound. By (C.79) we have |ν̅β(yi)|≤16/σ4. Thus we obtain

[|ν¯β(yi)·yi,j·yi,k·yi,l·𝟙{yiτ}|>υ][|yi,j·yi,k·yi,l·𝟙{yiτ}|>σ4/16·υ].

Taking υ = 16 · τ34, we have that the right-hand side is zero and hence term (i) in (C.81) is zero.

Analysis of Term (ii)

Meanwhile, for term (ii) in (C.81) by maximal inequality we have

(¯)=(maxi{1,,n}yi>τ)n·(yi>τ)n·d·(|yi,j|>τ).

Furthermore, by (C.74) in the proof of Lemma 4.8, we have that yi,j is sub-Gaussian with yi,jψ2=C·β*2+C·σ2. Therefore, by Lemma 5.5 in Vershynin (2010) we have

(¯)n·d·(|yi,j|>τ)n·d·2·exp[C·τ2/(β*2+C·σ2)].

To ensure the right-hand side is upper bounded by δ, we set τ to be

τ=C·β*2+C·σ2·log d+log n+log(2/δ). (C.82)

Finally, by (C.80), (C.81) and maximal inequality we have

{maxj,k{1,,d}[Tn(β)]j,k>υ}d2·d·[|1ni=1nν¯β(yi)·yi,j·yi,k·yi,l|>υ]d3·δ

for υ = 16 · τ34 with τ specified in (C.82). By setting δ = 2 · d−4 and plugging (C.82) into (C.78), we conclude the proof of Lemma 4.10.

C.9 Proof of Lemma 4.12

By the same proof of Lemma 4.8 in §C.6, we obtain ζEM by invoking Theorem 3.10. To obtain ζG, note that for the gradient implementation of the M-step (Algorithm 3), we have

Mn(β)=β+η·1Qn(β;β), and M(β)=β+η·1Q(β;β).

Hence, we obtain ‖∇1Qn(β*; β*) − ∇1Q(β*; β*)‖ = 1/η · ‖Mn (β*) − M(β*)‖ . Setting δ = 4/d and s = s* in (3.22) of Lemma 3.9, we then obtain ζG.

C.10 Proof of Lemma 4.13

Proof

Recall that for mixture of regression model, we have

Qn(β;β)=12ni=1n{ωβ(xi,yi)·(yixi,β)2+[1ωβ(xi,yi)]·(yi+xi,β)2},

where ωβ(·) is defined in (A.3). Hence, by calculation we have

1Qn(β,β)=1ni=1n[2·ωβ(xi,yi)·yi·xixi·xi·β],1,12Qn(β,β)=1ni=1nxi·xi, (C.83)
1,22Qn(β,β)=4ni=1nyi2·xi·xiσ2·[1+exp(2·yi·β,xi/σ2)]·[1+exp(2·yi·β,xi/σ2)]. (C.84)

For notational simplicity we define

νβ(x,y)=4σ2·[1+exp(2·y·β,x/σ2)]·[1+exp(2·y·β,x/σ2)]. (C.85)

Then by the definition of Tn(·) in (2.8), from (C.83) and (C.84) we have

{Tn(β*)𝔼β*[Tn(β*)]}j,k=1ni=1nνβ*(xi,yi)·xi,j·xi,k·yi2𝔼β*[νβ*(Y,X)·Xj·Xk·Yi2].

Applying the symmetrization result in Lemma D.4 to the right-hand side, we have that for u > 0,

𝔼β*{exp[u·|{Tn(β*)𝔼β*[Tn(β*)]}j,k|]}𝔼β*{exp[u·|1ni=1nξi·νβ*(xi,yi)·xi,j·xi,k·yi2|]}, (C.86)

where ξ1, …, ξn are i.i.d. Rademacher random variables, which are independent of x1, …, xn and y1, …, yn. Then we invoke the contraction result in Lemma D.5 by setting

f(xi,j·xi,k·yi2)=xi,j·xi,k·yi2,={f},ψi(υ)=νβ*(xi,yi)·υ, and ϕ(υ)=exp(u·υ),

where u is the variable of the moment generating function in (C.86). By the definition in (C.85) we have |νβ*(xi, yi)| ≤ 4/σ2, which implies

|ψi(υ)ψi(υ)||νβ*(xi,yi)·(υυ)|4/σ2·|υυ|, for all υ,υ.

Therefore, applying Lemma D.5 to the right-hand side of (C.86), we obtain

𝔼β*{exp[u·|{Tn(β*)𝔼β*[Tn(β*)]}j,k|]}𝔼β*{exp[u·4/σ2·|1ni=1nξi·xi,j·xi,k·yi2|]}. (C.87)

For notational simplicity, we define the following event

={xiτ, for all i=1,,n}.

Let ℰ̅ be the complement of ℰ. We consider the following tail probability

[4/σ2·|1ni=1nξi·xi,j·xi,k·yi2|>υ][4/σ2·|1ni=1nξi·xi,j·xi,k·yi2|>υ,](i)+(¯)(ii). (C.88)
Analysis of Term (i)

For term (i) in (C.88), we have

[4/σ2·|1ni=1nξi·xi,j·xi,k·yi2|>υ,]=[4/σ2·|1ni=1nξi·xi,j·xi,k·yi2·𝟙{xiτ}|>υ,][4/σ2·|1ni=1nξi·xi,j·xi,k·yi2·𝟙{xiτ}|>υ].

Here note that 𝔼β*(ξi·xi,j·xi,k·yi2·𝟙{xiτ})=0, because ξi is a Rademacher random variable independent of xi and yi. Recall that for mixture of regression model we have yi = zi · 〈β*, xi〉 + υi, where zi is a Rademacher random variable, xi ~ N(0, Id) and υi ~ N(0, σ2). According to Example 5.8 in Vershynin (2010), we have ‖zi · 〈β*, xi〉 · 𝟙 {‖xi ≤ τ}‖ψ2 = ‖〈β*, xi〉 · 𝟙 {‖xi ≤ τ}‖ψ2 ≤ τ · ‖β*‖1 and ‖υi · 𝟙 {‖xi ≤ τ}‖ψ2 ≤ ‖υiψ2C · σ. Hence, by Lemma D.1 we have

yi·𝟙{xiτ}ψ2=zi·β*,xi·𝟙{xiτ}+υi·𝟙{xiτ}ψ2C·τ2·β*12+C·σ2. (C.89)

By the definition of ψ1-norm, we have ξi·xi,j·xi,k·yi2·𝟙{xiτ}ψ1τ2·yi2·𝟙{xiτ}ψ1. Further by applying Lemma D.2 to its right-hand side with Z1 = Z2 = yi · 𝟙 {‖xi≤τ}, we obtain

ξi·xi,j·xi,k·yi2·𝟙{xiτ}ψ1C·τ2·yi·𝟙{xiτ}ψ22C·τ2·(τ2·β*12+C·σ2),

where the last inequality follows from (C.89). Therefore, by Bernstein’s inequality (Proposition 5.16 in Vershynin (2010)), we obtain

[4/σ2·|1ni=1nξi·xi,j·xi,k·yi2·𝟙{xiτ}|>υ]2·exp[C·n·υ2·σ4τ4·(τ2·β*12+C·σ2)2], (C.90)

for all 0υC·τ2·(β*12+C·σ2) and a sufficiently large sample size n.

Analysis of Term (ii)

For term (ii) in (C.88), by union bound we have

(¯)=(max1inxi>τ)n·(xi>τ)n·d·(|xi,j|>τ).

Moreover, xi,j is sub-Gaussian with ‖xi,jψ2 = C. Thus, by Lemma 5.5 in Vershynin (2010) we have

(¯)n·d·2·exp(C·τ2)=2·exp(C·τ2+log n+log d). (C.91)

Plugging (C.90) and (C.91) into (C.88), we obtain

[4/σ2·|1ni=1nξi·xi,j·xi,k·yi2|>υ]2·exp [C·n·υ2·σ4τ4·(τ2·β*12+C·σ2)2]+2·exp(C·τ2+log n+log d). (C.92)

Note that (C.87) is obtained by applying Lemmas D.4 and D.5 with ϕ(υ) = exp(u·υ). Since Lemmas D.4 and D.5 allow any increasing convex function ϕ(·), similar results hold correspondingly. Hence, applying Panchenko’s theorem in Lemma D.6 to (C.87), from (C.92) we have

[|{Tn(β*)𝔼β*[Tn(β*)]}j,k|>υ]2·e·exp [C·n·υ2·σ4τ4·(τ2·β*12+C·σ2)2]+2·e·exp(C·τ2+log n+log d).

Furthermore, by union bound we have

[Tn(β*)𝔼β*[Tn(β*)],>υ]j=1dk=1d[|{Tn(β*)𝔼β*[Tn(β*)]}j,k|>υ]2·e·exp [C·n·υ2·σ4τ4·(τ2·β*12+C·σ2)2+2·log d]+2·e·exp(C·τ2+log n+3·log d). (C.93)

To ensure the right-hand side is upper bounded by δ, we set the second term on the right-hand side of (C.93) to be δ/2. Then we obtain

τ=C·log n+3·log d+log(4·e/δ).

Let the first term on the right-hand side of (C.93) be upper bounded by δ/2 and plugging in τ, we then obtain

υ=C·[log n+3·log d+log(4·e/δ)]·{[log n+3·log d+log(4·e/δ)]·β*12+C·σ2}/σ2·log(4·e/δ)+2·log dn.

Therefore, by setting δ = 4 · e/d we have that

Tn(β*)𝔼β*[Tn(β*)],υ=C·(log n+4·log d)·[(log n+4·log d)·β*12+C·σ2]/σ2·log dn

holds with probability at least 1 – 4 · e/d, which completes the proof of Lemma 4.13.

C.11 Proof of Lemma 4.14

Proof

For any j, k ∈ {1, …, d}, by the mean-value theorem we have

Tn(β)Tn(β*),=maxj,k{1,,d}|[Tn(β)]j,k[Tn(β*)]j,k|=maxj,k{1,,d}|(ββ*)·[Tn(β)]j,k|ββ*1·maxj,k{1,,d}[Tn(β)]j,k, (C.94)

where β is an intermediate value between β and β*. According to (C.83), (C.84) and the definition of Tn(·) in (2.8), by calculation we have

[Tn(β)]j,k=1ni=1nν¯β(xi,yi)·yi3·xi,j·xi,k·xi,

where

ν¯β(x,y)=8/σ4[1+exp(2·y·β,x/σ2)]·[1+exp(2·y·β,x/σ2)]28/σ4[1+exp(2·y·β,x/σ2)]2·[1+exp(2·y·β,x/σ2)]. (C.95)

For notational simplicity, we define the following events

={xiτ, for all i=1,,n}, and ={|υi|τ, for all i=1,,n},

where τ > 0 and τ′ > 0 will be specified later. By union bound we have

{[Tn(β)]j,k>υ}d·({|[Tn(β)]j,k|}l>υ)=d·[|1ni=1nν¯β(xi,yi)·yi3·xi,j·xi,k·xi,l|>υ]. (C.96)

Let ℰ̅ and ℰ̅′ be the complement of ℰ and ℰ′ respectively. On the right-hand side we have

[|1ni=1nν¯β(xi,yi)·yi3·xi,j·xi,k·xi,l|>υ]=[|1ni=1nν¯β(xi,yi)·yi3·xi,j·xi,k·xi,l|>υ,,](i)+(¯)(ii)+(¯)(iii). (C.97)
Analysis of Term (i)

For term (i) in (C.97), we have

[|1ni=1nν¯β(xi,yi)·yi3·xi,j·xi,k·xi,l|>υ,,]=[|1ni=1nν¯β(xi,yi)·yi3·xi,j·xi,k·xi,l·𝟙{xiτ}·𝟙{|υi|τ}|>υ,,][|1ni=1nν¯β(xi,yi)·yi3·xi,j·xi,k·xi,l·𝟙{xiτ}·𝟙{|υi|τ}|>υ].

To avoid confusion, note that υi is the noise in mixture of regression model, while υ appears in the tail bound. By applying union bound to the right-hand side of the above inequality, we have

[|1ni=1nν¯β(xi,yi)·yi3·xi,j·xi,k·xi,l|>υ,,]i=1n[|ν¯β(xi,yi)·yi3·xi,j·xi,k·xi,l·𝟙{xiτ}·𝟙{|υi|τ}|>υ].

By (C.95) we have |ν̅β(xi, yi)| ≤ 16/σ4. Hence, we obtain

[|ν¯β(xi,yi)·yi3·xi,j·xi,k·xi,l·𝟙{xiτ}·𝟙{|υi|τ}|>υ][|yi3·xi,j·xi,k·xi,l·𝟙{xiτ}·𝟙{|υi|τ}|>σ4/16·υ]. (C.98)

Recall that in mixture of regression model we have yi = zi · 〈β*, xi〉 + υi, where zi is a Rademacher random variable, xi ~ N(0, Id) and υi ~ N(0, σ2). Hence, we have

|yi3·𝟙{xiτ}·𝟙{|υi|τ}|(|zi·β*,xi·𝟙{xiτ}|+|υi·𝟙{|υi|τ}|)3(τ·β*1+τ)3,
|xi,j·xi,k·xi,l·𝟙{xiτ}||xi,j·𝟙{xiτ}|3τ3.

Taking υ = 16 · (τ · ‖β*‖1 + τ′)3 · τ34, we have that the right-hand side of (C.98) is zero. Hence term (i) in (C.97) is zero.

Analysis of Term (ii)

For term (ii) in (C.97), by union bound we have

(¯)=(maxi{1,,n}xi>τ)n·(xi>τ)n·d·(|xi,j|>τ).

Moreover, we have that xi,j is sub-Gaussian with ‖xi,jψ2 = C. Therefore, by Lemma 5.5 in Vershynin (2010) we have

(¯)n·d·(|xi,j|>τ)n·d·2·exp(C·τ2).
Analysis of Term (iii)

Since υi is sub-Gaussian with ‖υiψ2 = C · σ, by Lemma 5.5 in Vershynin (2010) and union bound, for term (iii) in (C.97) we have

(¯)n·(|υi|>τ)n·2·exp(C·τ2/σ2).

To ensure the right-hand side of (C.97) is upper bounded by δ, we set τ and τ′ to be

τ=C·log d+log n+log(4/δ), and τ=C·σ·log n+log(4/δ) (C.99)

to ensure terms (ii) and (iii) are upper bounded by δ/2 correspondingly. Finally, by (C.96), (C.97) and union bound we have

{maxj,k{1,,d}[Tn(β)]j,k>υ}d2·d·[|1ni=1nν¯β(xi,yi)·yi3·xi,j·xi,k·xi,l|>υ]d3·δ

for υ = 16 · (τ · ‖β*‖1 + τ′)3 · τ34 with τ and τ′ specified in (C.99). Then by setting δ = 4 · d−4 and plugging it into (C.99), we have

υ=16·[C·5·log d+log n·β*1+C·σ·4·log d+log n]3·[C·5·log d+log n]3/σ4C·(β*1+C·σ)3·(5·log d+log n)3,

which together with (C.94) concludes the proof of Lemma 4.14.

D Auxiliary Results

In this section, we lay out several auxiliary lemmas. Lemmas D.1–D.3 provide useful properties of sub-Gaussian random variables. Lemmas D.4 and D.5 establish the symmetrization and contraction results. Lemma D.6 is Panchenko’s theorem. For more details of these results, see Vershynin (2010); Boucheron et al. (2013).

Lemma D.1

Let Z1, …, Zk be the k independent zero-mean sub-Gaussian random variables, for Z=j=1kZj we have Zψ22C·j=1kZjψ22, where C > 0 is an absolute constant.

Lemma D.2

For Z1 and Z2 being two sub-Gaussian random variables, Z1 · Z2 is a sub-exponential random variable with

Z1·Z2ψ1C·max{Z1ψ22,Z2ψ22},

where C > 0 is an absolute constant.

Lemma D.3

For Z being sub-Gaussian or sub-exponential, it holds that ‖Z − 𝔼Zψ2 ≤ 2 · ‖Zψ2 or ‖Z − 𝔼Zψ1 ≤ 2 · ‖Zψ1 correspondingly.

Lemma D.4

Let z1, …, zn be the n independent realizations of the random vector Z ∈ 𝒵 and ℱ be a function class defined on 𝒵. For any increasing convex function ϕ(·) we have

𝔼{ϕ[supf|i=1nf(zi)𝔼Z|]}𝔼{ϕ[supf|i=1nξi·f(zi)|]},

where ξ1, .…, ξn are i.i.d. Rademacher random variables that are independent of z1, …, zn.

Lemma D.5

Let z1, …, zn be the n independent realizations of the random vector Z ∈ 𝒵 and ℱ be a function class defined on 𝒵. We consider the Lipschitz functions ψi(·) (i = 1, …, n) that satisfy

|ψi(υ)ψi(υ)|L·|υυ|, for all υ,υ,

and ψi(0) = 0. For any increasing convex function ϕ(·) we have

𝔼{ϕ[|supfi=1nξi·ψi[f(zi)]|]}𝔼{ϕ[2·|L·supfi=1nξi·f(zi)|]},

where ξ1, …, ξn are i.i.d. Rademacher random variables that are independent of z1, …, zn.

Lemma D.6

Suppose that Z1 and Z2 are two random variables that satisfy 𝔼[ϕ(Z1)] ≤ 𝔼[ϕ(Z2)] for any increasing convex function ϕ(·). Assuming that ℙ(Z1 ≥ υ) ≤ C · exp(−C′ · υα) (α ≥ 1) holds for all υ ≥ 0, we have ℙ(Z2 ≥ υ) ≤ C · exp(1 − C′ · υα).

References

  1. Anandkumar A, Ge R, Hsu D, Kakade SM, Telgarsky M. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research. 2014a;15:2773–2832. [Google Scholar]
  2. Anandkumar A, Ge R, Janzamin M. Analyzing tensor power method dynamics: Applications to learning overcomplete latent variable models. arXiv preprint arXiv:1411.1488. 2014b [Google Scholar]
  3. Anandkumar A, Ge R, Janzamin M. Provable learning of overcomplete latent variable models: Semi-supervised and unsupervised settings. arXiv preprint arXiv:1408.0553. 2014c [Google Scholar]
  4. Balakrishnan S, Wainwright MJ, Yu B. Statistical guarantees for the EM algorithm: From population to sample-based analysis. arXiv preprint arXiv:1408.2156. 2014 [Google Scholar]
  5. Bartholomew DJ, Knott M, Moustaki I. Latent variable models and factor analysis: A Unified Approach. Vol. 899. Wiley; 2011. [Google Scholar]
  6. Belloni A, Chen D, Chernozhukov V, Hansen C. Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica. 2012;80:2369–2429. [Google Scholar]
  7. Belloni A, Chernozhukov V, Hansen C. Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies. 2014;81:608–650. [Google Scholar]
  8. Belloni A, Chernozhukov V, Wei Y. Honest confidence regions for a regression parameter in logistic regression with a large number of controls. arXiv preprint arXiv:1304.3969. 2013 [Google Scholar]
  9. Bickel PJ. One-step Huber estimates in the linear model. Journal of the American Statistical Association. 1975;70:428–434. [Google Scholar]
  10. Bickel PJ, Ritov Y, Tsybakov AB. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics. 2009;37:1705–1732. [Google Scholar]
  11. Boucheron S, Lugosi G, Massart P. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press; 2013. [Google Scholar]
  12. Cai T, Liu W, Luo X. A constrained ℓ1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association. 2011;106:594–607. [Google Scholar]
  13. Candès E, Tao T. The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics. 2007;35:2313–2351. [Google Scholar]
  14. Chaganty AT, Liang P. Spectral experts for estimating mixtures of linear regressions. arXiv preprint arXiv:1306.3729. 2013 [Google Scholar]
  15. Chaudhuri K, Dasgupta S, Vattani A. Learning mixtures of Gaussians using the k-means algorithm. arXiv preprint arXiv:0912.0086. 2009 [Google Scholar]
  16. Chrétien S, Hero AO. On EM algorithms and their proximal generalizations. ESAIM: Probability and Statistics. 2008;12:308–326. [Google Scholar]
  17. Dasgupta S, Schulman L. A probabilistic analysis of EM for mixtures of separated, spherical Gaussians. Journal of Machine Learning Research. 2007;8:203–226. [Google Scholar]
  18. Dasgupta S, Schulman LJ. A two-round variant of EM for Gaussian mixtures. In Uncertainty in artificial intelligence, 2000. 2000 [Google Scholar]
  19. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Statistical Methodology) 1977;39:1–38. [Google Scholar]
  20. Javanmard A, Montanari A. Confidence intervals and hypothesis testing for high-dimensional regression. Journal of Machine Learning Research. 2014;15:2869–2909. [Google Scholar]
  21. Khalili A, Chen J. Variables selection in finite mixture of regression models. Journal of the American Statistical Association. 2007;102:1025–1038. [Google Scholar]
  22. Knight K, Fu W. Asymptotics for Lasso-type estimators. The Annals of Statistics. 2000;28:1356–1378. [Google Scholar]
  23. Lee JD, Sun DL, Sun Y, Taylor JE. Exact inference after model selection via the Lasso. arXiv preprint arXiv:1311.6238. 2013 [Google Scholar]
  24. Lockhart R, Taylor J, Tibshirani RJ, Tibshirani R. A significance test for the Lasso. The Annals of Statistics. 2014;42:413–468. doi: 10.1214/13-AOS1175. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Loh P-L, Wainwright MJ. Corrupted and missing predictors: Minimax bounds for high-dimensional linear regression; IEEE International Symposium on Information Theory, 2012; 2012. [Google Scholar]
  26. McLachlan G, Krishnan T. The EM algorithm and extensions. Vol. 382. Wiley; 2007. [Google Scholar]
  27. Meinshausen N, Bühlmann P. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2010;72:417–473. [Google Scholar]
  28. Meinshausen N, Meier L, Bühlmann P. p-values for high-dimensional regression. Journal of the American Statistical Association. 2009;104:1671–1681. [Google Scholar]
  29. Nesterov Y. Introductory lectures on convex optimization:A basic course. Vol. 87. Springer; 2004. [Google Scholar]
  30. Nickl R, van de Geer S. Confidence sets in sparse regression. The Annals of Statistics. 2013;41:2852–2876. [Google Scholar]
  31. Städler N, Bühlmann P, van de Geer S. ℓ1-penalization for mixture regression models. TEST. 2010;19:209–256. [Google Scholar]
  32. Taylor J, Lockhart R, Tibshirani RJ, Tibshirani R. Post-selection adaptive inference for least angle regression and the Lasso. arXiv preprint arXiv:1401.3889. 2014 [Google Scholar]
  33. Tseng P. An analysis of the EM algorithm and entropy-like proximal point methods. Mathematics of Operations Research. 2004;29:27–44. [Google Scholar]
  34. van de Geer S, Bühlmann P, Ritov Y, Dezeure R. On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics. 2014;42:1166–1202. [Google Scholar]
  35. van der Vaart AW. Asymptotic statistics. Vol. 3. Cambridge University Press; 2000. [Google Scholar]
  36. Vershynin R. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027. 2010 [Google Scholar]
  37. Wasserman L, Roeder K. High-dimensional variable selection. The Annals of Statistics. 2009;37:2178–2201. doi: 10.1214/08-aos646. [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Wu CFJ. On the convergence properties of the EM algorithm. The Annals of Statistics. 1983;11:95–103. [Google Scholar]
  39. Yi X, Caramanis C, Sanghavi S. Alternating minimization for mixed linear regression. arXiv preprint arXiv:1310.3745. 2013 [Google Scholar]
  40. Yuan X-T, Li P, Zhang T. Gradient hard thresholding pursuit for sparsity-constrained optimization. arXiv preprint arXiv:1311.5750. 2013 [Google Scholar]
  41. Yuan X-T, Zhang T. Truncated power method for sparse eigenvalue problems. Journal of Machine Learning Research. 2013;14:899–925. [Google Scholar]
  42. Zhang C-H, Zhang SS. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2014;76:217–242. [Google Scholar]

RESOURCES