Abstract
We provide a general theory of the expectation-maximization (EM) algorithm for inferring high dimensional latent variable models. In particular, we make two contributions: (i) For parameter estimation, we propose a novel high dimensional EM algorithm which naturally incorporates sparsity structure into parameter estimation. With an appropriate initialization, this algorithm converges at a geometric rate and attains an estimator with the (near-)optimal statistical rate of convergence. (ii) Based on the obtained estimator, we propose new inferential procedures for testing hypotheses and constructing confidence intervals for low dimensional components of high dimensional parameters. For a broad family of statistical models, our framework establishes the first computationally feasible approach for optimal estimation and asymptotic inference in high dimensions. Our theory is supported by thorough numerical results.
1 Introduction
The expectation-maximization (EM) algorithm (Dempster et al., 1977) is the most popular approach for calculating the maximum likelihood estimator of latent variable models. Nevertheless, due to the nonconcavity of the likelihood function of latent variable models, the EM algorithm generally only converges to a local maximum rather than the global one (Wu, 1983). On the other hand, existing statistical guarantees for latent variable models are only established for global optima (Bartholomew et al., 2011). Therefore, there exists a gap between computation and statistics.
Significant progress has been made toward closing the gap between the local maximum attained by the EM algorithm and the maximum likelihood estimator (Wu, 1983; Tseng, 2004; McLachlan and Krishnan, 2007; Chrétien and Hero, 2008; Balakrishnan et al., 2014). In particular, Wu (1983) first establishes general sufficient conditions for the convergence of the EM algorithm. Tseng (2004); Chrétien and Hero (2008) further improve this result by viewing the EM algorithm as a proximal point method applied to the Kullback-Leibler divergence. See McLachlan and Krishnan (2007) for a detailed survey. More recently, Balakrishnan et al. (2014) establish the first result that characterizes explicit statistical and computational rates of convergence for the EM algorithm. They prove that, given a suitable initialization, the EM algorithm converges at a geometric rate to a local maximum close to the maximum likelihood estimator. All these results are established in the low dimensional regime where the dimension d is much smaller than the sample size n.
In high dimensional regimes where the dimension d is much larger than the sample size n, there exists no theoretical guarantee for the EM algorithm. In fact, when d ≫ n, the maximum likelihood estimator is in general not well defined, unless the models are carefully regularized by sparsity-type assumptions. Furthermore, even if a regularized maximum likelihood estimator can be obtained in a computationally tractable manner, establishing the corresponding statistical properties, especially asymptotic normality, can still be challenging because of the existence of high dimensional nuisance parameters. To address such a challenge, we develop a general inferential theory of the EM algorithm for parameter estimation and uncertainty assessment of high dimensional latent variable models. In particular, we make two contributions in this paper:
For high dimensional parameter estimation, we propose a novel high dimensional EM algorithm by attaching a truncation step to the expectation step (E-step) and maximization step (M-step). Such a truncation step effectively enforces the sparsity of the attained estimator and allows us to establish a significantly improved statistical rate of convergence.
Based upon the estimator attained by the high dimensional EM algorithm, we propose a family of decorrelated score and Wald statistics for testing hypotheses for low dimensional components of the high dimensional parameter. The decorrelated Wald statistic can be further used to construct optimal valid confidence intervals for low dimensional parameters of interest.
Under a unified analytic framework, we establish simultaneous statistical and computational guarantees for the proposed high dimensional EM algorithm and the respective uncertainty assessment procedures. Let β* ∈ ℝd be the true parameter, s* be its sparsity level and {β(t)}t=0,…,T be the iterative solution sequence of the high dimensional EM algorithm, with T being the total number of iterations. In particular, we prove that:
- Given an appropriate initialization βinit with relative error upper bounded by a constant κ ∈ (0, 1), i.e., ‖βinit − β*‖2/‖β*‖2 ≤ κ, the iterative solution sequence satisfies
| ‖β(t) − β*‖2 ≤ Δ1·ρ^t + Δ2·√(s*·log d/n) | (1.1) |
with high probability. Here ρ ∈ (0, 1), and Δ1, Δ2 are quantities that possibly depend on ρ, κ and β*. As the optimization error term in (1.1) decreases to zero at a geometric rate with respect to t, the overall estimation error achieves the √(s*·log d/n) statistical rate of convergence (up to an extra factor of log n), which is (near-)minimax-optimal. See Theorem 3.4 and the corresponding discussion for details.
- The proposed decorrelated score and Wald statistics are asymptotically normal. Moreover, their limiting variances and the size of the respective confidence interval are optimal in the sense that they attain the semiparametric information bound for the low dimensional components of interest in the presence of high dimensional nuisance parameters. See Theorems 4.6 and 4.7 for details.
Our framework allows two implementations of the M-step: exact maximization versus approximate maximization. The former calculates the maximizer exactly, while the latter performs an approximate maximization through a gradient ascent step. Our framework is quite general. We illustrate its effectiveness by applying it to three high dimensional latent variable models: the Gaussian mixture model, the mixture of regression model, and regression with missing covariates.
Comparison with Related Work
A closely related work is by Balakrishnan et al. (2014), which considers the low dimensional regime where d is much smaller than n. Under certain initialization conditions, they prove that the EM algorithm converges at a geometric rate to some local optimum that attains the statistical rate of convergence. They cover both maximization and gradient ascent implementations of the M-step, and establish the consequences for the three latent variable models considered in our paper under low dimensional settings. Our framework adopts their view of treating the EM algorithm as a perturbed version of gradient methods. However, to handle the challenge of high dimensionality, the key ingredient of our framework is the truncation step that enforces the sparsity structure along the solution path. Such a truncation operation poses significant challenges for both computational and statistical analysis. In detail, for the computational analysis we need to carefully characterize the evolution of each intermediate solution’s support and its effects on the evolution of the entire iterative solution sequence. For the statistical analysis, we need to establish a fine-grained characterization of the entrywise statistical error, which is technically more challenging than establishing the ℓ2-norm error bound employed by Balakrishnan et al. (2014). In high dimensional regimes, we need to establish the √(s*·log d/n) statistical rate of convergence, which is much sharper than their √(d/n) rate when d ≫ n. In addition to point estimation, we further construct confidence intervals and hypothesis tests for latent variable models in the high dimensional regime, which have not been established before.
High dimensionality poses significant challenges for assessing the uncertainty (e.g., constructing confidence intervals and testing hypotheses) of the constructed estimators. For example, Knight and Fu (2000) show that the limiting distribution of the Lasso estimator is not Gaussian even in the low dimensional regime. A variety of approaches have been proposed to correct the Lasso estimator to attain asymptotic normality, including the debiasing method (Javanmard and Montanari, 2014), the desparsification methods (Zhang and Zhang, 2014; van de Geer et al., 2014) as well as instrumental variable-based methods (Belloni et al., 2012, 2013, 2014). Meanwhile, Lockhart et al. (2014); Taylor et al. (2014); Lee et al. (2013) propose post-selection procedures for exact inference. In addition, several authors propose methods based on data splitting (Wasserman and Roeder, 2009; Meinshausen et al., 2009), stability selection (Meinshausen and Bühlmann, 2010) and ℓ2-confidence sets (Nickl and van de Geer, 2013). However, these approaches mainly focus on generalized linear models rather than latent variable models. In addition, their results heavily rely on the fact that the estimator is a global optimum of a convex program. In comparison, our approach applies to a much broader family of statistical models with latent structures. For these latent variable models, it is computationally infeasible to obtain the global maximum of the penalized likelihood due to the nonconcavity of the likelihood function. Unlike existing approaches, our inferential theory is developed for the estimator attained by the proposed high dimensional EM algorithm, which is not necessarily a global optimum of any optimization formulation.
Another line of research for the estimation of latent variable models is the tensor method, which exploits the structures of third or higher order moments. See Anandkumar et al. (2014a,b,c) and the references therein. However, existing tensor methods primarily focus on the low dimensional regime where d≪n. In addition, since the high order sample moments generally have a slow statistical rate of convergence, the estimators obtained by the tensor methods usually have a suboptimal statistical rate even for d≪n. For example, Chaganty and Liang (2013) establish the statistical rate of convergence for mixture of regression model, which is suboptimal compared with the minimax lower bound. Similarly, in high dimensional settings, the statistical rates of convergence attained by tensor methods are significantly slower than the statistical rate obtained in this paper.
The three latent variable models considered in this paper have been well studied. Nevertheless, only a few works establish theoretical guarantees for the EM algorithm. In particular, for Gaussian mixture model, Dasgupta and Schulman (2000, 2007); Chaudhuri et al. (2009) establish parameter estimation guarantees for the EM algorithm and its extensions. For mixture of regression model, Yi et al. (2013) establish exact parameter recovery guarantees for the EM algorithm under a noiseless setting. For high dimensional mixture of regression model, Städler et al. (2010) analyze the gradient EM algorithm for the ℓ1-penalized log-likelihood. They establish support recovery guarantees for the attained local optimum but have no parameter estimation guarantees. In comparison with existing works, this paper establishes a general inferential framework for simultaneous parameter estimation and uncertainty assessment based on a novel high dimensional EM algorithm. Our analysis provides the first theoretical guarantee of parameter estimation and asymptotic inference in high dimensional regimes for the EM algorithm and its applications to a broad family of latent variable models.
Notation
Let A = [Ai,j] ∈ ℝd×d and v = (v1, …, vd)⊤ ∈ ℝd. We define the ℓq-norm (q ≥ 1) of v as ‖v‖q = (|v1|^q + ⋯ + |vd|^q)^{1/q}. In particular, ‖v‖0 denotes the number of nonzero entries of v. For q ≥ 1, we define ‖A‖q as the operator norm of A; specifically, ‖A‖2 is the spectral norm. For a set 𝒮, |𝒮| denotes its cardinality. We denote the d×d identity matrix by Id. For index sets ℐ, 𝒥 ⊆ {1, …, d}, we define Aℐ,𝒥 ∈ ℝd×d to be the matrix whose (i, j)-th entry equals Ai,j if i ∈ ℐ and j ∈ 𝒥, and zero otherwise. We define vℐ similarly. We use ⊗ and ⊙ to denote the outer product and the Hadamard product between vectors. The matrix (p, q)-norm, i.e., ‖A‖p,q, is obtained by taking the ℓp-norm of each row and then taking the ℓq-norm of the obtained row norms. Let supp(v) be the support of v, i.e., the index set of its nonzero entries. We use C, C′, … to denote generic absolute constants, whose values may vary from line to line. In addition, we use ‖·‖ψq (q ≥ 1) to denote the Orlicz norm of random variables. We will introduce more notation in §2.2.
The rest of the paper is organized as follows. In §2 we present the high dimensional EM algorithm and the corresponding procedures for inference, and then apply them to three latent variable models. In §3 and §4, we establish the main theoretical results for computation, parameter estimation and asymptotic inference, as well as their implications for specific latent variable models. In §5 we sketch the proof of the main results. In §6 we present the numerical results. In §7 we conclude the paper.
2 Methodology
We first introduce the high dimensional EM Algorithm. Then we present the respective procedures for asymptotic inference. Finally, we apply the proposed methods to three latent variable models.
2.1 High Dimensional EM Algorithm
Before we introduce the proposed high dimensional EM Algorithm (Algorithm 1), we briefly review the classical EM algorithm. Let hβ(y) be the probability density function of Y ∈ 𝒴, where β ∈ ℝd is the model parameter. For latent variable models, we assume that hβ(y) is obtained by marginalizing over an unobserved latent variable Z ∈ 𝒵, i.e.,
| hβ(y) = ∫𝒵 fβ(y, z) dz, | (2.1) |
where fβ(y, z) denotes the joint density of Y and Z.
Given the n observations y1, …, yn of Y, the EM algorithm aims at maximizing the log-likelihood
| ℓn(β) = Σi=1,…,n log hβ(yi). | (2.2) |
Due to the unobserved latent variable Z, it is difficult to directly evaluate ℓn(β). Instead, we turn to consider the difference between ℓn(β) and ℓn(β′). Let kβ(z | y) be the density of Z conditioning on the observed variable Y = y, i.e.,
| kβ(z | y) = fβ(y, z)/hβ(y). | (2.3) |
According to (2.1) and (2.2), we have
| (2.4) |
where the third equality follows from (2.3) and the inequality is obtained from Jensen’s inequality. On the right-hand side of (2.4) we have
| (2.5) |
We define the first term on the right-hand side of (2.5) to be Qn(β; β′). Correspondingly, we define its expectation to be Q(β; β′). Note that the second term on the right-hand side of (2.5) does not depend on β. Hence, given some fixed β′, we can maximize the lower bound function Qn(β; β′) over β to obtain a sufficiently large ℓn(β) − ℓn(β′). Based on this observation, at the t-th iteration of the classical EM algorithm, we evaluate Qn(β; β(t)) at the E-step and then perform maxβ Qn(β; β(t)) at the M-step. See McLachlan and Krishnan (2007) for more details.
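To make the Jensen step explicit, the per-observation version of the bound behind (2.4) can be written as follows (a sketch in our own notation; summing it over y1, …, yn and splitting the logarithm of the ratio yields the two terms described after (2.5)):

```latex
\log h_{\beta}(y) \;-\; \log h_{\beta'}(y)
  \;=\; \log \int_{\mathcal{Z}} k_{\beta'}(z \mid y)\,
        \frac{f_{\beta}(y, z)}{f_{\beta'}(y, z)}\, dz
  \;\ge\; \int_{\mathcal{Z}} k_{\beta'}(z \mid y)\,
        \log \frac{f_{\beta}(y, z)}{f_{\beta'}(y, z)}\, dz .
```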
Algorithm 1.
High Dimensional EM Algorithm
| 1: | Parameter: Sparsity Parameter ŝ, Maximum Number of Iterations T |
| 2: | Initialization: 𝒮̂init ← supp(βinit, ŝ), β(0) ← trunc(βinit, 𝒮̂init) |
| {supp(·, ·) and trunc(·, ·) are defined in (2.6) and (2.7)} | |
| 3: | For t = 0 to T − 1 |
| 4: | E-step: Evaluate Qn(β; β(t)) |
| 5: | M-step: β(t+0.5) ← Mn(β(t)) {Mn(·) is implemented as in Algorithm 2 or 3} |
| 6: | T-step: 𝒮̂(t+0.5) ← supp(β(t+0.5), ŝ), β(t+1) ← trunc(β(t+0.5), 𝒮̂(t+0.5)) |
| 7: | End For |
| 8: | Output: β̂ ← β(T) |
The proposed high dimensional EM algorithm (Algorithm 1) is built upon the E-step and M-step (lines 4 and 5) of the classical EM algorithm. In addition to the exact maximization implementation of the M-step (Algorithm 2), we allow the gradient ascent implementation of the M-step (Algorithm 3), which performs an approximate maximization via a gradient ascent step. To handle the challenge of high dimensionality, in line 6 of Algorithm 1 we perform a truncation step (T-step) to enforce the sparsity structure. In detail, we define the supp(·, ·) function in line 6 as
| supp(β, s) = {j : |βj| is among the s largest of |β1|, …, |βd|}. | (2.6) |
Also, for an index set 𝒮 ⊆ {1, …, d}, we define the trunc(·, ·) function in line 6 as
| [trunc(β, 𝒮)]j = βj·1{j ∈ 𝒮}, j = 1, …, d. | (2.7) |
Note that β(t+0.5) is the output of the M-step (line 5) at the t-th iteration of the high dimensional EM algorithm. To obtain β(t+1), the T-step (line 6) preserves the ŝ entries of β(t+0.5) with the largest magnitudes and sets the rest to zero. Here ŝ is a tuning parameter that controls the sparsity level (line 1). By iteratively performing the E-step, M-step and T-step, the high dimensional EM algorithm attains an ŝ-sparse estimator β̂ = β(T) (line 8). Here T is the total number of iterations.
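For concreteness, a minimal NumPy sketch of the supp(·, ·) and trunc(·, ·) operations in (2.6) and (2.7) is given below; the function names mirror the paper's notation, but the implementation details are our own.

```python
import numpy as np

def supp(beta, s_hat):
    """Indices of the s_hat entries of beta with largest magnitude, as in (2.6)."""
    return np.argsort(np.abs(beta))[-s_hat:]

def trunc(beta, S):
    """Keep the entries of beta indexed by S and set the rest to zero, as in (2.7)."""
    out = np.zeros_like(beta)
    out[S] = beta[S]
    return out

# T-step of Algorithm 1: keep the s_hat largest-magnitude entries of the M-step output.
# beta_next = trunc(beta_half, supp(beta_half, s_hat))
```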
It is worth noting that, the truncation strategy employed here and its variants are widely adopted in the context of sparse linear regression and sparse principal component analysis. For example, see Yuan and Zhang (2013); Yuan et al. (2013) and the references therein. Nevertheless, we incorporate this truncation strategy into the EM algorithm for the first time. Also, our analysis is significantly different from existing works.
Algorithm 2.
Maximization Implementation of the M-step
| 1: | Input: β(t), Qn(β; β(t)) |
| 2: | Output: Mn(β(t)) ← argmaxβ Qn(β; β(t)) |
Algorithm 3.
Gradient Ascent Implementation of the M-step
| 1: | Input: β(t), Qn(β; β(t)) |
| 2: | Parameter: Stepsize η > 0 |
| 3: | Output: Mn(β(t)) ← β(t) + η · ∇Qn(β(t); β(t)) |
| {The gradient is taken with respect to the first β(t) in Qn(β(t); β(t))} |
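Putting lines 3–7 of Algorithm 1 together, a schematic implementation might look as follows. Here m_step stands for a model-specific routine that folds the E-step into either Algorithm 2 or Algorithm 3 (the closed forms for the three examples are in §A); it is an assumption of this sketch rather than part of the paper's code.

```python
import numpy as np

def truncate_top(beta, s_hat):
    """T-step: keep the s_hat largest-magnitude entries of beta, zero out the rest."""
    keep = np.argsort(np.abs(beta))[-s_hat:]
    out = np.zeros_like(beta)
    out[keep] = beta[keep]
    return out

def high_dim_em(beta_init, m_step, s_hat, T):
    """Algorithm 1: alternate the (E+M)-step with the T-step for T iterations."""
    beta = truncate_top(beta_init, s_hat)
    for _ in range(T):
        beta_half = m_step(beta)               # E-step and M-step (Algorithm 2 or 3)
        beta = truncate_top(beta_half, s_hat)  # T-step
    return beta

def m_step_gradient(grad_Qn, eta):
    """Algorithm 3: M_n(beta) = beta + eta * gradient of Q_n(.; beta) evaluated at beta."""
    return lambda beta: beta + eta * grad_Qn(beta, beta)
```

For the exact maximization implementation (Algorithm 2), m_step would simply be the model-specific argmax of Qn(·; β(t)).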
2.2 Asymptotic Inference
In the sequel, we first introduce some additional notations. Then we present the proposed methods for asymptotic inference in high dimensions.
Notation
Let ∇1Q(β; β′) be the gradient of Q(β; β′) with respect to β and ∇2Q(β; β′) be the gradient with respect to β′. If there is no confusion, we simply denote ∇Q(β; β′) = ∇1Q(β; β′) as in the previous sections. We define the higher order derivatives in the same manner, e.g., ∇1,2²Q(β; β′) is calculated by first taking the derivative with respect to β and then with respect to β′. For β = (β1⊤, β2⊤)⊤ with β1 ∈ ℝd1, β2 ∈ ℝd2 and d1 + d2 = d, we use notations such as vβ1 ∈ ℝd1 and Aβ1,β2 ∈ ℝd1×d2 to denote the corresponding subvector of v ∈ ℝd and the submatrix of A ∈ ℝd×d.
We aim to conduct asymptotic inference for low dimensional components of the high dimensional parameter β*. Without loss of generality, we consider a single entry of β*. In particular, we assume β* = [α*, (γ*)⊤]⊤, where α* ∈ ℝ is the entry of interest, while γ* ∈ ℝd−1 is treated as the nuisance parameter. In the following, we construct two hypothesis tests, namely, the decorrelated score and Wald tests. Based on the decorrelated Wald test, we further construct a valid confidence interval for α*. It is worth noting that our method and theory can be easily generalized to perform statistical inference for an arbitrary low dimensional subvector of β*.
Decorrelated Score Test
For the score test, we are primarily interested in testing H0 : α* = 0, since this null hypothesis characterizes the uncertainty in variable selection. Our method easily generalizes to H0 : α* = α0 with α0 ≠ 0. For notational simplicity, we define the following key quantity
| Tn(β) = ∇1,1²Qn(β; β) + ∇1,2²Qn(β; β). | (2.8) |
Let β = (α, γ⊤)⊤. We define the decorrelated score function Sn(·, ·) ∈ ℝ as
| Sn(β, λ) = [∇1Qn(β; β)]α − w(β, λ)⊤·[∇1Qn(β; β)]γ. | (2.9) |
Here w(β, λ) ∈ ℝd−1 is obtained using the following Dantzig selector (Candès and Tao, 2007)
| w(β, λ) = argmin w∈ℝd−1 ‖w‖1 subject to ‖[Tn(β)]γ,α − [Tn(β)]γ,γ·w‖∞ ≤ λ, | (2.10) |
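Since (2.10) is a linear program, it can be solved with off-the-shelf LP solvers. The sketch below is our own illustration (not the paper's implementation): it solves a Dantzig-type problem min ‖w‖1 subject to ‖b − A·w‖∞ ≤ λ, with A = [Tn(β)]γ,γ and b = [Tn(β)]γ,α under our reading of (2.10), via the standard split w = u − v with u, v ≥ 0.

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(A, b, lam):
    """Solve min ||w||_1 s.t. ||b - A w||_inf <= lam via the split w = u - v, u, v >= 0."""
    p = A.shape[1]
    c = np.ones(2 * p)                 # objective: sum(u) + sum(v) = ||w||_1
    M = np.hstack([A, -A])             # A w written in terms of x = [u; v]
    A_ub = np.vstack([M, -M])          # encodes  b - lam <= A w <= b + lam
    b_ub = np.concatenate([b + lam, lam - b])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(0, None)] * (2 * p), method="highs")
    uv = res.x
    return uv[:p] - uv[p:]
```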
where λ > 0 is a tuning parameter. Let β̂ = (α̂, γ̂⊤)⊤, where β̂ is the estimator attained by the high dimensional EM algorithm (Algorithm 1). We define the decorrelated score statistic as
| (2.11) |
Here we use β̂0 = (0, γ̂⊤)⊤ instead of β̂ since we are interested in the null hypothesis H0 : α* = 0. We can also replace β̂0 with β̂ and the theoretical results will remain the same. In §4 we will prove that the proposed decorrelated score statistic in (2.11) is asymptotically N(0, 1). Consequently, the decorrelated score test with significance level δ ∈ (0, 1) takes the form
where Φ−1(·) is the inverse function of the Gaussian cumulative distribution function. If ψS(δ) = 1, we reject the null hypothesis H0 : α* = 0. Correspondingly, the associated p-value takes the form
The intuition for the decorrelated score statistic in (2.11) can be understood as follows. Since ℓn(β) is the log-likelihood, its score function is ∇ℓn(β) and the Fisher information at β* is I(β*) = −𝔼 β* [∇2ℓn(β*)]/n, where 𝔼β*(·) means the expectation is taken under the model with parameter β*. The following lemma reveals the connection of ∇1Qn(·; ·) in (2.9) and Tn(·) in (2.11) with the score function and Fisher information.
Lemma 2.1
For the true parameter β* and any β ∈ ℝd, it holds that
| ∇1Qn(β; β) = ∇ℓn(β)/n, 𝔼β*[Tn(β*)] = 𝔼β*[∇2ℓn(β*)]/n = −I(β*). | (2.12) |
Proof
See §C.1 for details.
Recall that the log-likelihood ℓn(β) defined in (2.2) is difficult to evaluate due to the unobserved latent variable. Lemma 2.1 provides a feasible way to calculate or estimate the corresponding score function and Fisher information, since Qn(·; ·) and Tn(·) have closed forms. The geometric intuition behind Lemma 2.1 can be understood as follows. By (2.4) and (2.5) we have
| (2.13) |
By (2.12), both sides of (2.13) have the same gradient with respect to β at β′ = β. Furthermore, by (2.5), (2.13) becomes an equality for β′ = β. Therefore, the lower bound function on the right-hand side of (2.13) is tangent to ℓn(β) at β′ = β. Meanwhile, according to (2.8), Tn(β) defines a modified curvature for the right-hand side of (2.13), which is obtained by taking the gradient with respect to β, setting β′ = β, and then differentiating once more with respect to β. The second equation in (2.12) shows that the obtained curvature equals the curvature of ℓn(β) at β = β* in expectation (up to a renormalization factor of n). Therefore, ∇1Qn(β; β) gives the score function and Tn(β*) gives a good estimate of the Fisher information I(β*). Since β* is unknown in practice, later we will use Tn(β̂) or Tn(β̂0) to approximate Tn(β*).
In the presence of the high dimensional nuisance parameter γ* ∈ ℝd−1, the classical score test is no longer applicable. In detail, the score test for H0 : α* = 0 relies on the following Taylor expansion of the score function ∂ℓn(·)/∂α
| (2.14) |
Here β* = [0, (γ*)⊤]⊤, R̅ denotes the remainder term and β̅0 = (0, γ̅⊤)⊤, where γ̅ is an estimator of the nuisance parameter γ*. The asymptotic normality of the left-hand side of (2.14) relies on the fact that the two leading terms on its right-hand side are jointly asymptotically normal and that R̅ is oℙ(1). In low dimensional settings, such a necessary condition holds for γ̅ being the maximum likelihood estimator. However, in high dimensional settings, the maximum likelihood estimator cannot guarantee that R̅ is oℙ(1), since ‖γ̅ − γ*‖2 can be large due to the curse of dimensionality. Meanwhile, for γ̅ being a sparsity-type estimator, in general the asymptotic normality of √n·(γ̅ − γ*) does not hold. For example, let γ̅ be γ̂, where γ̂ ∈ ℝd−1 is the subvector of β̂, i.e., the estimator attained by the proposed high dimensional EM algorithm. Note that γ̂ has many zero entries due to the truncation step. As n → ∞, some entries of √n·(γ̂ − γ*) have limiting distributions with point mass at zero. Clearly, this limiting distribution is not Gaussian with nonzero variance. In fact, for a similar setting of high dimensional linear regression, Knight and Fu (2000) illustrate that for γ♯ being a subvector of the Lasso estimator and γ* being the corresponding subvector of the true parameter, the limiting distribution of √n·(γ♯ − γ*) is not Gaussian.
The decorrelated score function defined in (2.9) successfully addresses the above issues. In detail, according to (2.12) in Lemma 2.1 we have
| (2.15) |
Intuitively, if we replace w(β̂0, λ) with w ∈ ℝd−1 that satisfies
| (2.16) |
we have the following Taylor expansion of the decorrelated score function
| (2.17) |
where term (ii) is zero by (2.16). Therefore, we no longer require the asymptotic normality of γ̂ − γ*. Also, we will prove that the new remainder term R̃ in (2.17) is oℙ(1), since γ̂ has a fast statistical rate of convergence. Now we only need to find a w that satisfies (2.16). Nevertheless, it is difficult to calculate the second order derivatives in (2.16), because it is hard to evaluate ℓn(·). According to (2.12), we use the submatrices of Tn(·) to approximate the derivatives in (2.16). Since [Tn(β)]γ,γ is not invertible in high dimensions, we use the Dantzig selector in (2.10) to approximately solve the linear system in (2.16). Based on this intuition, we can expect that √n·Sn(β̂0, λ) is asymptotically normal, since term (i) in (2.17) is a (rescaled) average of n i.i.d. random variables, to which we can apply the central limit theorem. Moreover, we will prove that −[Tn(β̂0)]α|γ in (2.11) is a consistent estimator of the asymptotic variance of √n·Sn(β̂0, λ). Hence, we can expect that the decorrelated score statistic in (2.11) is asymptotically N(0, 1).
From a high-level perspective, we can view w(β̂0, λ)⊤·∂ℓn(β̂0)/∂γ in (2.15) as the projection of ∂ℓn(β̂0)/∂α onto the span of ∂ℓn(β̂0)/∂γ, where w(β̂0, λ) is the projection coefficient. Intuitively, such a projection guarantees that in (2.15), Sn(β̂0, λ) is orthogonal to (uncorrelated with) ∂ℓn(β̂0)/∂γ, i.e., the score function with respect to the nuisance parameter γ. In this way, the projection corrects for the effects of the high dimensional nuisance parameter. Following this intuition of decorrelation, we call Sn(β̂0, λ) the decorrelated score function.
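To make the construction concrete, the following sketch assembles the decorrelated score statistic from ∇1Qn(β̂0; β̂0), Tn(β̂0) and the Dantzig-selector coefficient w(β̂0, λ). The exact scaling in (2.11) should be checked against the paper; treat this as an illustration of the structure (it is consistent with the asymptotic-variance statement above), not as a verbatim implementation.

```python
import numpy as np

def decorrelated_score_statistic(grad, T, w, n):
    """grad = grad_1 Q_n(beta0_hat; beta0_hat), T = T_n(beta0_hat), w = w(beta0_hat, lam).

    Coordinate 0 plays the role of alpha; the remaining coordinates are gamma.
    """
    S = grad[0] - w @ grad[1:]                    # decorrelated score function, cf. (2.9)
    T_alpha_given_gamma = T[0, 0] - w @ T[1:, 0]  # plug-in version of [T_n]_{alpha|gamma}
    return np.sqrt(n) * S / np.sqrt(-T_alpha_given_gamma)
```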
Decorrelated Wald Test
Based on the decorrelated score test, we propose the decorrelated Wald test. In detail, recall that Tn(·) is defined in (2.8). Let
| (2.18) |
where Sn(·, ·) is the decorrelated score function in (2.9), w(·, ·) is defined in (2.10) and β̂ = (α̂, γ̂⊤)⊤ is the estimator obtained by Algorithm 1. For testing the null hypothesis H0 : α* = 0, we define the decorrelated Wald statistic as
| (2.19) |
where [Tn(β̂)]α|γ is defined by replacing β̂0 with β̂ in (2.11). In §4 we will prove that this statistic is asymptotically N(0, 1). Consequently, the decorrelated Wald test with significance level δ ∈ (0, 1) takes the form
where Φ−1(·) is the inverse function of the Gaussian cumulative distribution function. If ψW(δ) = 1, we reject the null hypothesis H0 : α* = 0. The associated p-value takes the form
In more general settings where α* is possibly nonzero, in §4.1 we will prove that √n·[α̅(β̂, λ) − α*]·{−[Tn(β̂)]α|γ}1/2 is asymptotically N(0, 1). Hence, we construct the two-sided confidence interval for α* with confidence level 1 − δ as
| [ α̅(β̂, λ) − Φ−1(1 − δ/2)·{−n·[Tn(β̂)]α|γ}−1/2, α̅(β̂, λ) + Φ−1(1 − δ/2)·{−n·[Tn(β̂)]α|γ}−1/2 ]. | (2.20) |
The intuition for the decorrelated Wald statistic defined in (2.19) can be understood as follows. For notational simplicity, we define
| S̅n(β, w) = [∇1Qn(β; β)]α − w⊤·[∇1Qn(β; β)]γ, ŵ = w(β̂, λ). | (2.21) |
By the definitions in (2.9) and (2.21), we have S̅n(β̂, ŵ) = Sn(β̂, λ). According to the same intuition as for the asymptotic normality of √n·Sn(β̂0, λ) in the decorrelated score test, we can easily establish the asymptotic normality of √n·S̅n(β̂, ŵ). Based on the proof for the classical Wald test (van der Vaart, 2000), we can further establish the asymptotic normality of √n·(α̱ − α*), where α̱ is defined as the solution to
| (2.22) |
Nevertheless, it is difficult to calculate α̱. Instead, we consider the first order Taylor approximation
| (2.23) |
Here ŵ is defined in (2.21) and the gradient is taken with respect to β in (2.21). According to (2.8) and (2.21), we have that in (2.23),
| ∂S̅n(β̂, ŵ)/∂α = [Tn(β̂)]α,α − ŵ⊤·[Tn(β̂)]γ,α. |
Hence, α̅(β̂, λ) defined in (2.18) is the solution to (2.23). Alternatively, we can view α̅(β̂, λ) as the output of one Newton-Raphson iteration applied to α̂. Since (2.23) approximates (2.22), intuitively α̅(β̂, λ) has similar statistical properties to α̱, i.e., the solution to (2.22). Therefore, we can expect that √n·[α̅(β̂, λ) − α*] is asymptotically normal. Moreover, we will prove that {−[Tn(β̂)]α|γ}−1 is a consistent estimator of the asymptotic variance of √n·[α̅(β̂, λ) − α*]. Thus, the decorrelated Wald statistic in (2.19) is asymptotically N(0, 1). It is worth noting that, although the intuition of the decorrelated Wald statistic is the same as that of the one-step estimator proposed by Bickel (1975), here we do not require the √n-consistency of the initial estimator α̂ to achieve the asymptotic normality of α̅(β̂, λ).
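Under the same reading, the one-step update and the interval in (2.20) can be sketched as follows; the precise form of α̅(β̂, λ) is given in (2.18), so this is only a schematic rendering of the Newton-Raphson interpretation described above.

```python
import numpy as np
from scipy.stats import norm

def one_step_alpha(beta_hat, grad, T, w):
    """One Newton-Raphson step applied to alpha_hat, using the decorrelated score."""
    S = grad[0] - w @ grad[1:]                    # S_n(beta_hat, lam) = S_bar_n(beta_hat, w_hat)
    T_alpha_given_gamma = T[0, 0] - w @ T[1:, 0]  # plug-in [T_n(beta_hat)]_{alpha|gamma}
    return beta_hat[0] - S / T_alpha_given_gamma, T_alpha_given_gamma

def wald_confidence_interval(alpha_bar, T_alpha_given_gamma, n, delta):
    """Two-sided interval of level 1 - delta around the one-step estimator, cf. (2.20)."""
    half_width = norm.ppf(1 - delta / 2) / np.sqrt(-n * T_alpha_given_gamma)
    return alpha_bar - half_width, alpha_bar + half_width
```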
2.3 Applications to Latent Variable Models
In the sequel, we introduce three latent variable models as examples. To apply the high dimensional EM algorithm in §2.1 and the methods for asymptotic inference in §2.2, we only need to specify the forms of Qn(·; ·) defined in (2.5), Mn(·) in Algorithms 2 and 3, and Tn(·) in (2.8) for each model.
Gaussian Mixture Model
Let y1, …, yn be the n i.i.d. realizations of Y ∈ ℝd and
| Y = Z·β* + V. | (2.24) |
Here Z is a Rademacher random variable, i.e., ℙ(Z = +1) = ℙ(Z = −1) = 1/2, and V ~ N(0, σ2 ·Id) is independent of Z, where σ is the standard deviation. We suppose σ is known. In high dimensional settings, we assume that β* ∈ ℝd is sparse. To avoid the degenerate case in which the two Gaussians in the mixture are identical, here we suppose that β* ≠ 0. See §A.1 for the detailed forms of Qn(·; ·), Mn(·) and Tn(·).
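The paper's exact expressions for this model are deferred to §A.1; for orientation, the following is a minimal NumPy sketch of the standard symmetric-mixture update, stated here only as an assumption-laden illustration.

```python
import numpy as np

def m_step_gmm(beta, Y, sigma):
    """Exact M-step for the symmetric mixture Y = Z * beta + V in (2.24).

    omega_i = P(Z_i = +1 | y_i) under the current beta (E-step); the maximizer of
    Q_n(.; beta) is then the average of (2*omega_i - 1) * y_i (M-step).
    """
    omega = 1.0 / (1.0 + np.exp(-2.0 * (Y @ beta) / sigma ** 2))  # E-step responsibilities
    return ((2.0 * omega - 1.0)[:, None] * Y).mean(axis=0)        # M-step update
```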
Mixture of Regression Model
We assume that Y ∈ ℝ and X ∈ ℝd satisfy
| Y = Z·X⊤β* + V, | (2.25) |
where X ~ N(0, Id), V ~ N(0, σ2) and Z is a Rademacher random variable. Here X, V and Z are independent. In the high dimensional regime, we assume β* ∈ ℝd is sparse. To avoid the degenerate case, we suppose β* ≠ 0. In addition, we assume that σ is known. See §A.2 for the detailed forms of Qn(·; ·), Mn(·) and Tn(·).
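Analogously to the Gaussian mixture case, the E-step responsibilities and the two M-step implementations admit closed forms (the paper's versions are in §A.2). The sketch below uses the standard updates, up to the overall normalization of Qn (e.g., a 1/σ2 factor), which only rescales the effective stepsize of the gradient variant.

```python
import numpy as np

def m_step_mor(beta, X, y, sigma, eta=None):
    """M-step for the mixture of regression model in (2.25).

    With eta=None, return the exact maximizer (a weighted least-squares solve);
    otherwise return one gradient ascent step with stepsize eta (Algorithm 3).
    """
    omega = 1.0 / (1.0 + np.exp(-2.0 * y * (X @ beta) / sigma ** 2))  # E-step
    target = (2.0 * omega - 1.0) * y                                  # "signed" responses
    if eta is None:
        return np.linalg.solve(X.T @ X / len(y), X.T @ target / len(y))
    return beta + eta * (X.T @ (target - X @ beta)) / len(y)
```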
Regression with Missing Covariates
We assume that Y ∈ ℝ and X ∈ ℝd satisfy
| Y = X⊤β* + V, | (2.26) |
where X ~ N(0, Id) and V ~ N(0, σ2) are independent. We assume β* is sparse. Let x1, …, xn be the n realizations of X. We assume that each coordinate of xi is missing (unobserved) independently with probability pm ∈ [0, 1). We treat the missing covariates as the latent variable and suppose that σ is known. See §A.3 for the detailed forms of Qn(·; ·), Mn(·) and Tn(·).
3 Theory of Computation and Estimation
Before we lay out the main results, we introduce three technical conditions, which will significantly simplify our presentation. These conditions will be verified for the two implementations of the high dimensional EM algorithm and three latent variable models.
The first two conditions, proposed by Balakrishnan et al. (2014), characterize the properties of the population version lower bound function Q(·; ·), i.e., the expectation of Qn(·; ·) defined in (2.5). We define the respective population version M-step as follows. For the maximization implementation of the M-step (Algorithm 2), we define
| M(β′) = argmaxβ Q(β; β′). | (3.1) |
For the gradient ascent implementation of the M-step (Algorithm 3), we define
| M(β′) = β′ + η·∇1Q(β′; β′), | (3.2) |
where η > 0 is the stepsize in Algorithm 3. Hereafter, we employ ℬ to denote the basin of attraction, i.e., the local region in which the proposed high dimensional EM algorithm enjoys desired statistical and computational guarantees.
Condition 3.1
We define two versions of this condition.
- Lipschitz-Gradient-1(γ1, ℬ). For the true parameter β* and any β ∈ ℬ, we have
| ‖∇1Q(M(β); β*) − ∇1Q(M(β); β)‖2 ≤ γ1·‖β − β*‖2, | (3.3) |
where M(·) is the population version M-step (maximization implementation) defined in (3.1).
- Lipschitz-Gradient-2(γ2, ℬ). For the true parameter β* and any β ∈ ℬ, we have
| ‖∇1Q(β; β*) − ∇1Q(β; β)‖2 ≤ γ2·‖β − β*‖2. | (3.4) |
Condition 3.1 defines a variant of Lipschitz continuity for ∇1Q(·; ·). Note that in (3.3) and (3.4), the gradient is taken with respect to the first variable of Q(·; ·). Meanwhile, the Lipschitz continuity is with respect to the second variable of Q(·; ·). Also, the Lipschitz property is defined only between the true parameter β* and an arbitrary β ∈ ℬ, rather than between two arbitrary β’s. In the sequel, we will use (3.3) and (3.4) in the analysis of the two implementations of the M-step respectively.
Condition 3.2
Concavity-Smoothness(μ, ν, ℬ). For any β1, β2 ∈ ℬ, Q(·; β*) is μ-smooth, i.e.,
| Q(β1; β*) − Q(β2; β*) − ⟨∇1Q(β2; β*), β1 − β2⟩ ≥ −μ/2·‖β1 − β2‖2², | (3.5) |
and ν-strongly concave, i.e.,
| Q(β1; β*) − Q(β2; β*) − ⟨∇1Q(β2; β*), β1 − β2⟩ ≤ −ν/2·‖β1 − β2‖2². | (3.6) |
This condition indicates that, when the second variable of Q(·; ·) is fixed to be β*, the function Q(·; β*) is ‘sandwiched’ between two quadratic functions. Conditions 3.1 and 3.2 are essential to establishing the desired optimization results. In detail, Condition 3.2 ensures that standard convex optimization results for strongly convex and smooth objective functions, e.g., in Nesterov (2004), can be applied to −Q(·; β*). Since our analysis involves not only Q(·; β*) but also Q(·; β) for all β ∈ ℬ, we also need to quantify the difference between Q(·; β*) and Q(·; β) by Condition 3.1. It suggests that this difference can be controlled in the sense that ∇1Q(·; β) is Lipschitz with respect to β. Consequently, for any β ∈ ℬ, the behavior of Q(·; β) mimics that of Q(·; β*). Hence, standard convex optimization results can be adapted to analyze −Q(·; β) for any β ∈ ℬ.
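To see how Conditions 3.1 and 3.2 interact, the following sketch records the standard population-level contraction argument for the exact M-step (cf. Balakrishnan et al., 2014). It uses the self-consistency property M(β*) = β*, equivalently ∇1Q(β*; β*) = 0, the first-order optimality ∇1Q(M(β); β) = 0, and the monotone-gradient consequence of (3.6); the exact constants in Theorem 3.4 are of course derived with additional care.

```latex
\begin{align*}
\nu \,\| M(\beta) - \beta^* \|_2^2
 &\le \big\langle \nabla_1 Q(M(\beta); \beta^*) - \nabla_1 Q(\beta^*; \beta^*),\,
       \beta^* - M(\beta) \big\rangle      && \text{(strong concavity)} \\
 &=   \big\langle \nabla_1 Q(M(\beta); \beta^*) - \nabla_1 Q(M(\beta); \beta),\,
       \beta^* - M(\beta) \big\rangle      && \text{(optimality and self-consistency)} \\
 &\le \gamma_1 \,\| \beta - \beta^* \|_2 \,\| M(\beta) - \beta^* \|_2 .
       && \text{(Cauchy--Schwarz and (3.3))}
\end{align*}
```

Dividing through gives ‖M(β) − β*‖2 ≤ (γ1/ν)·‖β − β*‖2, which is exactly the contraction factor γ1/ν that reappears in the discussion of Theorem 3.4.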
The third condition characterizes the statistical error between the sample version and population version M-steps, i.e., Mn(·) defined in Algorithms 2 and 3, and M(·) in (3.1) and (3.2). Recall that ‖ · ‖0 denotes the total number of nonzero entries in a vector.
Condition 3.3
Statistical-Error(ε, δ, s, n, ℬ). For any fixed β ∈ ℬ with ‖β‖0 ≤ s, we have that
| ‖Mn(β) − M(β)‖∞ ≤ ε | (3.7) |
holds with probability at least 1 − δ. Here ε > 0 possibly depends on δ, sparsity level s, sample size n, dimension d, as well as the basin of attraction ℬ.
In (3.7), the statistical error ε quantifies the ℓ∞-norm of the difference between the population version and sample version M-steps. In particular, we constrain the input β of M(·) and Mn(·) to be s-sparse. Such a condition is different from the one used by Balakrishnan et al. (2014). In detail, they quantify the statistical error with the ℓ2-norm and do not constrain the input of M(·) and Mn(·) to be sparse. Consequently, our subsequent statistical analysis differs from theirs. The reason we use the ℓ∞-norm is that it characterizes the more refined entrywise statistical error, which converges at a fast rate of √(log d/n) (possibly with extra factors depending on specific models). In comparison, the ℓ2-norm statistical error converges at a slow rate of √(d/n), which does not decrease to zero as n increases when d ≫ n. Moreover, the fine-grained entrywise statistical error is crucial to quantifying the effects of the truncation step (line 6 of Algorithm 1) on the iterative solution sequence.
3.1 Main Results
Equipped with Conditions 3.1–3.3, we now lay out the computational and statistical results for the high dimensional EM algorithm. To simplify the technical analysis of the algorithm, we focus on its resampling version, which is illustrated in Algorithm 4.
Algorithm 4.
High Dimensional EM Algorithm with Resampling.
| 1: | Parameter: Sparsity Parameter ŝ, Maximum Number of Iterations T |
| 2: | Initialization: 𝒮̂init ← supp(βinit, ŝ), β(0) ← trunc(βinit, 𝒮̂init), |
| {supp(·, ·) and trunc(·, ·) are defined in (2.6) and (2.7)} | |
| Split the Dataset into T Subsets of Size n/T | |
| {Without loss of generality, we assume n/T is an integer} | |
| 3: | For t = 0 to T − 1 |
| 4: | E-step: Evaluate Qn/T (β; β(t)) with the t-th Data Subset |
| 5: | M-step: β(t+0.5) ← Mn/T(β(t)) |
| {Mn/T(·) is implemented as in Algorithm 2 or 3 with Qn(·; ·) replaced by Qn/T(·; ·)} | |
| 6: | T-step: 𝒮̂(t+0.5) ← supp(β(t+0.5), ŝ), β(t+1) ← trunc(β(t+0.5), 𝒮̂(t+0.5)) |
| 7: | End For |
| 8: | Output: β̂ ← β(T) |
Theorem 3.4
For Algorithm 4, we define ℬ = {β : ‖β − β*‖2 ≤ R}, where R = κ·‖β*‖2 for some κ∈(0, 1). We assume that Condition Concavity-Smoothness(μ, ν, ℬ) holds and ‖βinit − β*‖2 ≤ R/2.
- For the maximization implementation of the M-step (Algorithm 2), we suppose that Condition Lipschitz-Gradient-1(γ1, ℬ) holds with γ1/ν ∈ (0, 1) and
| (3.8) |
| (3.9) |
where C ≥ 1 and C′ > 0 are absolute constants. Under Condition Statistical-Error(ε, δ/T, ŝ, n/T, ℬ) we have that, for t = 1, …, T,
| (3.10) |
holds with probability at least 1 − δ, where C′ is the same constant as in (3.9).
- For the gradient ascent implementation of the M-step (Algorithm 3), we suppose that Condition Lipschitz-Gradient-2(γ2, ℬ) holds with 1 − 2·(ν − γ2)/(ν + μ) ∈ (0, 1) and that the stepsize in Algorithm 3 is set to η = 2/(ν + μ). Meanwhile, we assume that
| (3.11) |
| (3.12) |
where C ≥ 1 and C′ > 0 are absolute constants. Under Condition Statistical-Error(ε, δ/T, ŝ, n/T, ℬ) we have that, for t = 1, …, T,
| (3.13) |
holds with probability at least 1 − δ, where C′ is the same constant as in (3.12).
Proof
See §5.1 for a detailed proof.
The assumptions in (3.8) and (3.11) state that the sparsity parameter ŝ in Algorithm 4 is chosen to be sufficiently large and also of the same order as the true sparsity level s*. These assumptions, which will be used by Lemma 5.1 in the proof of Theorem 3.4, ensure that the error incurred by the truncation step can be upper bounded. In addition, as will be shown for specific models in §3.2, the error term ε in Condition Statistical-Error(ε, δ/T, ŝ, n/T, ℬ) decreases as the sample size n increases. By the assumptions in (3.8) and (3.11), √(ŝ) is of the same order as √(s*). Therefore, the assumptions in (3.9) and (3.12) suggest that the sample size n is sufficiently large such that √(ŝ)·ε is sufficiently small. These assumptions guarantee that the entire iterative solution sequence remains within the basin of attraction ℬ in the presence of statistical error. The assumptions in (3.8), (3.9), (3.11) and (3.12) will become more explicit as we specify the values of γ1, γ2, ν, μ and κ for specific models.
Theorem 3.4 illustrates that the upper bound of the overall estimation error can be decomposed into two terms. The first term is the upper bound of the optimization error, which decreases to zero at a geometric rate of convergence, because we have γ1/ν < 1 in (3.10) and 1 − 2·(ν − γ2)/(ν + μ) < 1 in (3.13). Meanwhile, the second term is the upper bound of the statistical error, which does not depend on t. Since √(ŝ) is of the same order as √(s*), this term is proportional to √(s*)·ε, where ε is the entrywise statistical error between M(·) and Mn(·). We will prove that, for a variety of models and the two implementations of the M-step, ε is roughly of the order √(log d/n). (There may be extra factors attached to ε depending on each specific model.) Hence, the statistical error term is roughly of the order √(s*·log d/n). Consequently, for a sufficiently large t = T such that the optimization and statistical error terms in (3.10) or (3.13) are of the same order, the final estimator β̂ = β(T) attains a (near-)optimal √(s*·log d/n) (possibly with extra factors) rate of convergence for a broad variety of high dimensional latent variable models.
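As a back-of-the-envelope illustration of the last point, balancing a bound of the form ρ^T·Δ + √(s*)·ε with ε ≍ √(log d/n) (the orders suggested above; exact constants are model specific) indicates that a number of iterations T that is only logarithmic in n suffices:

```latex
\rho^{T}\cdot \Delta \;\asymp\; \sqrt{s^*}\cdot\varepsilon
\quad\Longrightarrow\quad
T \;\asymp\; \frac{\log\big(\Delta / (\sqrt{s^*}\,\varepsilon)\big)}{\log(1/\rho)}
  \;=\; O\!\left(\log\frac{n}{s^*\log d}\right).
```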
3.2 Implications for Specific Models
To establish the corresponding results for specific models under the unified framework, we only need to establish Conditions 3.1–3.3 and determine the key quantities R, γ1, γ2, ν, μ, κ and ε. Recall that Conditions 3.1 and 3.2 and the models analyzed in our paper are identical to those in Balakrishnan et al. (2014). Meanwhile, note that Conditions 3.1 and 3.2 only involve the population version lower bound function Q(·; ·) and M-step M(·). Since Balakrishnan et al. (2014) prove that the quantities in Conditions 3.1 and 3.2 are independent of the dimension d and sample size n, their corresponding results can be directly adapted. To establish the final results, it still remains to verify Condition 3.3 for each high dimensional latent variable model.
Gaussian Mixture Model
The following lemma, which is proved by Balakrishnan et al. (2014), verifies Conditions 3.1 and 3.2 for Gaussian mixture model. Recall that σ is the standard deviation of each individual Gaussian distribution within the mixture.
Lemma 3.5
Suppose we have ‖β*‖2/σ ≥ r, where r > 0 is a sufficiently large constant that denotes the minimum signal-to-noise ratio. There exists some absolute constant C > 0 such that Conditions Lipschitz-Gradient-1(γ1, ℬ) and Concavity-Smoothness(μ, ν, ℬ) hold with
| (3.14) |
Proof
See the proof of Corollary 1 in Balakrishnan et al. (2014) for details.
Now we verify Condition 3.3 for the maximization implementation of the M-step (Algorithm 2).
Lemma 3.6
For the maximization implementation of the M-step (Algorithm 2), we have that for a sufficiently large n and ℬ specified in (3.14), Condition Statistical-Error(ε, δ, s, n, ℬ) holds with
| (3.15) |
Proof
See §B.4 for a detailed proof.
The next theorem establishes the implication of Theorem 3.4 for Gaussian mixture model.
Theorem 3.7
We consider the maximization implementation of M-step (Algorithm 2). We assume ‖β*‖2/σ ≥ r holds with a sufficiently large r > 0, and ℬ and R are as defined in (3.14). We suppose the initialization βinit of Algorithm 4 satisfies ‖βinit − β*‖2 ≤ R/2. Let the sparsity parameter ŝ be
| (3.16) |
with C specified in (3.14) and C′ ≥ 1. Let the total number of iterations T in Algorithm 4 be
| (3.17) |
Meanwhile, we suppose the dimension d is sufficiently large such that T in (3.17) is upper bounded by √d, and the sample size n is sufficiently large such that
| (3.18) |
We have that, with probability at least 1 − 2 · d−1/2, the final estimator β̂ = β(T) satisfies
| (3.19) |
Proof
First we plug the quantities in (3.14) and (3.15) into Theorem 3.4. Recall that Theorem 3.4 requires Condition Statistical-Error(ε, δ/T, ŝ, n/T, ℬ). Thus we need to replace δ and n with δ/T and n/T in the definition of ε in (3.15). Then we set δ = 2 · d−1/2. Since T is specified in (3.17) and the dimension d is sufficiently large such that T ≤ √d, we have log [2/(δ/T)] ≤ log d in the definition of ε. By (3.16) and (3.18), we can then verify the assumptions in (3.8) and (3.9). Finally, by plugging T in (3.17) into (3.10) and taking t = T, we can verify that in (3.10) the optimization error term is smaller than the statistical error term up to a constant factor. Therefore, we obtain (3.19).
To see the statistical rate of convergence with respect to s*, d and n, for the moment we assume that R, r, ‖β*‖∞, ‖β*‖2 and σ are absolute constants. From (3.16) and (3.17), we obtain ŝ = C·s* and therefore ΔGMM(s*) = C′·√(s*), which implies T = C″·log[n/(s*·log d)]. Hence, according to (3.19) we have that, with high probability,
Because the minimax lower bound for estimating an s*-sparse d-dimensional vector is √(s*·log d/n), the rate of convergence in (3.19) is optimal up to a factor of log n. Such a logarithmic factor results from the resampling scheme in Algorithm 4, since we only utilize n/T samples within each iteration. We expect that by directly analyzing Algorithm 1 we can eliminate such a logarithmic factor, which, however, incurs extra technical complexity in the analysis.
Mixture of Regression Model
The next lemma, proved by Balakrishnan et al. (2014), verifies Conditions 3.1 and 3.2 for mixture of regression model. Recall that β* is the regression coefficient and σ is the standard deviation of the Gaussian noise.
Lemma 3.8
Suppose ‖β*‖2/σ ≥ r, where r > 0 is a sufficiently large constant that denotes the required minimum signal-to-noise ratio. Conditions Lipschitz-Gradient-1(γ1, ℬ), Lipschitz-Gradient-2(γ2, ℬ) and Concavity-Smoothness(μ, ν, ℬ) hold with
| (3.20) |
Proof
See the proof of Corollary 3 in Balakrishnan et al. (2014) for details.
The following lemma establishes Condition 3.3 for the two implementations of the M-step.
Lemma 3.9
For ℬ specified in (3.20), we have the following results.
- For the maximization implementation of the M-step (line 5 of Algorithm 4), we have that Condition Statistical-Error(ε, δ, s, n, ℬ) holds with
for sufficiently large sample size n and absolute constant C > 0.(3.21) - For the gradient ascent implementation, Condition Statistical-Error(ε, δ, s, n, ℬ) holds with
for sufficiently large sample size n and C > 0, where η denotes the stepsize in Algorithm 3.(3.22)
Proof
See §B.5 for a detailed proof.
The next theorem establishes the implication of Theorem 3.4 for mixture of regression model.
Theorem 3.10
Let ‖β*‖2/σ ≥ r with r > 0 sufficiently large. Assuming that ℬ and R are specified in (3.20) and the initialization βinit satisfies ‖βinit − β*‖2 ≤ R/2, we have the following results.
- For the maximization implementation of the M-step (Algorithm 2), let ŝ and T be
We suppose d and n are sufficiently large such that T ≤ √d and
Then with probability at least 1 − 4 · d−1/2, the final estimator β̂ = β(T) satisfies(3.23) - For the gradient ascent implementation of the M-step (Algorithm 3) with stepsize set to η = 1, let ŝ and T be
We suppose d and n are sufficiently large such that T ≤ √d and
Then with probability at least 1 − 4 · d−1/2, the final estimator β̂ = β(T) satisfies(3.24)
Proof
The proof is similar to Theorem 3.7.
To understand the intuition of Theorem 3.10, we suppose that ‖β*‖2, σ, R and r are absolute constants, which implies ŝ = C·s*. Therefore, for the maximization and gradient ascent implementations of the M-step, we have T = C′·log [n/(s*·log d)] and T = C″·log {n/[(s*)2·log d]} respectively. Hence, by (3.23) in Theorem 3.10 we have that, for the maximization implementation of the M-step,
| (3.25) |
holds with high probability. Meanwhile, from (3.24) in Theorem 3.10 we have that, for the gradient ascent implementation of the M-step,
| (3.26) |
holds with high probability. The statistical rates in (3.25) and (3.26) attain the minimax lower bound up to factors of √(log n) and √(s*·log n) respectively and are therefore near-optimal. Note that the statistical rate of convergence attained by the gradient ascent implementation of the M-step is slower by a factor of √(s*) than the rate of the maximization implementation. However, our discussion in §A.2 illustrates that, for mixture of regression model, the gradient ascent implementation does not involve estimating the inverse covariance of X in (2.25). Hence, the gradient ascent implementation is more computationally efficient, and is also applicable to settings in which X has more general covariance structures.
Regression with Missing Covariates
Recall that we consider the linear regression model with covariates missing completely at random, which is defined in §2.3. The next lemma, which is proved by Balakrishnan et al. (2014), verifies Conditions 3.1 and 3.2 for this model. Recall that pm denotes the probability that each covariate is missing and σ is the standard deviation of the Gaussian noise.
Lemma 3.11
Suppose ‖β*‖2/σ ≤ r, where r > 0 is a constant that denotes the required maximum signal-to-noise ratio. Assuming that we have
for some constant κ ∈ (0, 1), then we have that Conditions Lipschitz-Gradient-2(γ2, ℬ) and Concavity-Smoothness(μ, ν, ℬ) hold with
| (3.27) |
Proof
See the proof of Corollary 6 in Balakrishnan et al. (2014) for details.
The next lemma proves Condition 3.3 for the gradient ascent implementation of the M-step.
Lemma 3.12
Suppose ℬ is defined in (3.27) and ‖β*‖2/σ ≤ r. For the gradient ascent implementation of the M-step (Algorithm 3), Condition Statistical-Error(ε, δ, s, n, ℬ) holds with
| (3.28) |
for sufficiently large sample size n and C > 0. Here η denotes the stepsize in Algorithm 3.
Proof
See §B.6 for a detailed proof.
Next we establish the implication of Theorem 3.4 for regression with missing covariates.
Theorem 3.13
We consider the gradient ascent implementation of M-step (Algorithm 3) in which η = 1. Under the assumptions of Lemma 3.11, we suppose the sparsity parameter ŝ takes the form of (3.11) and the initialization βinit satisfies ‖βinit − β*‖2 ≤ R/2 with ν, μ, γ2, κ and R specified in (3.27). For r > 0 specified in Lemma 3.11, let the total number of iterations T in Algorithm 4 be
| (3.29) |
We assume d and n are sufficiently large such that T ≤ √d and
We have that, with probability at least 1 − 12 · d−1/2, the final estimator β̂ = β(T) satisfies
| (3.30) |
Proof
The proof is similar to Theorem 3.7.
Assuming r, pm and κ are absolute constants, we have ΔRMC(s*) = C·s* in (3.29) since ŝ = C′·s*. Hence, we obtain T = C″·log{n/[(s*)2·log d]}. By (3.30) we have that, with high probability,
which is near-optimal with respect to the minimax lower bound. It is worth noting that the assumption ‖β*‖2/σ ≤ r in Lemma 3.12 requires the signal-to-noise ratio to be upper bounded rather than lower bounded, which differs from the assumptions for the previous two models. Such a counter-intuitive phenomenon is explained by Loh and Wainwright (2012).
4 Theory of Inference
To simplify the presentation of the unified framework, we first lay out several technical conditions, which will later be verified for each model. Some of the notations used in this section are introduced in §2.2. The first condition characterizes the statistical rate of convergence of the estimator attained by the high dimensional EM algorithm (Algorithm 4).
Condition 4.1 Parameter-Estimation (ζEM)
It holds that
where ζEM scales with s*, d and n.
Since both β̂ and β* are sparse, we can verify this condition for each model based on the ℓ2-norm recovery results in Theorems 3.7, 3.10 and 3.13. The second condition quantifies the statistical error between the gradients of Qn(β*; β*) and Q(β*; β*).
Condition 4.2 Gradient-Statistical-Error (ζG)
For the true parameter β*, it holds that
where ζG scales with s*, d and n.
Note that for the gradient ascent implementation of the M-step (Algorithm 3) and its population version defined in (3.2), we have
| Mn(β*) − M(β*) = η·[∇1Qn(β*; β*) − ∇1Q(β*; β*)]. |
Thus, we can verify Condition 4.2 for each model following the proof of Lemmas 3.6, 3.9 and 3.12. Recall that Tn(·) is defined in (2.8). The following condition quantifies the difference between Tn(β*) and its expectation. Recall that ‖·‖∞,∞ denotes the maximum magnitude of all entries in a matrix.
Condition 4.3 Tn(·)-Concentration (ζT)
For the true parameter β*, it holds that
where ζT scales with d and n.
By Lemma 2.1 we have 𝔼β* [Tn(β*)] = −I(β*). Hence, Condition 4.3 characterizes the statistical rate of convergence of Tn(β*) for estimating the Fisher information. Since β* is unknown in practice, we use Tn(β̂) or Tn(β̂0) to approximate Tn(β*), where β̂0 is defined in (2.11). The next condition quantifies the accuracy of this approximation.
Condition 4.4 Tn(·)-Lipschitz(ζL)
For the true parameter β* and any β, we have
| ‖Tn(β) − Tn(β*)‖∞,∞ ≤ ζL·‖β − β*‖1, |
where ζL scales with d and n.
Condition 4.4 characterizes the Lipschitz continuity of Tn(·). We consider β = β̂. Since Condition 4.1 ensures ‖β̂−β*‖1 converges to zero at the rate of ζEM, Condition 4.4 implies Tn(β̂) converges to Tn(β*) entrywise at the rate of ζEM·ζL. In other words, Tn(β̂) is a good approximation of Tn(β*).
In the following, we lay out an assumption on several population quantities and the sample size n. Recall that β* = [α*, (γ*)⊤]⊤, where α* ∈ ℝ is the entry of interest, while γ* ∈ ℝd−1 is the nuisance parameter. By the notations introduced in §2.2, [I(β*)]γ,γ ∈ ℝ(d−1)×(d−1) and [I(β*)]γ,α ∈ ℝ(d−1)×1 denote the submatrices of the Fisher information matrix I(β*) ∈ ℝd×d. We define w* and the related population quantities as
| (4.1) |
We define λ1 [I(β*)] and λd [I(β*)] as the largest and smallest eigenvalues of I(β*), and
| (4.2) |
According to (4.1) and (4.2), we can easily verify that
| (4.3) |
The following assumption ensures that λd [I(β*)] >0. Hence, [I(β*)]γ,γ in (4.1) is invertible. Also, according to (4.3) and the fact that λd[I(β*)] >0, we have [I(β*)]α|γ >0.
Assumption 4.5
We impose the following assumptions.
- For positive absolute constants ρmax and ρmin, we assume
| (4.4) |
- The tuning parameter λ of the Dantzig selector in (2.10) is set to
| (4.5) |
where C ≥ 1 is a sufficiently large absolute constant. We suppose the sample size n is sufficiently large such that
| (4.6) |
The assumption on λd[I(β*)] guarantees that the Fisher information matrix is positive definite. The other assumptions in (4.4) guarantee the existence of the asymptotic variance of √n·Sn(β̂0, λ) in the score statistic defined in (2.11) and that of √n·[α̅(β̂, λ) − α*] in the Wald statistic defined in (2.19). Similar assumptions are standard in existing asymptotic inference results. For example, for mixture of regression model, Khalili and Chen (2007) impose variants of these assumptions.
For specific models, we will show that ζEM, ζG, ζT and λ all decrease with n, while ζL increases with n at a slow rate. Therefore, the assumptions in (4.6) ensure that the sample size n is sufficiently large. We will make these assumptions more explicit after we specify ζEM, ζG, ζT and ζL for each model. Note that the assumptions in (4.6) also implicitly require w* to be sparse. For instance, for λ specified in (4.5), the condition in (4.6) imposes an upper bound that involves the sparsity level of w*. In the following, we will prove that ζT is of the order √(log d/n). Hence, we require that w* ∈ ℝd−1 is sparse. We can understand this sparsity assumption as follows.
According to the definition of w* in (4.1), we have [I(β*)]γ,γ·w* = [I(β*)]γ,α. Therefore, such a sparsity assumption suggests that [I(β*)]γ,α lies within the span of a few columns of [I(β*)]γ,γ.
By the block matrix inversion formula we have {[I(β*)]−1}γ,α = δ·w*, where δ ∈ ℝ (the value of δ is made explicit in the display after this list). Hence, it can also be understood as a sparsity assumption on the (d − 1) × 1 submatrix of [I(β*)]−1.
Such a sparsity assumption on w* is necessary, because otherwise it is difficult to accurately estimate w* in high dimensional regimes. In the context of high dimensional generalized linear models, Zhang and Zhang (2014); van de Geer et al. (2014) impose similar sparsity assumptions.
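For completeness, the constant δ in the second item above can be made explicit under our reading of (4.1)–(4.2), with [I(β*)]α|γ the scalar Schur complement of [I(β*)]γ,γ in I(β*). By the block matrix inversion formula,

```latex
\big\{ [I(\beta^*)]^{-1} \big\}_{\gamma,\alpha}
  \;=\; -\,[I(\beta^*)]_{\gamma,\gamma}^{-1}\,[I(\beta^*)]_{\gamma,\alpha}
        \cdot \big( [I(\beta^*)]_{\alpha|\gamma} \big)^{-1}
  \;=\; -\,\frac{w^*}{[I(\beta^*)]_{\alpha|\gamma}} ,
```

so δ = −1/[I(β*)]α|γ, and sparsity of w* is indeed equivalent to sparsity of this submatrix of [I(β*)]−1.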
4.1 Main Results
Equipped with Conditions 4.1–4.4 and Assumption 4.5, we are now ready to establish our theoretical results to justify the inferential methods proposed in §2.2. We first cover the decorrelated score test and then the decorrelated Wald test. Finally, we establish the optimality of our proposed methods.
Decorrelated Score Test
The next theorem establishes the asymptotic normality of the decorrelated score statistic defined in (2.11).
Theorem 4.6
We consider β* = [α*, (γ*)⊤]⊤ with α* = 0. Under Assumption 4.5 and Conditions 4.1–4.4, we have that for n → ∞,
| (4.7) |
where β̂0 and [Tn(β̂0)]α|γ ∈ ℝ are defined in (2.11).
Proof
See §5.2 for a detailed proof.
Decorrelated Wald Test
The next theorem provides the asymptotic normality of √n·[α̅(β̂, λ) − α*], which implies that the confidence interval for α* in (2.20) is valid. In particular, for testing the null hypothesis H0: α* = 0, the next theorem implies the asymptotic normality of the decorrelated Wald statistic defined in (2.19).
Theorem 4.7
Under Assumption 4.5 and Conditions 4.1–4.4, we have that for n → ∞,
| (4.8) |
where [Tn(β̂)]α|γ ∈ ℝ is defined by replacing β̂0 with β̂ in (2.11).
Proof
See §5.2 for a detailed proof.
Optimality
In Lemma 5.3 in the proof of Theorem 4.6 we show that the limiting variance of the (rescaled) decorrelated score function is [I(β*)]α|γ, which is defined in (4.2). In fact, van der Vaart (2000) proves that for inferring α* in the presence of the nuisance parameter γ*, [I(β*)]α|γ is the semiparametric efficient information, i.e., the minimum limiting variance of the (rescaled) score function. Our proposed decorrelated score function achieves this information lower bound and is hence optimal. Because the decorrelated Wald test and the respective confidence interval are built on the decorrelated score function, the limiting variance of √n·[α̅(β̂, λ) − α*] and the size of the confidence interval are also optimal in terms of the semiparametric information lower bound.
4.2 Implications for Specific Models
To establish the high dimensional inference results for each model, we only need to verify Conditions 4.1–4.4 and determine the key quantities ζEM, ζG, ζT and ζL. In the following, we focus on Gaussian mixture and mixture of regression models.
Gaussian Mixture Model
The following lemma verifies Conditions 4.1 and 4.2.
Lemma 4.8
We have that Conditions 4.1 and 4.2 hold with
where ŝ, ΔGMM(s*), r and T are as defined in Theorem 3.7.
Proof
See §C.6 for a detailed proof.
By our discussion that follows Theorem 3.7, we have that ŝ and ΔGMM(s*) are of the same order as s* and respectively, and T is roughly of the order . Therefore, ζEM is roughly of the order . The following lemma verifies Condition 4.3 for Gaussian mixture model.
Lemma 4.9
We have that Condition 4.3 holds with
Proof
See §C.7 for a detailed proof.
The following lemma establishes Condition 4.4 for Gaussian mixture model.
Lemma 4.10
We have that Condition 4.4 holds with
Proof
See §C.8 for a detailed proof.
Equipped with Lemmas 4.8–4.10, we establish the inference results for Gaussian mixture model.
Theorem 4.11
Under Assumption 4.5, we have that for n→∞, (4.7) and (4.8) hold for Gaussian mixture model. Also, the proposed confidence interval for α* in (2.20) is valid.
In fact, for Gaussian mixture model we can make (4.6) in Assumption 4.5 more transparent by plugging in ζEM, ζG, ζT and ζL specified above. Particularly, for simplicity we assume all quantities except , s*, d and n are absolute constants. Then we can verify that (4.6) holds if
| (4.9) |
According to the discussion following Theorem 3.7, we require s*·log d = o(n/log n) for the estimator β̂ to be consistent. In comparison, (4.9) illustrates that high dimensional inference requires a higher sample complexity than parameter estimation. In the context of high dimensional generalized linear models, Zhang and Zhang (2014); van de Geer et al. (2014) also observe the same phenomenon.
Mixture of Regression Model
The following lemma verifies Conditions 4.1 and 4.2. Recall that ŝ, T and the related quantities are defined in Theorem 3.10, while σ denotes the standard deviation of the Gaussian noise in the mixture of regression model.
Lemma 4.12
We have that Conditions 4.1 and 4.2 hold with
where we have for the maximization implementation of the M-step (Algorithm 2), and for the gradient ascent implementation of the M-step (Algorithm 3).
Proof
See §C.9 for a detailed proof.
By our discussion that follows Theorem 3.10, we have that ŝ is of the same order as s*. For the maximization implementation of the M-step (Algorithm 2), we have that is of the same order as . Meanwhile, for the gradient ascent implementation in Algorithm 3, we have that is of the same order as s*. Hence, ζEM is of the order or correspondingly, since T is roughly of the order . The next lemma establishes Condition 4.3 for mixture of regression model.
Lemma 4.13
We have that Condition 4.3 holds with
Proof
See §C.10 for a detailed proof.
The following lemma establishes Condition 4.4 for mixture of regression model.
Lemma 4.14
We have that Condition 4.4 holds with
Proof
See §C.11 for a detailed proof.
Equipped with Lemmas 4.12–4.14, we are now ready to establish the high dimensional inference results for mixture of regression model.
Theorem 4.15
For mixture of regression model, under Assumption 4.5, both (4.7) and (4.8) hold as n → ∞. Also, the proposed confidence interval for α* in (2.20) is valid.
Similar to the discussion that follows Theorem 4.11, we can make (4.6) in Assumption 4.5 more explicit by plugging in ζEM, ζG, ζT and ζL specified in Lemmas 4.12–4.14. Assuming all quantities except , s*, d and n are absolute constants, we have that (4.6) holds if
In contrast, for high dimensional estimation, we only require (s*)2 · log d = o(n/log n) to ensure the consistency of β̂ by our discussion following Theorem 3.10.
5 Proof of Main Results
We lay out a proof sketch of the main theory. First we prove the results in Theorem 3.4 for parameter estimation and computation. Then we establish the results in Theorems 4.6 and 4.7 for inference.
5.1 Proof of Results for Computation and Estimation
Proof of Theorem 3.4
First we introduce some notation. Recall that the trunc(·, ·) function is defined in (2.7). We define β̅(t+0.5), β̅(t+1) ∈ ℝd as
| (5.1) |
As defined in (3.1) or (3.2), M(·) is the population version M-step with the maximization or gradient ascent implementation. Here 𝒮̂(t+0.5) denotes the set of indices j corresponding to the ŝ entries of β(t+0.5) with the largest magnitudes. It is worth noting that 𝒮̂(t+0.5) is computed from β(t+0.5) in the truncation step (line 6 of Algorithm 4), rather than from β̅(t+0.5) defined in (5.1).
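For concreteness, a minimal sketch of the two operators used throughout this proof, with hypothetical names supp_top and trunc mirroring supp(·, ·) in (2.6) and trunc(·, ·) in (2.7): supp(β, ŝ) collects the indices of the ŝ largest-magnitude entries of β, and trunc(β, 𝒮) zeroes out every entry outside 𝒮.

```python
import numpy as np

def supp_top(beta, s_hat):
    """Indices of the s_hat entries of beta with largest |beta_j| (cf. supp in (2.6))."""
    return np.argsort(np.abs(beta))[::-1][:s_hat]

def trunc(beta, S):
    """Keep the entries of beta indexed by S and set all others to zero (cf. trunc in (2.7))."""
    out = np.zeros_like(beta)
    out[S] = beta[S]
    return out

# toy example: the truncation step (line 6 of Algorithm 4) applied to a dense iterate
beta_half = np.array([0.1, -2.0, 0.3, 1.5, -0.2])
S_hat = supp_top(beta_half, s_hat=2)       # here the indices {1, 3}
beta_next = trunc(beta_half, S_hat)        # [0, -2.0, 0, 1.5, 0]
print(S_hat, beta_next)
```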
Our goal is to characterize the relationship between ‖β(t+1) − β*‖2 and ‖β(t) − β*‖2. According to the definition of the truncation step (line 6 of Algorithm 4) and triangle inequality, we have
| (5.2) |
where the last equality is obtained from (5.1). According to the definition of the trunc(·, ·) function in (2.7), the two terms within term (i) both have support 𝒮̂(t+0.5) with cardinality ŝ. Thus, we have
| (5.3) |
Since we have β(t+0.5) = Mn(β(t)) and β̅(t+0.5) = M(β(t)), we can further establish an upper bound for the right-hand side by invoking Condition 3.3.
Our subsequent proof will establish an upper bound for term (ii) in (5.2) in two steps. We first characterize the relationship between ‖β̅(t+1) − β*‖2 and ‖β̅(t+0.5) − β*‖2 and then the relationship between ‖β̅(t+0.5) − β*‖2 and ‖β(t) − β*‖2. The next lemma accomplishes the first step. Recall that ŝ is the sparsity parameter in Algorithm 4, while s* is the sparsity level of the true parameter β*.
Lemma 5.1
Suppose that we have
| (5.4) |
for some κ ∈ (0, 1). Assuming that we have
| (5.5) |
then it holds that
| (5.6) |
Proof
The proof is based on a fine-grained analysis of the relationship between 𝒮̂(t+0.5) and the true support 𝒮*. In particular, we focus on three index sets, namely, ℐ1 = 𝒮*\𝒮̂(t+0.5), ℐ2 = 𝒮* ∩ 𝒮̂(t+0.5) and ℐ3 = 𝒮̂(t+0.5)\𝒮*. Among them, ℐ2 characterizes the similarity between 𝒮̂(t+0.5) and 𝒮*, while ℐ1 and ℐ3 characterize their difference. The key proof strategy is to represent the three distances in (5.6) with the ℓ2-norms of the restrictions of β̅(t+0.5) and β* on these three index sets, and we focus on these restricted ℓ2-norms throughout. In order to quantify them, we establish a fine-grained characterization of the absolute values of the entries of β̅(t+0.5) that are selected and eliminated by the truncation operation β̅(t+1) ← trunc(β̅(t+0.5), 𝒮̂(t+0.5)). See §B.1 for a detailed proof.
Lemma 5.1 is central to the proof of Theorem 3.4. In detail, the assumption in (5.4) guarantees that β̅(t+0.5) is within the basin of attraction. In (5.5), the first assumption ensures that the sparsity parameter ŝ in Algorithm 4 is set to be sufficiently large, while the second ensures that the statistical error is sufficiently small. These assumptions will be verified in the proof of Theorem 3.4. The intuition behind (5.6) is as follows. Let 𝒮̅(t+0.5) = supp(β̅(t+0.5), ŝ), where supp(·, ·) is defined in (2.6). By triangle inequality, the left-hand side of (5.6) satisfies
| (5.7) |
Intuitively, the two terms on the right-hand side of (5.6) reflect terms (i) and (ii) in (5.7), respectively. In detail, for term (i) in (5.7), recall that according to (5.1) and line 6 of Algorithm 4 we have
When the sample size n is sufficiently large, β̅(t+0.5) and β(t+0.5) are close, since they are attained by the population version and the sample version of the M-step, respectively. Hence, 𝒮̅(t+0.5) = supp(β̅(t+0.5), ŝ) and 𝒮̂(t+0.5) = supp(β(t+0.5), ŝ) should be similar. Thus, in term (i), β̅(t+1) = trunc(β̅(t+0.5), 𝒮̂(t+0.5)) should be close to trunc(β̅(t+0.5), 𝒮̅(t+0.5)) up to some statistical error, which is reflected by the first term on the right-hand side of (5.6).
We now quantify the relationship between ‖β̅(t+0.5) − β*‖2 in (5.6) and term (ii) in (5.7). The truncation in term (ii) preserves the top ŝ coordinates of β̅(t+0.5) with the largest magnitudes while setting the others to zero. Intuitively speaking, the truncation adds to β̅(t+0.5)’s distance to β*. Meanwhile, note that when β̅(t+0.5) is close to β*, 𝒮̅(t+0.5) is similar to 𝒮*. Therefore, the incurred error can be controlled, because the truncation doesn’t eliminate many relevant entries. In particular, as shown in the second term on the right-hand side of (5.6), this incurred error decays as ŝ increases, since in that case 𝒮̂(t+0.5) includes more entries. According to the discussion for term (i) in (5.7), 𝒮̅(t+0.5) is similar to 𝒮̂(t+0.5), which implies that 𝒮̅(t+0.5) should also cover more entries. Thus, fewer relevant entries are wrongly eliminated by the truncation and hence the incurred error is smaller. In the extreme case where ŝ → ∞, term (ii) in (5.7) becomes ‖β̅(t+0.5) − β*‖2, which indicates that no additional error is incurred by the truncation. Correspondingly, the second term on the right-hand side of (5.6) also becomes ‖β̅(t+0.5) − β*‖2.
Next, we turn to characterize the relationship between ‖β̅(t+0.5) − β*‖2 and ‖β(t) − β*‖2. Recall β̅(t+0.5) = M(β(t)) is defined in (5.1). The next lemma, which is adapted from Theorems 1 and 3 in Balakrishnan et al. (2014), characterizes the contraction property of the population version M-step defined in (3.1) or (3.2).
Lemma 5.2
Under the assumptions of Theorem 3.4, the following results hold for β(t) ∈ ℬ.
- For the maximization implementation of the M-step (Algorithm 2), we have
| (5.8) |
- For the gradient ascent implementation of the M-step (Algorithm 3), we have
| (5.9) |
Here γ1, γ2, μ and ν are defined in Conditions 3.1 and 3.2.
Proof
The proof strategy is to first characterize the M-step using Q(·; β*). According to Condition Concavity-Smoothness(μ, ν, ℬ), −Q(·; β*) is ν-strongly convex and μ-smooth, and thus enjoys desired optimization guarantees. Then Condition Lipschitz-Gradient-1(γ1, ℬ) or Lipschitz-Gradient-2(γ2, ℬ) is invoked to characterize the difference between Q(·; β*) and Q(·; β(t)). We provide the proof in §B.2 for the sake of completeness.
Equipped with Lemmas 5.1 and 5.2, we are now ready to prove Theorem 3.4.
Proof
To unify the subsequent proof for the maximization and gradient ascent implementations of the M-step, we employ ρ ∈ (0, 1) to denote γ1/ν in (5.8) or 1 − 2·(ν − γ2)/(ν + μ) in (5.9). By the definitions of β̅(t+0.5) and β(t+0.5) in (5.1) and Algorithm 4, Condition Statistical-Error(ε, δ/T, ŝ, n/T, ℬ) implies
holds with probability at least 1 − δ/T. Then by taking a union bound we have that the event
| (5.10) |
occurs with probability at least 1 − δ. Conditioning on ℰ, in the following we prove that
| (5.11) |
by mathematical induction.
Before we lay out the proof, we first prove β(0) ∈ ℬ. Recall βinit is the initialization of Algorithm 4 and R is the radius of the basin of attraction ℬ. By the assumption of Theorem 3.4, we have
| (5.12) |
Therefore, (5.12) implies ‖βinit − β*‖2 < κ·‖β*‖2 since R = κ·‖β*‖2. Invoking the auxiliary result in Lemma B.1, we obtain
| (5.13) |
Here the second inequality is from (5.12) as well as the assumption in (3.8) or (3.11), which implies s*/ŝ ≤ (1 − κ)2/[4·(1+κ)2] ≤ 1/4. Thus, (5.13) implies β(0) ∈ ℬ. In the sequel, we prove that (5.11) holds for t = 1. By invoking Lemma 5.2 and setting t = 0 in (5.8) or (5.9), we obtain
where the second inequality is from (5.13). Hence, the assumption in (5.4) of Lemma 5.1 holds for β̅(0.5). Furthermore, by the assumptions in (3.8), (3.9), (3.11) and (3.12) of Theorem 3.4, we can also verify that the assumptions in (5.5) of Lemma 5.1 hold conditioning on the event ℰ defined in (5.10). By invoking Lemma 5.1 we have that (5.6) holds for t = 0. Further plugging ‖β(t+0.5) − β̅(t+0.5)‖∞ ≤ ε in (5.10) into (5.6) with t = 0, we obtain
| (5.14) |
Setting t = 0 in (5.8) or (5.9) of Lemma 5.2 and then plugging (5.8) or (5.9) into (5.14), we obtain
| (5.15) |
For t = 0, plugging (5.3) into term (i) in (5.2), and (5.15) into term (ii) in (5.2), and then applying ‖β(t+0.5) − β̅(t+0.5)‖∞ ≤ ε with t = 0 in (5.10), we obtain
| (5.16) |
By our assumption that ŝ ≥ 16 ·(1/ρ − 1)−2 ·s* in (3.8) or (3.11), we have in (5.16). Hence, from (5.16) we obtain
| (5.17) |
which implies that (5.11) holds for t = 1, since we have in (5.11).
Suppose we have that (5.11) holds for some t ≥ 1. By (5.11) we have
| (5.18) |
where the second inequality is from (5.13) and our assumption in (3.9) or (3.12). Therefore, by (5.18) we have β(t) ∈ ℬ. Then (5.8) or (5.9) in Lemma 5.2 implies
where the third inequality is from ρ ∈ (0, 1). Following the same proof for (5.17), we obtain
Here the second inequality is obtained by plugging in (5.11) for t. Hence we have that (5.11) holds for t + 1. By induction, we conclude that (5.11) holds conditioning on the event ℰ defined in (5.10), which occurs with probability at least 1 − δ. By plugging the specific definitions of ρ into (5.11), and applying ‖β(0) − β*‖2 ≤ R in (5.13) to the right-hand side of (5.11), we obtain (3.10) and (3.13).
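For intuition, the induction above amounts to unrolling a contraction-plus-error recursion. A generic version, written with a placeholder contraction factor ρ̃ and statistical error level ε̃ rather than the exact constants of (5.11), reads:

```latex
% Generic unrolling of a contraction-plus-error recursion; the constants here
% are placeholders and do not reproduce the exact form of (5.11).
\[
  \|\beta^{(t+1)} - \beta^*\|_2 \;\le\; \widetilde{\rho}\,\|\beta^{(t)} - \beta^*\|_2 + \widetilde{\varepsilon}
  \quad\Longrightarrow\quad
  \|\beta^{(t)} - \beta^*\|_2 \;\le\; \widetilde{\rho}^{\,t}\,\|\beta^{(0)} - \beta^*\|_2
    + \frac{\widetilde{\varepsilon}}{1-\widetilde{\rho}},
  \qquad \widetilde{\rho}\in(0,1).
\]
```

The first term decays geometrically in t, while the second term is the statistical error floor, which matches the qualitative behavior observed in Figure 1.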
5.2 Proof of Results for Inference
Proof of Theorem 4.6
We establish the asymptotic normality of the decorrelated score statistic defined in (2.11) in two steps. We first prove the asymptotic normality of , where β̂0 is defined in (2.11) and Sn(·, ·) is defined in (2.9). Then we prove that −[Tn(β̂0)]α|γ defined in (2.11) is a consistent estimator of the asymptotic variance. The next lemma accomplishes the first step. Recall that I(β*) = −𝔼β*[∇2ℓn(β*)]/n is the Fisher information for ℓn(β*) defined in (2.2).
Lemma 5.3
Under the assumptions of Theorem 4.6, we have that for n → ∞,
where [I(β*)]α|γ is defined in (4.2).
Proof
Our proof consists of two steps. Note that by the definition in (2.9) we have
| (5.19) |
Recall that w* is defined in (4.1). At the first step, we prove
| (5.20) |
In other words, replacing β̂0 and w(β̂0, λ) in (5.19) with the corresponding population quantities β* and w* only introduces an oℙ(1) error term. Meanwhile, by Lemma 2.1 we have ∇1Qn(β*; β*) = ∇ℓn(β*)/n. Recall that ℓn(·) is the log-likelihood defined in (2.2), which implies that in (5.20)
is a (rescaled) average of n i.i.d. random variables. At the second step, we calculate the mean and variance of each term within this average and invoke the central limit theorem. Finally we combine these two steps by invoking Slutsky’s theorem. See §C.3 for a detailed proof.
The next lemma establishes the consistency of −[Tn(β̂0)]α|γ for estimating [I(β*)]α|γ. Recall that [Tn(β̂0)]α|γ ∈ ℝ and [I(β*)]α|γ ∈ ℝ are defined in (2.11) and (4.2) respectively.
Lemma 5.4
Under the assumptions of Theorem 4.6, we have
| (5.21) |
Proof
For notational simplicity, we abbreviate w(β̂0, λ) in the definition of [Tn(β̂0)]α|γ as ŵ0. By (2.11) and (4.3), we have
First, we establish the relationship between ŵ0 and w* by analyzing the Dantzig selector in (2.10). Meanwhile, by Lemma 2.1 we have 𝔼β* [Tn(β*)] = −I(β*). Then by triangle inequality we have
We prove term (i) is oℙ(1) by quantifying the Lipschitz continuity of Tn(·) using Condition 4.4. We then prove term (ii) is oℙ(1) by concentration analysis. Together with the result on the relationship between ŵ0 and w* we establish (5.21). See §C.4 for a detailed proof.
Combining Lemmas 5.3 and 5.4 using Slutsky’s theorem, we obtain Theorem 4.6.
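To make the construction concrete, the sketch below assembles a studentized decorrelated score statistic from the two ingredients above: Lemma 5.3 for the numerator and Lemma 5.4 for the variance estimate. It assumes that the decorrelated score in (2.9) takes the form [∇1Qn(β; β)]α − w⊤[∇1Qn(β; β)]γ and that the decorrelated information in (2.11) takes the form [Tn(β)]α,α − w⊤[Tn(β)]γ,α; these forms are inferred from the proofs in §C.3 and §C.4 and may differ from the paper's exact definitions in normalization or sign, so the code is only schematic.

```python
import numpy as np

def decorrelated_score_statistic(grad_Q, T_n, w_hat, n, alpha_idx=0):
    """Assemble a studentized decorrelated score statistic (schematic, not the paper's exact definition).

    grad_Q : the gradient grad_1 Q_n(beta_hat_0; beta_hat_0), shape (d,)
    T_n    : the matrix T_n(beta_hat_0) from (2.8), shape (d, d)
    w_hat  : the Dantzig-type estimate w(beta_hat_0, lambda), shape (d - 1,)
    """
    d = grad_Q.shape[0]
    g = [j for j in range(d) if j != alpha_idx]
    S_n = grad_Q[alpha_idx] - w_hat @ grad_Q[g]               # assumed form of (2.9)
    T_partial = T_n[alpha_idx, alpha_idx] - w_hat @ T_n[g, alpha_idx]  # assumed form of (2.11)
    # Lemma 5.3: sqrt(n) * S_n is asymptotically N(0, [I]_{alpha|gamma});
    # Lemma 5.4: -T_partial is a consistent estimator of [I]_{alpha|gamma}.
    return np.sqrt(n) * S_n / np.sqrt(-T_partial)
```

Under the null hypothesis, the resulting statistic would be compared against standard normal quantiles, as in the decorrelated score test proposed in §2.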
Proof of Theorem 4.7
Similar to the proof of Theorem 4.6, we first prove that is asymptotically normal, where α̅(β̂, λ) is defined in (2.18), while β̂ is the estimator attained by the high dimensional EM algorithm (Algorithm 4). By Lemma 5.4, we then show that −{[Tn(β̂)]α|γ}−1 is a consistent estimator of the asymptotic variance. Here recall that [Tn(β̂)]α|γ is defined by replacing β̂0 with β̂ in (2.11). The following lemma accomplishes the first step.
Lemma 5.5
Under the assumptions of Theorem 4.7, we have that for n → ∞,
| (5.22) |
where [I(β*)]α|γ > 0 is defined in (4.2).
Proof
Our proof consists of two steps. Similar to the proof of Lemma 5.3, we first prove
Then we prove that, as n → ∞,
Combining the two steps by Slutsky’s theorem, we obtain (5.22). See §C.5 for a detailed proof.
Now we accomplish the second step for proving Theorem 4.7. Note that by replacing β̂0 with β̂ in the proof of Lemma 5.4, we have
| (5.23) |
Meanwhile, by the definitions of w* and [I(β*)]α|γ in (4.1) and (4.2), we have
| (5.24) |
Note that (4.4) in Assumption 4.5 implies that I(β*) is positive definite, which yields [I(β*)]α|γ > 0 by (5.24). Also, (4.4) gives [I(β*)]α|γ = O(1) and . Hence, by (5.23) we have
| (5.25) |
Combining (5.22) in Lemma 5.5 and (5.25) by invoking Slutsky’s theorem, we obtain Theorem 4.7.
6 Numerical Results
In this section, we lay out numerical results illustrating the computational and statistical efficiency of the methods proposed in §2. In §6.1 and §6.2, we present the results for parameter estimation and asymptotic inference respectively. Throughout §6, we focus on the high dimensional EM algorithm without resampling, which is illustrated in Algorithm 1.
6.1 Computation and Estimation
We empirically study two latent variable models: Gaussian mixture model and mixture of regression model. The first synthetic dataset is generated according to the Gaussian mixture model defined in (2.24). The second one is generated according to the mixture of regression model defined in (2.25). We set d = 256 and s* = 5. In particular, the first 5 entries of β* are set to {4, 4, 4, 6, 6}, while the remaining 251 entries are set to zero. For Gaussian mixture model, the standard deviation of V in (2.24) is set to σ = 1. For mixture of regression model, the standard deviation of the Gaussian noise V in (2.25) is σ = 0.1. For both datasets, we generate n = 100 data points.
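The two synthetic datasets can be generated directly from the model descriptions used in the proofs (yi = zi·β* + vi for the Gaussian mixture model in (2.24), and yi = zi·⟨β*, xi⟩ + vi with xi ~ N(0, Id) for the mixture of regression model in (2.25), where zi is a Rademacher variable). The sketch below uses hypothetical helper names and omits the call to the EM algorithm itself.

```python
import numpy as np

def make_beta_star(d=256):
    """beta* = [4, 4, 4, 6, 6, 0, ..., 0] in R^d, so that s* = 5."""
    beta = np.zeros(d)
    beta[:5] = [4, 4, 4, 6, 6]
    return beta

def sample_gmm(n, beta_star, sigma=1.0, rng=None):
    """Gaussian mixture model (2.24): y_i = z_i * beta* + v_i, z_i Rademacher, v_i ~ N(0, sigma^2 I)."""
    if rng is None:
        rng = np.random.default_rng(0)
    z = rng.choice([-1.0, 1.0], size=n)
    v = rng.normal(scale=sigma, size=(n, beta_star.size))
    return z[:, None] * beta_star + v

def sample_mor(n, beta_star, sigma=0.1, rng=None):
    """Mixture of regression model (2.25): y_i = z_i * <beta*, x_i> + v_i, x_i ~ N(0, I_d)."""
    if rng is None:
        rng = np.random.default_rng(0)
    x = rng.normal(size=(n, beta_star.size))
    z = rng.choice([-1.0, 1.0], size=n)
    y = z * (x @ beta_star) + rng.normal(scale=sigma, size=n)
    return x, y

beta_star = make_beta_star()
Y_gmm = sample_gmm(n=100, beta_star=beta_star)          # dataset for the Gaussian mixture experiment
X_mor, y_mor = sample_mor(n=100, beta_star=beta_star)   # dataset for the mixture of regression experiment
```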
We apply the proposed high dimensional EM algorithm to both datasets. In particular, we apply the maximization implementation of the M-step (Algorithm 2) to Gaussian mixture model, and the gradient ascent implementation of the M-step (Algorithm 3) to mixture of regression model. We set the stepsize in Algorithm 3 to be η = 1. In Figure 1, we illustrate the evolution of the optimization error ‖β(t) − β̂‖2 and the overall estimation error ‖β(t) − β*‖2 with respect to the iteration index t. Here β̂ is the final estimator and β* is the true parameter.
Figure 1.
Evolution of the optimization error ‖β(t) − β̂‖2 and overall estimation error ‖β(t) − β*‖2 (in logarithmic scale). (a) The high dimensional EM algorithm for Gaussian mixture model (with the maximization implementation of the M-step). (b) The high dimensional EM algorithm for mixture of regression model (with the gradient ascent implementation of the M-step). Here note that in (a) the optimization error for t = 5, …, 10 is truncated due to arithmetic underflow.
Figure 1 illustrates the geometric decay of the optimization error, as predicted by Theorem 3.4. As the optimization error decreases to zero, β(t) converges to the final estimator β̂ and the overall estimation error ‖β(t) − β*‖2 converges to the statistical error ‖β̂ − β*‖2. In Figure 2, we illustrate the statistical rate of convergence of ‖β̂ − β*‖2. In particular, we plot ‖β̂ − β*‖2 against with varying s* and n. Figure 2 illustrates that the statistical error exhibits a linear dependency on . Hence, the proposed high dimensional EM algorithm empirically achieves an estimator with the optimal statistical rate of convergence.
Figure 2.
Statistical error ‖β̂ − β*‖2 plotted against with fixed d=128 and varying s* and n. (a) The high dimensional EM algorithm for Gaussian mixture model (with the maximization implementation of the M-step). (b) The high dimensional EM algorithm for mixture of regression model (with the gradient ascent implementation of the M-step).
6.2 Asymptotic Inference
To examine the validity of the proposed hypothesis testing procedures, we consider the same setting as in §6.1. Recall that we have β* = [4, 4, 4, 6, 6, 0, …, 0]⊤ ∈ ℝ256. We consider the null hypothesis . We construct the decorrelated score and Wald statistics based on the estimator obtained in the previous section, and fix the significance level at δ = 0.05. We repeat the experiment 500 times and take the rejection rate as an estimate of the type I error. In Table 1, we report the type I errors of the decorrelated score and Wald tests. Table 1 illustrates that the type I errors achieved by the two tests are close to the significance level, which validates the proposed hypothesis testing procedures.
Table 1.
Type I errors of the decorrelated score and Wald tests
| | Gaussian Mixture Model | Mixture of Regression Model |
|---|---|---|
| Decorrelated Score Test | 0.052 | 0.050 |
| Decorrelated Wald Test | 0.049 | 0.049 |
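The rejection rates in Table 1 are Monte Carlo estimates: each cell is the fraction of 500 null replications on which a two-sided level-0.05 test rejects. The sketch below shows only these mechanics; the placeholder statistic is drawn as a standard normal (the asymptotic null law of the decorrelated statistics) because reproducing one replication of the full EM-plus-inference pipeline is beyond this snippet.

```python
import numpy as np
from scipy.stats import norm

def estimate_type1_error(simulate_statistic, n_reps=500, delta=0.05, rng=None):
    """Rejection frequency of a two-sided level-delta test over n_reps null replications."""
    if rng is None:
        rng = np.random.default_rng(0)
    crit = norm.ppf(1 - delta / 2)                       # two-sided critical value
    stats = np.array([simulate_statistic(rng) for _ in range(n_reps)])
    return np.mean(np.abs(stats) > crit)

# placeholder: under H0 the decorrelated score/Wald statistics are asymptotically N(0, 1),
# so a standard normal draw stands in for one replication of the full pipeline
print(estimate_type1_error(lambda rng: rng.standard_normal()))
```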
7 Conclusion
We propose a novel high dimensional EM algorithm which naturally incorporates sparsity structure. Our theory shows that, with a suitable initialization, the proposed algorithm converges at a geometric rate and achieves an estimator with the (near-)optimal statistical rate of convergence. Beyond point estimation, we further propose the decorrelated score and Wald statistics for testing hypotheses and constructing confidence intervals for low dimensional components of high dimensional parameters. We apply the proposed algorithmic framework to a broad family of high dimensional latent variable models. For these models, our framework establishes the first computationally feasible approach for optimal parameter estimation and asymptotic inference under high dimensional settings. Thorough numerical simulations are provided to back up our theory.
Acknowledgments
The authors are grateful for the support of NSF CAREER Award DMS1454377, NSF IIS1408910, NSF IIS1332109, NIH R01MH102339, NIH R01GM083084, and NIH R01HG06841.
Appendix
A Applications to Latent Variable Models
We provide the specific forms of Qn(·; ·), Mn(·) and Tn(·) for the models defined in §2.3. Recall that Qn(·; ·) is defined in (2.5), Mn(·) is defined in Algorithms 2 and 3, and Tn(·) is defined in (2.8).
A.1 Gaussian Mixture Model
Let y1, …, yn be the n realizations of Y. For the E-step (line 4 of Algorithm 1), we have
| (A.1) |
The maximization implementation (Algorithm 2) of the M-step takes the form
| (A.2) |
Meanwhile, for the gradient ascent implementation (Algorithm 3) of the M-step, we have
Here η > 0 is the stepsize. For asymptotic inference, Tn(·) in (2.8) takes the form
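Putting the E-step weight and the M-step update for this model together, the truncated EM iteration admits a compact sketch. It assumes that ωβ(y) in (A.1) is the posterior probability of the positive label, ωβ(y) = 1/(1 + exp(−2⟨β, y⟩/σ2)), and that (A.2) reduces to Mn(β) = n−1 Σi (2ωβ(yi) − 1)·yi; these forms are consistent with the bound |2·ωβ(yi) − 1| ≤ 1 used around (B.39)–(B.42), but they are stated here as assumptions rather than verbatim restatements.

```python
import numpy as np
from scipy.special import expit

def omega(beta, Y, sigma):
    """Posterior weight of the positive component for each y_i (assumed logistic form of (A.1))."""
    return expit(2.0 * (Y @ beta) / sigma**2)

def m_step_gmm(beta, Y, sigma):
    """Maximization M-step, assumed form M_n(beta) = (1/n) sum_i (2*omega_i - 1) * y_i (cf. (A.2))."""
    w = 2.0 * omega(beta, Y, sigma) - 1.0
    return (w[:, None] * Y).mean(axis=0)

def trunc_top(beta, s_hat):
    """Truncation step: keep the s_hat largest-magnitude entries (line 6 of Algorithm 4)."""
    out = np.zeros_like(beta)
    keep = np.argsort(np.abs(beta))[::-1][:s_hat]
    out[keep] = beta[keep]
    return out

def high_dim_em_gmm(Y, beta_init, sigma, s_hat, n_iter=10):
    """A sketch of the truncated EM iteration for the Gaussian mixture model (no resampling)."""
    beta = beta_init.copy()
    for _ in range(n_iter):
        beta = trunc_top(m_step_gmm(beta, Y, sigma), s_hat)
    return beta
```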
A.2 Mixture of Regression Model
Let y1, …, yn and x1, …, xn be the n realizations of Y and X. For the E-step (line 4 of Algorithm 1), we have
| (A.3) |
For the maximization implementation (Algorithm 2) of the M-step (line 5 of Algorithm 1), we have that Mn(β) = argmaxβ′ Qn(β′; β) satisfies
| (A.4) |
However, in high dimensional regimes, the sample covariance matrix Σ̂ is not invertible. To estimate the inverse covariance matrix of X, we use the CLIME estimator proposed by Cai et al. (2011), i.e.,
| (A.5) |
where ‖·‖1,1 and ‖·‖∞,∞ are the sum and maximum of the absolute values of all entries respectively, and λCLIME > 0 is a tuning parameter. Based on (A.4), we modify the maximization implementation of the M-step to be
| (A.6) |
For the gradient ascent implementation (Algorithm 3) of the M-step, we have
| (A.7) |
Here η > 0 is the stepsize. For asymptotic inference, Tn(·) in (2.8) takes the form
It is worth noting that, for the maximization implementation of the M-step, the CLIME estimator in (A.5) requires that Σ−1 is sparse, where Σ is the population covariance of X. Since we assume X ~ N(0, Id), this requirement is satisfied. Nevertheless, for more general settings where Σ−1 doesn’t possess such a sparsity structure, the gradient ascent implementation of the M-step is a better choice, since it doesn’t require inverse covariance estimation and is also more efficient in computation.
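Analogously, a sketch of the gradient ascent M-step for mixture of regression. It assumes that ωβ(x, y) in (A.3) is the logistic posterior 1/(1 + exp(−2y⟨β, x⟩/σ2)) and that the gradient in (A.7) takes the form ∇1Qn(β; β) = n−1 Σi [(2ωβ(xi, yi) − 1)·yi·xi − xi xi⊤β]; this matches the terms analyzed around (B.56) but is an assumption rather than a verbatim restatement of (A.7).

```python
import numpy as np
from scipy.special import expit

def omega_mor(beta, X, y, sigma):
    """Posterior weight of z_i = +1 for mixture of regression (assumed logistic form of (A.3))."""
    return expit(2.0 * y * (X @ beta) / sigma**2)

def grad_ascent_m_step(beta, X, y, sigma, eta):
    """One gradient ascent M-step (cf. Algorithm 3 and (A.7), under the assumed gradient form)."""
    n = y.size
    w = 2.0 * omega_mor(beta, X, y, sigma) - 1.0
    # (1/n) sum_i [(2*omega_i - 1) * y_i * x_i  -  x_i x_i^T beta]
    grad = (X.T @ (w * y)) / n - (X.T @ (X @ beta)) / n
    return beta + eta * grad
```

In Algorithm 4, this update would be followed by the truncation step trunc(·, supp(·, ŝ)), exactly as in the Gaussian mixture sketch above.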
A.3 Regression with Missing Covariates
For notational simplicity, we define zi ∈ ℝd (i = 1, …, n) as the vector with zi,j = 1 if xi,j is observed and zi,j = 0 if xi,j is missing. Let be the subvector corresponding to the observed component of xi. For the E-step (line 4 of Algorithm 1), we have
| (A.8) |
Here mβ(·, ·) ∈ ℝd and Kβ(·, ·) ∈ ℝd×d are defined as
| (A.9) |
| (A.10) |
where ⊙ denotes the Hadamard product and diag(1−zi) is defined as the d×d diagonal matrix with the entries of 1−zi ∈ ℝd on its diagonal. Note that maximizing Qn(β′; β) over β′ requires inverting , which may not be invertible in high dimensions. Thus, we consider the gradient ascent implementation (Algorithm 3) of the M-step (line 5 of Algorithm 1), in which we have
| (A.11) |
Here η > 0 is the stepsize. For asymptotic inference, we can similarly calculate Tn(·) according to its definition in (2.8). However, here we omit the detailed form of Tn(·) since it is overly complicated.
B Proof of Results for Computation and Estimation
We provide the detailed proof of the main results in §3 for computation and parameter estimation. We first lay out the proof for the general framework, and then the proof for specific models.
B.1 Proof of Lemma 5.1
Proof
Recall β̅(t+0.5) and β̅(t+1) are defined in (5.1). Note that in (5.4) of Lemma 5.1 we assume
| (B.1) |
which implies
| (B.2) |
For notational simplicity, we define
| (B.3) |
Note that θ̅ and θ* are unit vectors, while θ is not, since it is obtained by normalizing β(t+0.5) with ‖β̅(t+0.5)‖2. Recall that the supp(·, ·) function is defined in (2.6). Hence we have
| (B.4) |
where the last equality follows from line 6 of Algorithm 4. To ease the notation, we define
| (B.5) |
Let s1 = |ℐ1|, s2 = |ℐ2| and s3 = |ℐ3| correspondingly. Also, we define Δ = 〈θ̅, θ*〉. Note that
| (B.6) |
Here the first equality is from supp(θ*) = 𝒮*, the second equality is from (B.5) and the last inequality is from Cauchy-Schwarz inequality. Furthermore, from (B.6) we have
| (B.7) |
To obtain the second inequality, we expand the square and apply 2ab ≤ a2 + b2. In the equality and the last inequality of (B.7), we use the fact that θ* and θ̅ are both unit vectors.
By (2.6) and (B.4), 𝒮̂(t+0.5) contains the index j’s with the top ŝ largest . Therefore, we have
| (B.8) |
because from (B.5) we have ℐ3 ⊆ 𝒮̂(t+0.5) and ℐ1 ∩ 𝒮̂(t+0.5) = ∅. Taking square roots of both sides of (B.8) and then dividing them by ‖β̅(t+0.5)‖2 (which is nonzero according to (B.2)), by the definition of θ in (B.3) we obtain
| (B.9) |
Equipped with (B.9), we now quantify the relationship between ‖θ̅ℐ3‖2 and ‖θ̅ℐ1‖2. For notational simplicity, let
| (B.10) |
Note that we have
which implies
| (B.11) |
where the second inequality is obtained from (B.9), while the first and third are from triangle inequality. Plugging (B.11) into (B.7), we obtain
Since by definition we have Δ = 〈θ̅, θ*〉 ∈ [−1, 1], solving for ‖θ̅ℐ1‖2 in the above inequality yields
| (B.12) |
Here we employ the fact that s1 ≤ s* and s1/s3 ≤ (s1 + s2)/(s3 + s2) = s*/ŝ, which follows from (B.5) and our assumption in (5.5) that s*/ŝ ≤ (1 − κ)2/[4·(1 + κ)2] < 1.
In the following, we prove that the right-hand side of (B.12) is upper bounded by Δ, i.e.,
| (B.13) |
We can verify that a sufficient condition for (B.13) to hold is that
| (B.14) |
which is obtained by solving for Δ in (B.13). In solving for Δ, we use the fact that , which holds because
| (B.15) |
The first inequality is from our assumption in (5.5) that s*/ŝ ≤ (1 − κ)2/[4·(1 + κ)2] < 1. The equality is from the definition of ε̃ in (B.10). The second inequality follows from our assumption in (5.5) that
and the first inequality in (B.2). To prove the last inequality in (B.15), we note that (B.1) implies
This together with (B.3) implies
| (B.16) |
where in the second inequality we use both sides of (B.2). In summary, we have that (B.15) holds. Now we verify that (B.14) holds. By (B.15) we have
which implies . For the right-hand side of (B.14) we have
| (B.17) |
where the last inequality is obtained by plugging in . Meanwhile, note that we have
| (B.18) |
where the first inequality is from our assumption in (5.5) that s*/ŝ ≤ (1 − κ)2/[4·(1 + κ)2], while the last inequality is from (B.16). Combining (B.17) and (B.18), we then obtain (B.14). By (B.14) we further establish (B.13), i.e., the right-hand side of (B.12) is upper bounded by Δ, which implies
| (B.19) |
Furthermore, according to (B.6) we have
| (B.20) |
where in the last inequality we use the fact θ* and θ̅ are unit vectors. Now we solve for in (B.20). According to (B.19) and the fact that , on the right-hand side of (B.20) we have . Thus, we have
Further by solving for in the above inequality, we obtain
| (B.21) |
where in the second inequality we use the fact that Δ ≤ 1, which follows from its definition, while in the last inequality we plug in (B.12). Then combining (B.12) and (B.21), we obtain
| (B.22) |
Note that by (5.1) and the definition of θ̅ in (B.3), we have
Therefore, we have
where the second and third equalities follow from (B.5). Let χ̅ = ‖β̅(t+0.5)‖2 · ‖β*‖2. Plugging (B.22) into the right-hand side of the above inequality and then multiplying χ̅ on both sides, we obtain
| (B.23) |
For term (i) in (B.23), note that . By (B.3) and the definition that Δ = 〈θ̅, θ*〉, for term (i) we have
| (B.24) |
For term (ii) in (B.23), by the definition of ε̃ in (B.10) we have
| (B.25) |
where the last inequality is obtained from (B.2). Plugging (B.24) and (B.25) into (B.23), we obtain
| (B.26) |
Meanwhile, according to (5.1) we have that β̅(t+1) is obtained by truncating β̅(t+0.5), which implies
| (B.27) |
Subtracting two times both sides of (B.26) from (B.27), we obtain
We can easily verify that the above inequality implies
Taking square roots of both sides and utilizing the fact that , we obtain
| (B.28) |
where C > 0 is an absolute constant. Here we utilize the fact that and
both of which follow from our assumption that s*/ŝ ≤ (1 − κ)2/[4·(1 + κ)2] < 1 in (5.5). By (B.28) we conclude the proof of Lemma 5.1.
B.2 Proof of Lemma 5.2
In the following, we prove (5.8) and (5.9) for the maximization and gradient ascent implementations of the M-step, respectively.
Proof of (5.8)
To prove (5.8) for the maximization implementation of the M-step (Algorithm 2), note that by the self-consistency property (McLachlan and Krishnan, 2007) we have
| (B.29) |
Hence, β* satisfies the following first-order optimality condition
where ∇1Q(·, ·) denotes the gradient taken with respect to the first variable. In particular, it implies
| (B.30) |
Meanwhile, by (5.1) and the definition of M(·) in (3.1), we have
Hence we have the following first-order optimality condition
which implies
| (B.31) |
Combining (B.30) and (B.31), we then obtain
which implies
| (B.32) |
In the following, we establish upper and lower bounds for both sides of (B.32) correspondingly. By applying Condition Lipschitz-Gradient-1(γ1, ℬ), for the right-hand side of (B.32) we have
| (B.33) |
where the last inequality is from (3.3). Meanwhile, for the left-hand side of (B.32), we have
| (B.34) |
| (B.35) |
by (3.6) in Condition Concavity-Smoothness(μ, ν, ℬ). By adding (B.34) and (B.35), we obtain
| (B.36) |
Plugging (B.33) and (B.36) into (B.32), we obtain
which implies (5.8) in Lemma 5.2.
Proof of (5.9)
We turn to prove (5.9). The self-consistency property in (B.29) implies that β* is the maximizer of Q(·; β*). Furthermore, (3.5) and (3.6) in Condition Concavity-Smoothness(μ, ν, ℬ) ensure that −Q(·; β*) is μ-smooth and ν-strongly convex. By invoking standard optimization results for minimizing strongly convex and smooth objective functions, e.g., in Nesterov (2004), for stepsize η = 2/(ν + μ), we have
| (B.37) |
i.e., the gradient ascent step decreases the distance to β* by a multiplicative factor. Hence, for the gradient ascent implementation of the M-step, i.e., M(·) defined in (3.2), we have
| (B.38) |
where the last inequality is from (B.37) and (3.4) in Condition Lipschitz-Gradient-2(γ2, ℬ). Plugging η = 2/(ν + μ) into (B.38), we obtain
which implies (5.9). Thus, we conclude the proof of Lemma 5.2.
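For reference, the standard contraction bound from Nesterov (2004) that (B.37) instantiates can be written, in generic notation, as follows; (B.37) applies it to f = −Q(·; β*) with minimizer β*.

```latex
% Standard contraction of one gradient step on a nu-strongly convex, mu-smooth f
% with stepsize eta = 2/(nu + mu); (B.37) is this bound with f = -Q(.; beta*).
\[
  \bigl\| x - \eta \nabla f(x) - x^* \bigr\|_2
  \;\le\; \frac{\mu - \nu}{\mu + \nu}\, \| x - x^* \|_2 ,
  \qquad \eta = \frac{2}{\nu + \mu}.
\]
```

Combining this contraction factor with the γ2 perturbation allowed by Condition Lipschitz-Gradient-2(γ2, ℬ) yields the rate 1 − 2·(ν − γ2)/(ν + μ) appearing in (5.9).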
B.3 Auxiliary Lemma for Proving Theorem 3.4
The following lemma characterizes the initialization step in line 2 of Algorithm 4.
Lemma B.1
Suppose that we have ‖βinit − β*‖2 ≤ κ · ‖β*‖2 for some κ ∈ (0, 1). Assuming that ŝ ≥ 4 · (1 + κ)2/(1 − κ)2 · s*, we have .
Proof
Following the same proof of Lemma 5.1 with both β̅(t+0.5) and β(t+0.5) replaced with βinit, β̅(t+1) replaced with β(0) and 𝒮̂(t+0.5) replaced with 𝒮̂init, we reach the conclusion.
B.4 Proof of Lemma 3.6
Proof
Recall that Q(·; ·) is the expectation of Qn(·; ·). According to (A.1) and (3.1), we have
with ωβ(·) being the weight function defined in (A.1), which together with (A.2) implies
| (B.39) |
Recall yi is the i-th realization of Y, which follows the mixture distribution. For any u > 0, we have
| (B.40) |
Based on (B.39), we apply the symmetrization result in Lemma D.4 to the right-hand side of (B.40). Then we have
| (B.41) |
where ξ1, …, ξn are i.i.d. Rademacher random variables that are independent of y1, …, yn. Then we invoke the contraction result in Lemma D.5 by setting
where u is the variable of the moment generating function in (B.40). From the definition of ωβ(·) in (A.1) we have |2 · ωβ(yi) − 1| ≤ 1, which implies
Therefore, by Lemma D.5 we obtain
| (B.42) |
for the right-hand side of (B.41), where j ∈ {1, …, d}. Here note that in Gaussian mixture model we have , where zi is a Rademacher random variable and υi,j ~ N(0, σ2). Therefore, according to Example 5.8 in Vershynin (2010) we have and ‖υi,j‖ψ2 ≤ C · σ. Hence by Lemma D.1 we have
Since |ξi · yi,j| = |yi,j|, ξi · yi,j and yi,j have the same ψ2-norm. Because ξi is a Rademacher random variable independent of yi,j, we have 𝔼(ξi · yi,j) = 0. By Lemma 5.5 in Vershynin (2010), we obtain
| (B.43) |
Hence, for the right-hand side of (B.42) we have
| (B.44) |
Here the last inequality is obtained by plugging (B.43) with u′ = 2 · u/n and u′ = −2 · u/n respectively into the two terms. Plugging (B.44) into (B.42) and then into (B.41), we obtain
By Chernoff bound we have that, for all u > 0 and υ > 0,
Minimizing the right-hand side over u we obtain
Setting the right-hand side to be δ, we have that
holds for some absolute constants C, C′ and C″, which completes the proof of Lemma 3.6.
B.5 Proof of Lemma 3.9
In the sequel, we first establish the result for the maximization implementation of the M-step and then for the gradient ascent implementation.
Maximization Implementation
For the maximization implementation we need to estimate the inverse covariance matrix Θ* = Σ−1 with the CLIME estimator Θ̂ defined in (A.5). The following lemma from Cai et al. (2011) quantifies the statistical rate of convergence of Θ̂. Recall that ‖·‖1,∞ is defined as the maximum of the row ℓ1-norms of a matrix.
Lemma B.2
For Σ = Id and in (A.5), we have that
holds with high probability, where C and C′ are positive absolute constants.
Proof
See the proof of Theorem 6 in Cai et al. (2011) for details.
Now we are ready to prove (3.21) of Lemma 3.9.
Proof
Recall that Q(·; ·) is the expectation of Qn(·; ·). According to (A.3) and (3.1), we have
| (B.45) |
with ωβ(·, ·) being the weight function defined in (A.3), which together with (A.6) implies
Here Θ̂ is the CLIME estimator defined in (A.5). For notational simplicity, we denote
| (B.46) |
It is worth noting that both ω̅i and ω̅ depend on β. Note that we have
| (B.47) |
Analysis of Term (i)
For term (i) in (B.47), recall that by our model assumption we have Σ = Id, which implies Θ* = Σ−1 = Id. By Lemma B.2, for a sufficiently large sample size n, we have that
| (B.48) |
holds with probability at least 1 − δ/4.
Analysis of Term (ii)
For term (ii) in (B.47), we have that for u > 0,
| (B.49) |
where ξ1,…, ξn are i.i.d. Rademacher random variables. The last inequality follows from Lemma D.4. Furthermore, for the right-hand side of (B.49), we invoke the contraction result in Lemma D.5 by setting
where u is the variable of the moment generating function in (B.49). From the definitions in (A.3) and (B.46) we have |ω̅i| = |2 · ωβ(xi, yi) − 1 | ≤ 1, which implies
By Lemma D.5, we obtain
| (B.50) |
for j ∈ {1,…, d} on the right-hand side of (B.49). Recall that in mixture of regression model we have yi = zi · 〈β*, xi〉 + υi, where zi is a Rademacher random variable, υi ~ N(0, σ2), and xi ~ N(0, Id). Then by Example 5.8 in Vershynin (2010) we have ‖zi · 〈β*, xi〉‖ψ2 = ‖〈β*, xi〉‖ψ2 ≤ C · ‖β*‖2 and ‖υi,j‖ψ2 ≤ C′ · σ. By Lemma D.1 we further have
Note that we have ‖xi,j‖ψ2 ≤ C″ since xi,j ~ N(0, 1). Therefore, by Lemma D.2 we have
Since ξi is a Rademacher random variable independent of yi · xi,j, we have 𝔼(ξi ·yi ·xi,j) = 0. Hence, by Lemma 5.15 in Vershynin (2010), we obtain
| (B.51) |
for all |u′| ≤ C′/ max {, 1}. Hence we have
| (B.52) |
The last inequality is obtained by plugging (B.51) with u′ = 2·u/n and u′ = −2·u/n correspondingly into the two terms. Here . Plugging (B.52) into (B.50) and further into (B.49), we obtain
By Chernoff bound we have that, for all υ > 0 and ,
Minimizing the right-hand side over u, we have that, for ,
Setting the right-hand side of the above inequality to be δ/2, we have that
| (B.53) |
holds with probability at least 1 − δ/2 for a sufficiently large n.
Analysis of Term (iii)
For term (iii) in (B.47), by Lemma B.2 we have
| (B.54) |
with probability at least 1 − δ/4 for a sufficiently large n.
Analysis of Term (iv)
For term (iv) in (B.47), recall that by (B.45) and (B.46) we have
which implies
| (B.55) |
where the first inequality follows from triangle inequality and ‖·‖∞ ≤ ‖·‖2, the second inequality is from the proof of (5.8) in Lemma 5.2 with β̅(t+0.5) replaced with β and the fact that γ1/ν < 1 in (5.8), and the third inequality holds since in Condition Statistical-Error(ε, δ, s, n, ℬ) we suppose that β ∈ ℬ, and for mixture of regression model ℬ is specified in (3.20).
Plugging (B.48), (B.53), (B.54) and (B.55) into (B.47), by union bound we have that
holds with probability at least 1 − δ. Therefore, we conclude the proof of (3.21) in Lemma 3.9.
Gradient Ascent Implementation
In the following, we prove (3.22) in Lemma 3.9.
Proof
Recall that Q(·; ·) is the expectation of Qn(·; ·). According to (A.3) and (3.2), we have
with ωβ(·, ·) being the weight function defined in (A.3), which together with (A.7) implies
| (B.56) |
Here η > 0 denotes the stepsize in Algorithm 3.
Analysis of Term (i)
For term (i) in (B.56), we redefine ω̅i and ω̅ in (B.46) as
| (B.57) |
Note that |ω̅i| = |2 · ωβ(xi, yi)| ≤ 2. Following the same argument as for the upper bound of term (ii) in (B.47), under the new definitions of ω̅i and ω̅ in (B.57) we have that
holds with probability at least 1 − δ/2.
Analysis of Term (ii)
For term (ii) in (B.56), we have
For term (ii).a, recall by our model assumption we have 𝔼(X·X⊤) = Id and xi’s are the independent realizations of X. Hence we have
Since Xj, Xk ~ N(0, 1), according to Example 5.8 in Vershynin (2010) we have ‖Xj‖ψ2 = ‖Xk‖ψ2 ≤ C. By Lemma D.2, Xj ·Xk is a sub-exponential random variable with ‖Xj ·Xk‖ψ1 ≤ C′. Moreover, we have ‖Xj ·Xk − 𝔼(Xj · Xk)‖ψ1 ≤ C″ by Lemma D.3. Then by Bernstein’s inequality (Proposition 5.16 in Vershynin (2010)) and union bound, we have
for 0 < υ ≤ C′ and a sufficiently large sample size n. Setting its right-hand side to be δ/2, we have
holds with probability at least 1 − δ/2. For term (ii).b we have , since in Condition Statistical-Error(ε, δ, s, n, ℬ) we assume ‖β‖0 ≤ s. Furthermore, we have ‖β‖2 ≤ ‖β*‖2 + ‖β* − β‖2 ≤ (1 + 1/32) · ‖β*‖2, because in Condition Statistical-Error(ε, δ, s, n, ℬ) we assume that β ∈ ℬ, and for mixture of regression model ℬ is specified in (3.20).
Plugging the above results into the right-hand side of (B.56), by union bound we have that
holds with probability at least 1 − δ. Therefore, we conclude the proof of (3.22) in Lemma 3.9.
B.6 Proof of Lemma 3.12
Proof
Recall that Q(·; ·) is the expectation of Qn(·; ·). Let Xobs be the subvector corresponding to the observed component of X in (2.26). By (A.8) and (3.2), we have
with η > 0 being the stepsize in Algorithm 3, which together with (A.11) implies
| (B.58) |
Here Kβ(·, ·) ∈ ℝd×d and mβ(·, ·) ∈ ℝd are defined in (A.9) and (A.10). To ease the notation, let
| (B.59) |
Let the entries of Z∈ℝd be i.i.d. Bernoulli random variables, each of which is zero with probability pm, and z1,…, zn be the n independent realizations of Z. If zi,j = 1, then xi,j is observed, otherwise xi,j is missing (unobserved).
Analysis of Term (i)
For term (i) in (B.58), recall that we have
where diag(1 − Z) denotes the d × d diagonal matrix with the entries of 1 − Z ∈ ℝd on its diagonal, and ⊙ denotes the Hadamard product. Therefore, by union bound we have
| (B.60) |
According to Lemma B.3 we have, for all j ∈ {1,…, d}, m̅j is a zero-mean sub-Gaussian random variable with ‖m̅j‖ψ2 ≤ C · (1 + κ · r). Then by Lemmas D.2 and D.3 we have
Meanwhile, since |1 − Zj | ≤ 1, it holds that ‖(1 − Zj) · m̅j‖ψ2 ≤ ‖m̅j‖ψ2 ≤ C · (1 + κ · r). Similarly, by Lemmas D.2 and D.3 we have
In addition, for the first term on the right-hand side of (B.60) we have
where the first inequality is from Lemma D.3 and the second inequality follows from Example 5.8 in Vershynin (2010) since [diag(1−Z)]j,k ∈ [0, 1]. Applying Hoeffding’s inequality to the first term on the right-hand side of (B.60) and Bernstein’s inequality to the second and third terms, we obtain
Setting the two terms on the right-hand side of the above inequality to be δ/6 and δ/3 respectively, we have that
holds with probability at least 1 − δ/2, for sufficiently large constant C″ and sample size n.
Analysis of Term (ii)
For term (ii) in (B.58) we have
where the first inequality holds because in Condition Statistical-Error(ε, δ, s, n, ℬ) we assume ‖β‖0 ≤ s, while the last inequality holds since in Condition Statistical-Error(ε, δ, s, n, ℬ) we assume that β ∈ ℬ, and for regression with missing covariates ℬ is specified in (3.27).
Analysis of Term (iii)
For term (iii) in (B.58), by union bound we have
| (B.61) |
for a sufficiently large n. Here the second inequality is from Bernstein’s inequality, since we have
Here the first inequality follows from Lemmas D.2 and D.3 and the second inequality is from Lemma B.3 and the fact that , because X ~ N(0, Id) and V ~ N(0, σ2) are independent. Setting the right-hand side of (B.61) to be δ/2, we have that
holds with probability at least 1 − δ/2 for sufficiently large constant C and sample size n.
Finally, plugging in the upper bounds for terms (i)–(iii) in (B.58) we have that
holds with probability at least 1 − δ. Therefore, we conclude the proof of Lemma 3.12.
The following auxiliary lemma used in the proof of Lemma 3.12 characterizes the sub-Gaussian property of m̅ defined in (B.59).
Lemma B.3
Under the assumptions of Lemma 3.12, the random vector m̅ ∈ ℝd defined in (B.59) is sub-Gaussian with mean zero and ‖m̅j‖ψ2 ≤ C · (1 + κ · r) for all j ∈ {1,…, d}, where C > 0 is a sufficiently large absolute constant.
Proof
In the proof of Lemma 3.12, Z’s entries are i.i.d. Bernoulli random variables, which satisfy ℙ(Zj = 0) = pm for all j ∈ {1,…, d}. Meanwhile, from the definitions in (A.9) and (B.59), we have
Since we have Y = 〈β*, X〉 + V and − 〈β, Z ⊙ X〉 = − 〈β, X〉 + 〈β, (1 − Z) ⊙ X〉, it holds that
| (B.62) |
Since X ~ N(0, Id) is independent of Z, we can verify that 𝔼(m̅j) = 0. In the following we provide upper bounds for the ψ2-norms of terms (i)–(iv) in (B.62).
Analysis of Terms (i) and (ii)
For term (i), since Xj ~ N(0, 1), we have ‖Zj · Xj‖ψ2 ≤ ‖Xj‖ψ2 ≤ C, where the last inequality follows from Example 5.8 in Vershynin (2010). Meanwhile, for term (ii) in (B.62), we have that for any u > 0,
| (B.63) |
where the second last inequality is from the fact that ‖V‖ψ2 ≤ C″ · σ, while the last inequality holds because we have
| (B.64) |
According to Lemma 5.5 in Vershynin (2010), by (B.63) we conclude that the ψ2-norm of term (ii) in (B.62) is upper bounded by some absolute constant C > 0.
Analysis of Term (iii)
For term (iii), by the same conditioning argument in (B.63), we have
| (B.65) |
where we utilize the fact that ‖〈β* − β, X〉‖ψ2 ≤ C′· ‖β* − β‖2, since . Note that in Condition Statistical-Error(ε, δ, s, n, ℬ) we assume β ∈ ℬ, and for regression with missing covariates ℬ is specified in (3.27). Hence we have
where the first inequality follows from (3.27), and the second inequality follows from our assumption that ‖β*‖2/σ ≤ r on the maximum signal-to-noise ratio. By invoking (B.64), from (B.65) we have
which further implies that the ψ2-norm of term (iii) in (B.62) is upper bounded by C″ · κ · r.
Analysis of Term (iv)
For term (iv), note that 〈β, (1−Z) ⊙ X〉 = 〈(1−Z) ⊙ β, X〉. By invoking the same conditioning argument in (B.63), we have
| (B.66) |
where we utilize the fact that ‖〈(1 − Z) ⊙ β, X〉‖ψ2 ≤ C′· ‖(1 − Z) ⊙ β‖2 when conditioning on Z. By (B.66) we have that the ψ2-norm of term (iv) in (B.62) is upper bounded by C″ > 0.
Combining the above upper bounds for the ψ2-norms of terms (i)–(iv) in (B.62), by Lemma D.1 we have that
holds for a sufficiently large absolute constant C″>0. Therefore, we establish Lemma B.3.
C Proof of Results for Inference
In the following, we provide the detailed proof of the theoretical results for asymptotic inference in §4. We first present the proof of the general results, and then the proof for specific models.
C.1 Proof of Lemma 2.1
Proof
In the sequel we establish the two equations in (2.12) respectively.
Proof of the First Equation
According to the definition of the lower bound function Qn(·; ·) in (2.5), we have
| (C.1) |
Here kβ(z | yi) is the density of the latent variable Z conditioning on the observed variable Y = yi under the model with parameter β. Hence we obtain
| (C.2) |
where hβ(yi) is the marginal density function of Y evaluated at yi, and the second equality follows from the fact that
| (C.3) |
since kβ(z | yi) is the conditional density. According to the definition in (2.2), we have
| (C.4) |
where the last equality is from (2.1). Comparing (C.2) and (C.4), we obtain ∇1Qn(β; β)=∇ℓn(β)/n.
Proof of the Second Equation
For the second equation in (2.12), by (C.1) and (C.3) we have
By calculation we obtain
| (C.5) |
Here ⊗ denotes the vector outer product. Note that in (C.5) we have
| (C.6) |
where v⊗2 denotes v ⊗ v. Here S̃β(·, ·) is defined as
| (C.7) |
i.e., the score function for the complete likelihood, which involves both the observed variable Y and the latent variable Z. Meanwhile, in (C.5) we have
| (C.8) |
where in the last equality we utilize the fact that
| (C.9) |
because hβ(·) is the marginal density function of Y. By (C.3) and (C.7), for the right-hand side of (C.8) we have
| (C.10) |
Plugging (C.10) into (C.8) and then plugging (C.6) and (C.8) into (C.5) we obtain
Setting β = β* in the above equality, we obtain
| (C.11) |
Meanwhile, for β = β*, by the property of Fisher information we have
| (C.12) |
Here the last equality is obtained by taking β = β* in
where the second equality follows from (C.9), the third equality follows from (C.3), while the second last equality follows from (C.7). Combining (C.11) and (C.12), by the law of total variance we have
| (C.13) |
In the following, we prove
| (C.14) |
According to (C.1) we have
| (C.15) |
Let ℓ̃(β)=log fβ(Y, Z) be the complete log-likelihood, which involves both the observed variable Y and the latent variable Z, and Ĩ (β) be the corresponding Fisher information. By setting β = β* in (C.15) and taking expectation, we obtain
| (C.16) |
Since S̃β(Y, Z) defined in (C.7) is the score function for the complete log-likelihood ℓ̃(β), according to the relationship between the score function and Fisher information, we have
which together with (C.16) implies (C.14). By further plugging (C.14) into (C.13), we obtain
which establishes the first equality of the second equation in (2.12). In addition, the second equality of the second equation in (2.12) follows from the property of Fisher information. Thus, we conclude the proof of Lemma 2.1.
C.2 Auxiliary Lemmas for Proving Theorems 4.6 and 4.7
In this section, we lay out several lemmas on the Dantzig selector defined in (2.10). The first lemma, which is from Bickel et al. (2009), characterizes the cone condition for the Dantzig selector.
Lemma C.1
Any feasible solution w in (2.10) satisfies
where w(β, λ) is the minimizer of (2.10), 𝒮w is the support of w and is its complement.
Proof
See Lemma B.3 in Bickel et al. (2009) for a detailed proof.
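For completeness, the Dantzig-type program in (2.10) can be solved as a small linear program. The sketch below assumes it takes the form min ‖w‖1 subject to ‖[Tn(β)]γ,α − [Tn(β)]γ,γ·w‖∞ ≤ λ, which mirrors the population identity [I(β*)]γ,α − [I(β*)]γ,γ·w* = 0 used in the proof of Lemma C.3 below; the exact formulation in (2.10) may differ in sign or normalization, so treat this as schematic.

```python
import numpy as np
from scipy.optimize import linprog

def dantzig_selector(A, b, lam):
    """Solve  min ||w||_1  s.t.  ||b - A w||_inf <= lam  as an LP in (w, u).

    A sketch of a Dantzig-type program of the kind in (2.10), with
    A = [T_n(beta)]_{gamma,gamma} and b = [T_n(beta)]_{gamma,alpha}.
    """
    m, p = A.shape
    c = np.concatenate([np.zeros(p), np.ones(p)])        # minimize sum(u)
    I = np.eye(p)
    A_ub = np.vstack([
        np.hstack([I, -I]),                              #  w - u <= 0
        np.hstack([-I, -I]),                             # -w - u <= 0
        np.hstack([A, np.zeros((m, p))]),                #  A w <= b + lam
        np.hstack([-A, np.zeros((m, p))]),               # -A w <= -b + lam
    ])
    b_ub = np.concatenate([np.zeros(2 * p), b + lam, -b + lam])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * (2 * p), method="highs")
    assert res.success, "LP did not converge; try a larger lam"
    return res.x[:p]
```

In the inferential procedures, ŵ = w(β̂, λ) obtained from such a program plays the role of the population quantity w*.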
In the sequel, we focus on analyzing w(β̂, λ). The results for w(β̂0, λ) can be obtained similarly. The next lemma characterizes the restricted eigenvalue of Tn(β̂), which is defined as
| (C.17) |
Here is the support of w* defined in (4.1).
Lemma C.2
Under Assumption 4.5 and Conditions 4.1, 4.3 and 4.4, for a sufficiently large sample size n, we have ρ̂min ≥ ρmin/2 > 0 with high probability, where ρmin is specified in (4.4).
Proof
By triangle inequality we have
| (C.18) |
where 𝒞 is defined in (C.17).
Analysis of Term (i)
For term (i) in (C.18), by (4.4) in Assumption 4.5 we have
| (C.19) |
Analysis of Term (ii)
For term (ii) in (C.18) we have
| (C.20) |
By the definition of 𝒞 in (C.17), for any v ∈ 𝒞 we have
Therefore, the right-hand side of (C.20) is upper bounded by
For term (ii).a, by Lemma 2.1 and Condition 4.3 we have
where the last equality is from (4.6) in Assumption 4.5, since for λ specified in (4.5) we have
For term (ii).b, by Conditions 4.1 and 4.4 we have
where the last equality is also from (4.6) in Assumption 4.5, since for λ specified in (4.5) we have
Hence, term (ii) in (C.18) is oℙ(1). Since ρmin is an absolute constant, for a sufficiently large n we have that term (ii) is upper bounded by ρmin/2 with high probability. Further by plugging this and (C.19) into (C.18), we conclude that ρ̂min ≥ ρmin/2 holds with high probability.
The next lemma quantifies the statistical accuracy of w(β̂, λ), where w(·, ·) is defined in (2.10).
Lemma C.3
Under Assumption 4.5 and Conditions 4.1–4.4, for λ specified in (4.5) we have that
holds with high probability. Here ρmin is specified in (4.4), while w* and are defined (4.1).
Proof
For λ specified in (4.5), we verify that w* is a feasible solution in (2.10) with high probability. For notational simplicity, we define the following event
| (C.21) |
By the definition of w* in (4.1), we have [I(β*)]γ,α − [I(β*)]γ,γ · w* = 0. Hence, we have
| (C.22) |
where the last inequality is from triangle inequality and Hölder’s inequality. Note that we have
| (C.23) |
On the right-hand side, by Lemma 2.1 and Condition 4.3 we have
while by Conditions 4.1 and 4.4 we have
Plugging the above equations into (C.23) and further plugging (C.23) into (C.22), by (4.5) we have
holds with high probability for a sufficiently large absolute constant C ≥ 1. In other words, ℰ occurs with high probability. The subsequent proof conditions on ℰ and the following event
| (C.24) |
which also occurs with high probability according to Lemma C.2. Here ρ̂min is defined in (C.17).
For notational simplicity, we denote w(β̂, λ)= ŵ. By triangle inequality we have
| (C.25) |
where the last inequality follows from (2.10) and (C.21). Moreover, by (C.17) and (C.24) we have
| (C.26) |
Meanwhile, by Lemma C.1 we have
Plugging the above inequality into (C.26), we obtain
| (C.27) |
Note that by (C.25), the left-hand side of (C.27) is upper bounded by
| (C.28) |
By (C.27) and (C.28), we then obtain conditioning on ℰ and ℰ′, both of which hold with high probability. Note that the proof for w(β̂0, λ) follows similarly. Therefore, we conclude the proof of Lemma C.3.
C.3 Proof of Lemma 5.3
Proof
Our proof strategy is as follows. First we prove that
| (C.29) |
where β* is the true parameter and w* is defined in (4.1). We then prove
| (C.30) |
where [I(β*)]α|γ is defined in (4.2). Throughout the proof, we abbreviate w(β̂0, λ) as ŵ0. Also, it is worth noting that our analysis is under the null hypothesis where β* = [α*, (γ*)⊤]⊤ with α* = 0.
Proof of (C.29)
For (C.29), by the definition of the decorrelated score function in (2.9) we have
By the mean-value theorem, we obtain
| (C.31) |
where we have as defined in (2.8), and β♯ is an intermediate value between β* and β̂0.
Analysis of Term (i)
For term (i) in (C.31), we have
| (C.32) |
For the right-hand side of (C.32), we have
| (C.33) |
By Lemma C.3, we have , where λ is specified in (4.5). Meanwhile, we have
where the first equality follows from the self-consistency property (McLachlan and Krishnan, 2007) that β* = argmaxβ Q(β; β*), which gives ∇1Q(β*; β*) = 0. Here the last equality is from Condition 4.2. Therefore, (C.33) implies
where the second equality is from in (4.6) of Assumption 4.5. Thus, by (C.32) we conclude that term (i) in (C.31) equals
Analysis of Term (ii)
By triangle inequality, term (ii) in (C.31) is upper bounded by
| (C.34) |
By Hölder’s inequality, term (ii).a in (C.34) is upper bounded by
| (C.35) |
where the first inequality holds because ŵ0 = w(β̂0, λ) is a feasible solution in (2.10). Meanwhile, Condition 4.1 gives ‖β̂ − β*‖1 = Oℙ(ζEM). Also note that by definition we have (β̂0)α = (β*)α = 0, which implies ‖β̂0 − β*‖1 ≤ ‖β̂ − β*‖1. Hence, we have
| (C.36) |
which implies the first equality in (C.35). The last equality in (C.35) follows from in (4.6) of Assumption 4.5. Note that term (ii).b in (C.34) is upper bounded by
| (C.37) |
For the first term on the right-hand side of (C.37), by triangle inequality we have
By Condition 4.4, we have
| (C.38) |
and
| (C.39) |
where the last inequality in (C.39) holds because β♯ is defined as an intermediate value between β* and β̂0. Further by plugging (C.36) into (C.38), (C.39) as well as the second term on the right-hand side of (C.37), we have that term (ii).b in (C.34) is Oℙ [ζL · (ζEM)2]. Moreover, by our assumption in (4.6) of Assumption 4.5 we have
Thus, we conclude that term (ii).b is . Similarly, term (ii).c in (C.34) is upper bounded by
| (C.40) |
By triangle inequality and Lemma C.3, the first term in (C.40) is upper bounded by
Meanwhile, for the second and third terms in (C.40), by the same analysis for term (ii).b in (C.34) we have
By (4.6) in Assumption 4.5, since , we have
Therefore, term (ii).c in (C.34) is . Hence, by (C.34) we conclude that term (ii) in (C.31) is . Combining the analysis for terms (i) and (ii) in (C.31), we then obtain (C.29). In the sequel, we turn to prove the second part on asymptotic normality.
Proof of (C.30)
Note that by Lemma 2.1, we have
| (C.41) |
Recall that ℓn(β*) is the log-likelihood function defined in (2.2). Hence, [1, −(w*)⊤] · ∇ℓn(β*)/n is the average of n independent random variables. Meanwhile, the score function has mean zero at β*, i.e., 𝔼[∇ℓn(β*)] = 0. For the variance of the rescaled average in (C.41), we have
Here the second equality is from the fact that the covariance of the score function equals the Fisher information (up to renormalization). Hence, the variance of each item in the average in (C.41) is
where the second and third equalities are from (4.1) and (4.2). Hence, by the central limit theorem we obtain (C.30). Finally, combining (C.29) and (C.30) by invoking Slutsky’s theorem, we obtain
which concludes the proof of Lemma 5.3.
C.4 Proof of Lemma 5.4
Proof
Throughout the proof, we abbreviate w(β̂0, λ) as ŵ0. Our proof is under the null hypothesis where β* = [α*, (γ*)⊤]⊤ with α* = 0. Recall that w* is defined in (4.1). Then by the definitions of [Tn(β̂0)]α|γ and [I(β*)]α|γ in (2.11) and (4.2), we have
By triangle inequality, we have
| (C.42) |
Analysis of Term (i)
For term (i) in (C.42), by Lemma 2.1 and triangle inequality we have
| (C.43) |
For term (i).a in (C.43), by Condition 4.4 we have
| (C.44) |
Note that we have (β̂0)α = (β*)α = 0 by definition, which implies ‖β̂0 − β*‖1 ≤ ‖β̂ − β*‖1. Hence, by Condition 4.1 we have
| (C.45) |
Moreover, combining (C.44) and (C.45), by (4.6) in Assumption 4.5 we have
for λ specified in (4.5). Hence we obtain
| (C.46) |
Meanwhile, for term (i).b in (C.43) we have
| (C.47) |
where the second last equality follows from Condition 4.3, while the last equality holds because our assumption in (4.6) of Assumption 4.5 implies
for λ specified in (4.5).
Analysis of Term (ii)
For term (ii) in (C.42), by Lemma 2.1 and triangle inequality, we have
| (C.48) |
By Hölder’s inequality, term (ii).a in (C.48) is upper bounded by
By Lemma C.3, we have . Meanwhile, we have
where the second equality follows from (C.46) and (C.47). Therefore, term (ii).a is oℙ(1), since (4.6) in Assumption 4.5 implies . Meanwhile, by Hölder’s inequality, term (ii).b in (C.48) is upper bounded by
| (C.49) |
By Lemma C.3, we have . Meanwhile, we have 𝔼β* [Tn(β*)] = −I(β*) by Lemma 2.1. Furthermore, (4.4) in Assumption 4.5 implies
| (C.50) |
where C>0 is some absolute constant. Therefore, from (C.49) we have that term (ii).b in (C.48) is . By (4.6) in Assumption 4.5, we have . Thus, we conclude that term (ii).b is oℙ(1). For term (ii).c, we have
Here the first and second inequalities are from Hölder’s inequality and triangle inequality, the first equality follows from (C.46) and (C.47), and the second equality holds because (4.6) in Assumption 4.5 implies
for λ specified in (4.5).
Analysis of Term (iii)
For term (iii) in (C.42), by (2.12) in Lemma 2.1 we have
| (C.51) |
For term (iii).a in (C.51), we have
| (C.52) |
For ‖ŵ0‖1 we have , where the equality holds because by Lemma C.3 we have . Meanwhile, on the right-hand side of (C.52) we have
Here the last equality is from (C.46) and (C.47). Hence, term (iii).a in (C.51) is . Note that
From (4.6) in Assumption 4.5 we have, for λ specified in (4.5), terms (i)–(iv) are all upper bounded by . Hence, we conclude term (iii).a in (C.51) is oℙ(1). Also, for term (iii).b in (C.51), we have
where the last equality follows from Lemma C.3 and (C.50). Moreover, by (4.6) in Assumption 4.5 we have . Therefore, we conclude that term (iii).b in (C.51) is oℙ(1). Combining the above analysis for terms (i)–(iii) in (C.42), we obtain
Thus we conclude the proof of Lemma 5.4.
C.5 Proof of Lemma 5.5
Proof
Our proof strategy is as follows. Recall that w* is defined in (4.1). First we prove
| (C.53) |
Here note that [I(β*)]α|γ is defined in (4.2) and β̂ = [α̂, γ̂⊤]⊤ is the estimator attained by the high dimensional EM algorithm (Algorithm 4). Then we prove
| (C.54) |
Proof of (C.53)
For notational simplicity, we define
| (C.55) |
By the definition of Sn(·,·) in (2.9), we have S̅n (β̂, ŵ) = Sn (β̂, λ). Let β̃ = (α*, γ̂⊤)⊤. The Taylor expansion of S̅n (β̂, ŵ) takes the form
| (C.56) |
where β♯ is an intermediate value between β̂ and β̃. By (C.55) and the definition of Tn(·) in (2.8), we have
| (C.57) |
By (C.57) and the definition in (2.18), we have
Further by (C.56) we obtain
| (C.58) |
Analysis of Term (i)
For term (i) in (C.58), in the sequel we first prove
| (C.59) |
By the definition of [I(β*)]α|γ in (4.2) and the definition of w* in (4.1), we have
Together with (C.57), by triangle inequality we obtain
| (C.60) |
Note that term (i).a in (C.60) is upper bounded by
| (C.61) |
Here the inequality is from Lemma 2.1 and triangle inequality. For the first term on the right-hand side of (C.61), by Conditions 4.1 and 4.4 we have
| (C.62) |
For the second term on the right-hand side of (C.61), by Condition 4.3 we have
| (C.63) |
Plugging (C.62) and (C.63) into (C.61), we obtain
| (C.64) |
By (4.6) in Assumption 4.5, we have
for λ specified in (4.5). Hence, we conclude that term (i).a in (C.60) is o𝕡(1). Meanwhile, for term (i).b in (C.60), by triangle inequality we have
| (C.65) |
For the first term on the right-hand side of (C.65), we have
The inequality is from Hölder’s inequality, the second last equality is from Lemma C.3 and (C.64), and the last equality holds because (4.6) in Assumption 4.5 implies
for λ specified in (4.5). For the second term on the right-hand side of (C.65), we have
where the second inequality is from Lemma C.3, while the last equality follows from (4.4) and (4.6) in Assumption 4.5. For the third term on the right-hand side of (C.65), we have
Here the inequality is from Hölder’s inequality, the first equality is from (C.64) and the last equality holds because (4.6) in Assumption 4.5 implies
for λ specified in (4.5). By (C.65), we conclude that term (i).b in (C.60) is oℙ(1). Hence, we obtain (C.59). Furthermore, for term (i) in (C.58), we then replace β̂0 = (0, γ̂⊤)⊤ with β̃ = (α*, γ̂⊤)⊤ and ŵ0 = w(β̂0, λ) with ŵ = w(β̂, λ) in the proof of (C.29) in §C.3. We obtain
which further implies
by (C.59) and Slutsky’s theorem.
Analysis of Term (ii)
Now we prove that term (ii) in (C.58) is oℙ(1). We have
| (C.66) |
For term (ii).a in (C.66), by Condition 4.1 and (4.6) in Assumption 4.5 we have
| (C.67) |
Meanwhile, by replacing β̂ with β̃ = (α*, γ̂⊤)⊤ in the proof of (C.59) we obtain
| (C.68) |
Combining (C.59) and (C.68), for term (ii).b in (C.66) we obtain
which together with (C.67) implies that term (ii) in (C.58) is oℙ(1). Combining the above results for terms (i) and (ii) in (C.58), we obtain (C.53).
Proof of (C.54)
By (C.30) in §C.3, we have
which implies (C.54). Finally, combining (C.53) and (C.54) with Slutsky’s theorem, we obtain
Thus we conclude the proof of Lemma 5.5.
C.6 Proof of Lemma 4.8
Proof
According to Algorithm 4, the final estimator β̂=β(T) has ŝ nonzero entries. Meanwhile, we have ‖β*‖0 = s* ≤ ŝ. Hence, we have . Invoking (3.19) in Theorem 3.7, we obtain ζEM.
For the Gaussian mixture model, the maximization implementation of the M-step takes the form
where ωβ(·) is defined in (A.1). Meanwhile, we have
Hence, we have ‖Mn(β) − M(β)‖∞ = ‖∇1Qn(β; β)−∇1Q(β; β)‖∞. By setting δ = 2/d in Lemma 3.6, we obtain ζG.
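To make the preceding M-step concrete, the following is a minimal numerical sketch in Python. The closed-form update Mn(β) = (1/n) ∑i (2 · ωβ(yi) − 1) · yi and the logistic weight ωβ(y) = 1/(1 + exp(−2⟨β, y⟩/σ2)) are assumptions consistent with the symmetric two-component Gaussian mixture; the authoritative definitions are (A.1) and the display above, which are not reproduced in this appendix.

import numpy as np

def m_step_gmm(Y, beta, sigma):
    # Assumed maximization M-step for the symmetric two-component Gaussian
    # mixture: M_n(beta) = (1/n) * sum_i (2 * omega_beta(y_i) - 1) * y_i,
    # with omega_beta(y) = 1 / (1 + exp(-2 <beta, y> / sigma^2)).
    w = 1.0 / (1.0 + np.exp(-2.0 * (Y @ beta) / sigma ** 2))
    return ((2.0 * w - 1.0)[:, None] * Y).mean(axis=0)

# Toy usage: simulate y_i = z_i * beta_star + noise and take one M-step.
rng = np.random.default_rng(0)
n, d, sigma = 2000, 10, 1.0
beta_star = np.zeros(d); beta_star[:3] = 1.0
z = rng.choice([-1.0, 1.0], size=n)
Y = z[:, None] * beta_star + sigma * rng.standard_normal((n, d))
print(m_step_gmm(Y, 0.5 * beta_star, sigma))  # the update moves toward beta_star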
C.7 Proof of Lemma 4.9
Proof
Recall that for the Gaussian mixture model we have
where ωβ(·) is defined in (A.1). Hence, by calculation we have
| (C.69) |
| (C.70) |
For notational simplicity we define
| (C.71) |
Then by the definition of Tn(·) in (2.8), from (C.69) and (C.70) we have
Applying the symmetrization result in Lemma D.4 to the right-hand side, we have that for u > 0,
| (C.72) |
where ξ1, …, ξn are i.i.d. Rademacher random variables that are independent of y1, …, yn. Then we invoke the contraction result in Lemma D.5 by setting
where u is the variable of the moment generating function in (C.72). By the definition in (C.71) we have |νβ*(yi)| ≤ 4/σ2, which implies
Therefore, applying the contraction result in Lemma D.5 to the right-hand side of (C.72), we obtain
| (C.73) |
Note that 𝔼β*(ξi · yi,j · yi,k) = 0, since ξi is a Rademacher random variable independent of yi,j · yi,k. Recall that in the Gaussian mixture model we have yi = zi · β* + υi, where zi is a Rademacher random variable and υi,j ~ N(0, σ2). Hence, by Example 5.8 in Vershynin (2010), we have ‖zi · β*j‖ψ2 ≤ C · |β*j| and ‖υi,j‖ψ2 ≤ C · σ. Therefore, by Lemma D.1 we have
| (C.74) |
Since |ξi · yi,j · yi,k|=|yi,j · yi,k|, by definition ξi · yi,j · yi,k and yi,j · yi,k have the same ψ1-norm. By Lemma D.2 we have
Then by Lemma 5.15 in Vershynin (2010), we obtain
| (C.75) |
for all . Note that on the right-hand side of (C.73), we have
| (C.76) |
By plugging (C.75) into the right-hand side of (C.76) with u′ = u · 4/(σ2 · n) and u′ = −u · 4/(σ2 · n), from (C.73) we have that for any j, k ∈ {1, …, d},
| (C.77) |
Therefore, by Chernoff bound we have that, for all υ > 0 and ,
where the last inequality is from (C.77). By minimizing its right-hand side over u, we conclude that for ,
Setting the right-hand side to be δ, we have that
holds with probability at least 1 − δ. By setting δ = 2/d, we conclude the proof of Lemma 4.9.
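For readers tracing the last three displays, the Chernoff step has the following generic form; the constants C and c below are placeholders rather than the ones implicit in (C.77). Suppose that 𝔼[exp(u · X)] ≤ exp(C · u2/n) for all 0 < u ≤ c · n. Then
\[
\mathbb{P}(X \ge \upsilon) \;\le\; e^{-u\upsilon}\,\mathbb{E}\big[e^{uX}\big] \;\le\; \exp\big(-u\upsilon + C u^{2}/n\big),
\]
and minimizing the exponent at u = n · υ/(2C) (valid whenever υ ≤ 2 · C · c) gives
\[
\mathbb{P}(X \ge \upsilon) \;\le\; \exp\big(-n\upsilon^{2}/(4C)\big).
\]
Combining this with a union bound over the index pairs (j, k) and setting the total probability to δ yields a deviation level of order √(log(d2/δ)/n), which is of order √(log d/n) for δ = 2/d.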
C.8 Proof of Lemma 4.10
Proof
For any j, k ∈ {1, …, d}, by the mean-value theorem we have
| (C.78) |
where β♯ is an intermediate value between β and β*. According to (C.69), (C.70) and the definition of Tn(·) in (2.8), by calculation we have
where
| (C.79) |
For notational simplicity, we define the following event
where τ > 0 will be specified later. By maximal inequality we have
| (C.80) |
Let ℰ̅ be the complement of ℰ. On the right-hand side of (C.80) we have
| (C.81) |
Analysis of Term (i)
For term (i) in (C.81), we have
where the last inequality is from union bound. By (C.79) we have |ν̅β♯(yi)|≤16/σ4. Thus we obtain
Taking υ = 16 · τ3/σ4, we have that the right-hand side is zero and hence term (i) in (C.81) is zero.
Analysis of Term (ii)
Meanwhile, for term (ii) in (C.81) by maximal inequality we have
Furthermore, by (C.74) in the proof of Lemma 4.9, we have that yi,j is sub-Gaussian with . Therefore, by Lemma 5.5 in Vershynin (2010) we have
To ensure the right-hand side is upper bounded by δ, we set τ to be
| (C.82) |
Finally, by (C.80), (C.81) and maximal inequality we have
for υ = 16 · τ3/σ4 with τ specified in (C.82). By setting δ = 2 · d−4 and plugging (C.82) into (C.78), we conclude the proof of Lemma 4.10.
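A quick Monte Carlo check of the truncation step above: with τ of the order σ̃ · √(log(n · d/δ)) for σ̃2 = ‖β*‖∞2 + σ2, the event ℰ̅ = {max i,j |yi,j| > τ} essentially never occurs. The constant 2 and this particular form of τ are illustrative stand-ins for (C.82), which is not reproduced here.

import numpy as np

# Empirical frequency of the complement event E-bar = {max_{i,j} |y_{i,j}| > tau}
# under the Gaussian mixture model y_i = z_i * beta_star + noise.
rng = np.random.default_rng(1)
n, d, sigma = 500, 200, 1.0
delta = 2.0 * d ** -4                                    # delta = 2 * d^{-4}
beta_star = np.zeros(d); beta_star[:5] = 1.0
scale = np.sqrt(np.max(np.abs(beta_star)) ** 2 + sigma ** 2)
tau = 2.0 * scale * np.sqrt(np.log(n * d / delta))       # illustrative constant 2

reps, fails = 200, 0
for _ in range(reps):
    z = rng.choice([-1.0, 1.0], size=n)
    Y = z[:, None] * beta_star + sigma * rng.standard_normal((n, d))
    fails += np.abs(Y).max() > tau
print(f"empirical frequency of E-bar: {fails / reps:.3f}")  # close to zero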
C.9 Proof of Lemma 4.12
Proof
By the same argument as in the proof of Lemma 4.8 in §C.6, we obtain ζEM by invoking Theorem 3.10. To obtain ζG, note that for the gradient implementation of the M-step (Algorithm 3), we have Mn(β*) − M(β*) = η · [∇1Qn(β*; β*) − ∇1Q(β*; β*)].
Hence, we obtain ‖∇1Qn(β*; β*) − ∇1Q(β*; β*)‖∞ = 1/η · ‖Mn (β*) − M(β*)‖ ∞. Setting δ = 4/d and s = s* in (3.22) of Lemma 3.9, we then obtain ζG.
C.10 Proof of Lemma 4.13
Proof
Recall that for the mixture of regression model we have
where ωβ(·) is defined in (A.3). Hence, by calculation we have
| (C.83) |
| (C.84) |
For notational simplicity we define
| (C.85) |
Then by the definition of Tn(·) in (2.8), from (C.83) and (C.84) we have
Applying the symmetrization result in Lemma D.4 to the right-hand side, we have that for u > 0,
| (C.86) |
where ξ1, …, ξn are i.i.d. Rademacher random variables, which are independent of x1, …, xn and y1, …, yn. Then we invoke the contraction result in Lemma D.5 by setting
where u is the variable of the moment generating function in (C.86). By the definition in (C.85) we have |νβ*(xi, yi)| ≤ 4/σ2, which implies
Therefore, applying Lemma D.5 to the right-hand side of (C.86), we obtain
| (C.87) |
For notational simplicity, we define the following event
Let ℰ̅ be the complement of ℰ. We consider the following tail probability
| (C.88) |
Analysis of Term (i)
For term (i) in (C.88), we have
Here note that , because ξi is a Rademacher random variable independent of xi and yi. Recall that for the mixture of regression model we have yi = zi · 〈β*, xi〉 + υi, where zi is a Rademacher random variable, xi ~ N(0, Id) and υi ~ N(0, σ2). According to Example 5.8 in Vershynin (2010), we have ‖zi · 〈β*, xi〉 · 𝟙 {‖xi‖∞ ≤ τ}‖ψ2 = ‖〈β*, xi〉 · 𝟙 {‖xi‖∞ ≤ τ}‖ψ2 ≤ τ · ‖β*‖1 and ‖υi · 𝟙 {‖xi‖∞ ≤ τ}‖ψ2 ≤ ‖υi‖ψ2 ≤ C · σ. Hence, by Lemma D.1 we have
| (C.89) |
By the definition of ψ1-norm, we have . Further by applying Lemma D.2 to its right-hand side with Z1 = Z2 = yi · 𝟙 {‖xi‖∞≤τ}, we obtain
where the last inequality follows from (C.89). Therefore, by Bernstein’s inequality (Proposition 5.16 in Vershynin (2010)), we obtain
| (C.90) |
for all and a sufficiently large sample size n.
Analysis of Term (ii)
For term (ii) in (C.88), by union bound we have
Moreover, xi,j is sub-Gaussian with ‖xi,j‖ψ2 = C. Thus, by Lemma 5.5 in Vershynin (2010) we have
| (C.91) |
Plugging (C.90) and (C.91) into (C.88), we obtain
| (C.92) |
Note that (C.87) is obtained by applying Lemmas D.4 and D.5 with ϕ(υ) = exp(u·υ). Since Lemmas D.4 and D.5 allow any increasing convex function ϕ(·), the analog of (C.87) holds for every such ϕ(·). Hence, applying Panchenko’s theorem in Lemma D.6 to (C.87), from (C.92) we have
Furthermore, by union bound we have
| (C.93) |
To ensure the right-hand side is upper bounded by δ, we set the second term on the right-hand side of (C.93) to be δ/2. Then we obtain
Requiring the first term on the right-hand side of (C.93) to be upper bounded by δ/2 and plugging in τ, we then obtain
Therefore, by setting δ = 4 · e/d we have that
holds with probability at least 1 − 4 · e/d, which completes the proof of Lemma 4.13.
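As a companion to the Gaussian-mixture sketch in §C.6, here is a hedged numerical sketch of the mixture-of-regression weight ωβ(x, y) used throughout this proof, together with one maximization M-step built from it. The logistic form ωβ(x, y) = 1/(1 + exp(−2 · y · ⟨β, x⟩/σ2)) and the weighted-least-squares form of the M-step are assumptions consistent with the symmetric mixture of regression model; the authoritative definitions are (A.3) and the display recalled at the start of this proof, which are not reproduced here.

import numpy as np

def omega(X, y, beta, sigma):
    # Assumed form of omega_beta(x, y): posterior probability of the positive
    # label in the symmetric mixture of regression model.
    return 1.0 / (1.0 + np.exp(-2.0 * y * (X @ beta) / sigma ** 2))

def m_step_mr(X, y, beta, sigma):
    # Assumed maximization M-step: solve the weighted least-squares problem
    # M_n(beta) = ((1/n) sum_i x_i x_i^T)^{-1} (1/n) sum_i (2*omega - 1) y_i x_i.
    w = 2.0 * omega(X, y, beta, sigma) - 1.0
    A = X.T @ X / len(y)
    b = X.T @ (w * y) / len(y)
    return np.linalg.solve(A, b)

# Toy usage on simulated data y_i = z_i * <beta_star, x_i> + noise.
rng = np.random.default_rng(3)
n, d, sigma = 2000, 10, 0.5
beta_star = np.zeros(d); beta_star[:3] = 1.0
X = rng.standard_normal((n, d))
z = rng.choice([-1.0, 1.0], size=n)
y = z * (X @ beta_star) + sigma * rng.standard_normal(n)
print(m_step_mr(X, y, 0.5 * beta_star, sigma))  # the update moves toward beta_star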
C.11 Proof of Lemma 4.14
Proof
For any j, k ∈ {1, …, d}, by the mean-value theorem we have
| (C.94) |
where β♯ is an intermediate value between β and β*. According to (C.83), (C.84) and the definition of Tn(·) in (2.8), by calculation we have
where
| (C.95) |
For notational simplicity, we define the following events
where τ > 0 and τ′ > 0 will be specified later. By union bound we have
| (C.96) |
Let ℰ̅ and ℰ̅′ be the complement of ℰ and ℰ′ respectively. On the right-hand side we have
| (C.97) |
Analysis of Term (i)
For term (i) in (C.97), we have
To avoid confusion, note that υi is the noise in the mixture of regression model, while υ is the variable appearing in the tail bound. By applying union bound to the right-hand side of the above inequality, we have
By (C.95) we have |ν̅β♯(xi, yi)| ≤ 16/σ4. Hence, we obtain
| (C.98) |
Recall that in the mixture of regression model we have yi = zi · 〈β*, xi〉 + υi, where zi is a Rademacher random variable, xi ~ N(0, Id) and υi ~ N(0, σ2). Hence, we have
Taking υ = 16 · (τ · ‖β*‖1 + τ′)3 · τ3/σ4, we have that the right-hand side of (C.98) is zero. Hence term (i) in (C.97) is zero.
Analysis of Term (ii)
For term (ii) in (C.97), by union bound we have
Moreover, we have that xi,j is sub-Gaussian with ‖xi,j‖ψ2 = C. Therefore, by Lemma 5.5 in Vershynin (2010) we have
Analysis of Term (iii)
Since υi is sub-Gaussian with ‖υi‖ψ2 = C · σ, by Lemma 5.5 in Vershynin (2010) and union bound, for term (iii) in (C.97) we have
To ensure the right-hand side of (C.97) is upper bounded by δ, we set τ and τ′ to be
| (C.99) |
so that terms (ii) and (iii) are upper bounded by δ/2, respectively. Finally, by (C.96), (C.97) and union bound we have
for υ = 16 · (τ · ‖β*‖1 + τ′)3 · τ3/σ4 with τ and τ′ specified in (C.99). Then by setting δ = 4 · d−4 and plugging it into (C.99), we have
which together with (C.94) concludes the proof of Lemma 4.14.
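For reference, the mean-value step invoked in (C.78) and (C.94) is, in generic form, the following entrywise bound; stating it this way is an assumption about the form of those displays, which are not reproduced above.
\[
\big|[T_n(\beta)]_{j,k} - [T_n(\beta^{*})]_{j,k}\big|
= \big|\big\langle \nabla_{\beta}\,[T_n(\beta^{\sharp})]_{j,k},\ \beta - \beta^{*}\big\rangle\big|
\le \big\|\nabla_{\beta}\,[T_n(\beta^{\sharp})]_{j,k}\big\|_{\infty}\cdot\|\beta-\beta^{*}\|_{1},
\]
where the equality is the mean-value theorem applied to the scalar function β ↦ [Tn(β)]j,k and the inequality is Hölder’s inequality; the subsequent analysis then bounds the gradient’s sup-norm via the truncation argument.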
D Auxiliary Results
In this section, we lay out several auxiliary lemmas. Lemmas D.1–D.3 provide useful properties of sub-Gaussian random variables. Lemmas D.4 and D.5 establish the symmetrization and contraction results. Lemma D.6 is Panchenko’s theorem. For more details of these results, see Vershynin (2010); Boucheron et al. (2013).
Lemma D.1
Let Z1, …, Zk be k independent zero-mean sub-Gaussian random variables. Then Z1 + ⋯ + Zk is sub-Gaussian with ‖Z1 + ⋯ + Zk‖ψ2² ≤ C · (‖Z1‖ψ2² + ⋯ + ‖Zk‖ψ2²), where C > 0 is an absolute constant.
Lemma D.2
Let Z1 and Z2 be two sub-Gaussian random variables. Then Z1 · Z2 is a sub-exponential random variable with ‖Z1 · Z2‖ψ1 ≤ C · ‖Z1‖ψ2 · ‖Z2‖ψ2, where C > 0 is an absolute constant.
Lemma D.3
If Z is sub-Gaussian or sub-exponential, then it holds that ‖Z − 𝔼Z‖ψ2 ≤ 2 · ‖Z‖ψ2 or ‖Z − 𝔼Z‖ψ1 ≤ 2 · ‖Z‖ψ1, respectively.
Lemma D.4
Let z1, …, zn be n independent realizations of the random vector Z ∈ 𝒵 and let ℱ be a function class defined on 𝒵. For any increasing convex function ϕ(·) we have
where ξ1, …, ξn are i.i.d. Rademacher random variables that are independent of z1, …, zn.
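A small Monte Carlo illustration of Lemma D.4, assuming it takes the standard symmetrization form 𝔼[ϕ(sup f∈ℱ |(1/n) · ∑i (f(zi) − 𝔼 f(Z))|)] ≤ 𝔼[ϕ(2 · sup f∈ℱ |(1/n) · ∑i ξi · f(zi)|)]. Here ϕ(t) = t (increasing and convex) and ℱ is the finite class fj(z) = zj for j = 1, …, d; both choices are illustrative.

import numpy as np

# Compare E sup_j |(1/n) sum_i (z_{i,j} - E z_j)| with
# 2 * E sup_j |(1/n) sum_i xi_i * z_{i,j}| by simulation.
rng = np.random.default_rng(2)
n, d, reps = 100, 50, 2000
mu = np.linspace(-1.0, 1.0, d)          # E[f_j(Z)] = mu_j
lhs = rhs = 0.0
for _ in range(reps):
    Z = mu + rng.standard_normal((n, d))
    xi = rng.choice([-1.0, 1.0], size=n)
    lhs += np.abs(Z.mean(axis=0) - mu).max()
    rhs += np.abs((xi[:, None] * Z).mean(axis=0)).max()
print(f"centered sup-norm: {lhs / reps:.4f}  <=  symmetrized bound: {2 * rhs / reps:.4f}")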
Lemma D.5
Let z1, …, zn be n independent realizations of the random vector Z ∈ 𝒵 and let ℱ be a function class defined on 𝒵. We consider Lipschitz functions ψi(·) (i = 1, …, n) that satisfy
and ψi(0) = 0. For any increasing convex function ϕ(·) we have
where ξ1, …, ξn are i.i.d. Rademacher random variables that are independent of z1, …, zn.
Lemma D.6
Suppose that Z1 and Z2 are two random variables that satisfy 𝔼[ϕ(Z1)] ≤ 𝔼[ϕ(Z2)] for any increasing convex function ϕ(·). Assuming that ℙ(Z2 ≥ υ) ≤ C · exp(−C′ · υα) (α ≥ 1) holds for all υ ≥ 0, we have ℙ(Z1 ≥ υ) ≤ C · exp(1 − C′ · υα).
References
- Anandkumar A, Ge R, Hsu D, Kakade SM, Telgarsky M. Tensor decompositions for learning latent variable models. Journal of Machine Learning Research. 2014a;15:2773–2832.
- Anandkumar A, Ge R, Janzamin M. Analyzing tensor power method dynamics: Applications to learning overcomplete latent variable models. arXiv preprint arXiv:1411.1488. 2014b.
- Anandkumar A, Ge R, Janzamin M. Provable learning of overcomplete latent variable models: Semi-supervised and unsupervised settings. arXiv preprint arXiv:1408.0553. 2014c.
- Balakrishnan S, Wainwright MJ, Yu B. Statistical guarantees for the EM algorithm: From population to sample-based analysis. arXiv preprint arXiv:1408.2156. 2014.
- Bartholomew DJ, Knott M, Moustaki I. Latent variable models and factor analysis: A unified approach. Vol. 899. Wiley; 2011.
- Belloni A, Chen D, Chernozhukov V, Hansen C. Sparse models and methods for optimal instruments with an application to eminent domain. Econometrica. 2012;80:2369–2429.
- Belloni A, Chernozhukov V, Hansen C. Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies. 2014;81:608–650.
- Belloni A, Chernozhukov V, Wei Y. Honest confidence regions for a regression parameter in logistic regression with a large number of controls. arXiv preprint arXiv:1304.3969. 2013.
- Bickel PJ. One-step Huber estimates in the linear model. Journal of the American Statistical Association. 1975;70:428–434.
- Bickel PJ, Ritov Y, Tsybakov AB. Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics. 2009;37:1705–1732.
- Boucheron S, Lugosi G, Massart P. Concentration inequalities: A nonasymptotic theory of independence. Oxford University Press; 2013.
- Cai T, Liu W, Luo X. A constrained ℓ1 minimization approach to sparse precision matrix estimation. Journal of the American Statistical Association. 2011;106:594–607.
- Candès E, Tao T. The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics. 2007;35:2313–2351.
- Chaganty AT, Liang P. Spectral experts for estimating mixtures of linear regressions. arXiv preprint arXiv:1306.3729. 2013.
- Chaudhuri K, Dasgupta S, Vattani A. Learning mixtures of Gaussians using the k-means algorithm. arXiv preprint arXiv:0912.0086. 2009.
- Chrétien S, Hero AO. On EM algorithms and their proximal generalizations. ESAIM: Probability and Statistics. 2008;12:308–326.
- Dasgupta S, Schulman L. A probabilistic analysis of EM for mixtures of separated, spherical Gaussians. Journal of Machine Learning Research. 2007;8:203–226.
- Dasgupta S, Schulman LJ. A two-round variant of EM for Gaussian mixtures. In: Uncertainty in Artificial Intelligence; 2000.
- Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society. Series B (Statistical Methodology). 1977;39:1–38.
- Javanmard A, Montanari A. Confidence intervals and hypothesis testing for high-dimensional regression. Journal of Machine Learning Research. 2014;15:2869–2909.
- Khalili A, Chen J. Variable selection in finite mixture of regression models. Journal of the American Statistical Association. 2007;102:1025–1038.
- Knight K, Fu W. Asymptotics for Lasso-type estimators. The Annals of Statistics. 2000;28:1356–1378.
- Lee JD, Sun DL, Sun Y, Taylor JE. Exact inference after model selection via the Lasso. arXiv preprint arXiv:1311.6238. 2013.
- Lockhart R, Taylor J, Tibshirani RJ, Tibshirani R. A significance test for the Lasso. The Annals of Statistics. 2014;42:413–468. doi: 10.1214/13-AOS1175.
- Loh P-L, Wainwright MJ. Corrupted and missing predictors: Minimax bounds for high-dimensional linear regression. In: IEEE International Symposium on Information Theory; 2012.
- McLachlan G, Krishnan T. The EM algorithm and extensions. Vol. 382. Wiley; 2007.
- Meinshausen N, Bühlmann P. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2010;72:417–473.
- Meinshausen N, Meier L, Bühlmann P. p-values for high-dimensional regression. Journal of the American Statistical Association. 2009;104:1671–1681.
- Nesterov Y. Introductory lectures on convex optimization: A basic course. Vol. 87. Springer; 2004.
- Nickl R, van de Geer S. Confidence sets in sparse regression. The Annals of Statistics. 2013;41:2852–2876.
- Städler N, Bühlmann P, van de Geer S. ℓ1-penalization for mixture regression models. TEST. 2010;19:209–256.
- Taylor J, Lockhart R, Tibshirani RJ, Tibshirani R. Post-selection adaptive inference for least angle regression and the Lasso. arXiv preprint arXiv:1401.3889. 2014.
- Tseng P. An analysis of the EM algorithm and entropy-like proximal point methods. Mathematics of Operations Research. 2004;29:27–44.
- van de Geer S, Bühlmann P, Ritov Y, Dezeure R. On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics. 2014;42:1166–1202.
- van der Vaart AW. Asymptotic statistics. Vol. 3. Cambridge University Press; 2000.
- Vershynin R. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027. 2010.
- Wasserman L, Roeder K. High-dimensional variable selection. The Annals of Statistics. 2009;37:2178–2201. doi: 10.1214/08-aos646.
- Wu CFJ. On the convergence properties of the EM algorithm. The Annals of Statistics. 1983;11:95–103.
- Yi X, Caramanis C, Sanghavi S. Alternating minimization for mixed linear regression. arXiv preprint arXiv:1310.3745. 2013.
- Yuan X-T, Li P, Zhang T. Gradient hard thresholding pursuit for sparsity-constrained optimization. arXiv preprint arXiv:1311.5750. 2013.
- Yuan X-T, Zhang T. Truncated power method for sparse eigenvalue problems. Journal of Machine Learning Research. 2013;14:899–925.
- Zhang C-H, Zhang SS. Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2014;76:217–242.


