A proximal distance algorithm for likelihood-based sparse covariance estimation

JASON XU; KENNETH LANGE

doi:10.1093/biomet/asac011

Author manuscript; available in PMC: 2023 Dec 13.

Published in final edited form as: Biometrika. 2022 Feb 16;109(4):1047–1066. doi: 10.1093/biomet/asac011


PMCID: PMC10716840 NIHMSID: NIHMS1942701 PMID: 38094986

Summary

This paper addresses the task of estimating a covariance matrix under a patternless sparsity assumption. In contrast to existing approaches based on thresholding or shrinkage penalties, we propose a likelihood-based method that regularizes the distance from the covariance estimate to a symmetric sparsity set. This formulation avoids unwanted shrinkage induced by more common norm penalties, and enables optimization of the resulting nonconvex objective by solving a sequence of smooth, unconstrained subproblems. These subproblems are generated and solved via the proximal distance version of the majorization-minimization principle. The resulting algorithm executes rapidly, gracefully handles settings where the number of parameters exceeds the number of cases, yields a positive-definite solution, and enjoys desirable convergence properties. Empirically, we demonstrate that our approach outperforms competing methods across several metrics, for a suite of simulated experiments. Its merits are illustrated on international migration data and a case study on flow cytometry. Our findings suggest that the marginal and conditional dependency networks for the cell signalling data are more similar than previously concluded.

Some key words: Distance-to-set penalty, Majorization-minimization, Penalized likelihood, Proximal algorithm, Sequential unconstrained minimization, Sparse estimation

1. Introduction

The task of estimating a covariance matrix from randomly sampled data is central in multivariate analysis. Unfortunately, estimation is complicated by several statistical and computational obstacles. Chief among the latter is the quadratic growth of the number of free parameters in the number of features $p$ . If $n$ is the number of cases, it is known that the sample covariance estimator degrades as the ratio $p / n$ increases (Stein, 1956) and becomes singular as soon as $p > n$ . A more subtle difficulty lies in producing good estimators that maintain positive definiteness. Most approaches seek to mitigate the curse of dimensionality by imposing parsimony through assumptions on the size and structure of the effective parameters, a strategy that has proven successful in many applications. In this paper, we focus on the setting where the covariance matrix satisfies a patternless sparsity assumption. Here sparsity has an important interpretation, namely that zero entries in the covariance matrix encode marginal independence between features.

Since the work of Stein (1956), covariance estimation has remained an active area of research. Many regularized estimators have been proposed to achieve sparsity; Pourahmadi (2011), Chi & Lange (2014) and Fan et al. (2016) provide excellent overviews. Some researchers assume a known ordering of variables; such estimators based on tapering, banding or the Cholesky decomposition generally are sensitive to permutations of the features (Wu & Pourahmadi, 2003; Bickel & Levina, 2008b; Levina et al., 2008; Cai et al., 2010; Bien et al., 2016). When no natural ordering is available, a simple tactic involves thresholding the sample covariance matrix by setting small entries to zero (Bickel & Levina, 2008a; Karoui, 2008; Rothman et al., 2009; Cai & Liu, 2011). Although such elementwise operations straightforwardly induce sparsity, it is well documented that the resulting estimator is not always positive definite. Related Frobenius norm-based approaches include an additional log-barrier term (Rothman, 2012) or appeal to alternating-directions methods (Xue et al., 2012) to enforce positive definiteness. Similar methods have been developed for sparse correlation estimation (Cui et al., 2016). In general, great care must be taken in selecting thresholding constants to ensure positive definiteness. In many cases, the appropriate range is too narrow to induce an effective amount of sparsity (Azose & Raftery, 2018).

Penalized likelihood techniques offer an alternative to thresholding and are arguably the preferred method for estimating sparse precision, or inverse covariance, matrices (Yuan & Lin, 2007; Molstad & Rothman, 2018). Sparsity carries a different interpretation here: zero entries in the precision matrix encode conditional rather than marginal independence. In this case, the negative Gaussian loglikelihood is convex, which not only ensures that minimizers are global optima, but also enables the use of fast algorithms, such as the graphical lasso (Friedman et al., 2008), that make estimation easy under convex penalties such as an $ℓ_{1}$ -norm term. Lasso penalization also comes with disadvantages such as shrinkage towards the origin, which may lead to biased estimates and the inclusion of spurious predictors.

Penalized likelihood estimation is decidedly more difficult when one seeks to estimate a sparse covariance matrix. Because the negative loglikelihood in $Σ$ is no longer convex, significant computational difficulties arise. These challenges may explain in part the smaller literature on this problem relative to precision estimation. Lam & Fan (2009) studied the properties of $ℓ_{1}$ -penalized covariance estimation, and Bien & Tibshirani (2011) proposed a majorization-minimization algorithm that makes use of generalized gradient descent. With the latter approach, convergence hinges on imposing a Lipschitz differentiability assumption that is realized by restricting the space to a subset of the positive-definite cone. In practice, this restriction introduces an additional inner optimization subproblem, which is more cumbersome to implement and may be numerically unstable even in moderate dimensions. Step-size selection can precipitate a delicate trade-off between stability and practical rates of convergence. Azose & Raftery (2018) built upon this work to propose a method for maximum a posteriori estimation that faces similar challenges. They reported that cross-validation on a problem with $n = 12$ and dimension $p \approx 200$ already became computationally impractical.

In the present paper, we revisit the penalized likelihood framework for sparse covariance estimation under a distance-to-set penalty in place of a norm penalty. In prior work, such distance penalties have proven effective in contexts such as generalized linear regression under both rank and sparsity constraints (Xu et al., 2017). Our penalization keeps parameter estimates close to the sparse constraint set while restricting estimates to the positive-definite cone. Neither additional assumptions on the structure of the covariance matrix nor prior knowledge of the location of zero entries need to be imposed (Chaudhuri et al., 2007). Our method thus performs model selection while delivering a positive-definite estimate of the covariance matrix, avoiding the systematic shrinkage engendered by convex norm penalties.

Distance penalization also confers significant computational advantages. We develop a proximal distance algorithm that effectively solves the nonconvex optimization problem. Like Bien & Tibshirani (2011), we employ the majorization-minimization principle. Our algorithm enjoys a descent property as it converges to a stationary point of the objective, automatically selects a good step size, and yields closed-form solutions to its subproblems. The algorithm tends to converge quickly because the underlying surrogate functions tightly approximate the likelihood. These advantages are illustrated by simulation studies and applications to real data on cell signalling and international migration.

2. Background and Penalized Formulation

Consider estimation of the covariance matrix $Σ$ given $n$ independent and identically distributed random vectors $X_{1}, \dots, X_{n} \sim N_{p} (0, Σ)$ . Without loss of generality, we focus on the zero-mean case and estimation of $Σ$ alone. In this scenario, the loglikelihood of the data is

L (Σ) = - \frac{n}{2} log det Σ - \frac{n}{2} tr (Σ^{- 1} S),

(1)

where $S = n^{- 1} \sum_{i = 1}^{n} X_{i} X_{i}^{T}$ denotes the sample covariance matrix. When the data are weakly dependent or non-Gaussian, estimation may still proceed on the basis of $L (Σ)$ , provided $L (Σ)$ is interpreted as a quasilikelihood. It is desirable for an estimator of $Σ$ to be positive definite; previous work has achieved this by imposing the domain constraint $Σ ≻ 0$ . Alternatively, we may set $L (Σ) = - \infty$ whenever $Σ$ fails to be positive definite.

We seek to maximize (1) subject to the assumption that many of the entries in $Σ$ are zero. Accordingly, let $K$ denote the number of nonzero entries in the upper triangle, and denote by ${‖Σ‖}_{0}$ the number of nonzero off-diagonal entries in $Σ$ . Sparse estimation of $Σ$ can then be formally cast as the constrained optimization problem of minimizing

f (Σ) = log det Σ + tr (Σ^{- 1} S)

(2)

subject to $Σ ≻ 0$ and $Σ$ belonging to the sparsity set

C = \{Σ \in ℝ^{p \times p} : Σ = Σ^{T}, {‖Σ‖}_{0} ⩽ 2 K\} .

(3)

Here the diagonal entries of $Σ$ are unconstrained.
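As a minimal sketch of criterion (2) and the set (3), the following Python code evaluates $f (Σ)$ and tests membership in $C$ . The function names `neg_loglik` and `in_sparsity_set` and the simulated data are illustrative assumptions, not part of the authors' software; positive definiteness is checked through a Cholesky factorization, with $f (Σ) = \infty$ outside the cone.

```python
import numpy as np

def neg_loglik(Sigma, S):
    """f(Sigma) = log det Sigma + tr(Sigma^{-1} S); +inf outside the positive-definite cone."""
    try:
        L = np.linalg.cholesky(Sigma)  # raises LinAlgError when Sigma is not positive definite
    except np.linalg.LinAlgError:
        return np.inf
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return logdet + np.trace(np.linalg.solve(Sigma, S))

def in_sparsity_set(Sigma, K):
    """Membership in C: symmetric with at most 2K nonzero off-diagonal entries."""
    off = Sigma - np.diag(np.diag(Sigma))
    return bool(np.allclose(Sigma, Sigma.T) and np.count_nonzero(off) <= 2 * K)

# Hypothetical data: n = 50 cases, p = 4 features.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))
S = X.T @ X / X.shape[0]

value_at_S = neg_loglik(S, S)            # equals log det S + p
value_outside = neg_loglik(-np.eye(4), S)  # infinite, since -I is not positive definite
```

At $Σ = S$ the trace term equals $p$ , which gives a quick sanity check on any implementation.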

Directly minimizing criterion (2) is challenging. Indeed, letting $m = \binom{p}{2}$ , there are $\binom{m}{K}$ different sparsity patterns for a model with $K$ effective parameters. Hence, even ignoring the positive-definiteness constraint, optimizing $f (Σ)$ over $C$ quickly becomes combinatorially intractable. As a practical alternative, one can include a lasso penalty regularizing the $ℓ_{1}$ -norm of a function of $Σ$ . Convex relaxation of the $ℓ_{0}$ constraint appearing in $C$ in this fashion provides a viable means of promoting sparsity by proxy. For covariance estimation, Bien & Tibshirani (2011) considered such a penalty applied to $A \circ Σ$ , where $A$ has nonnegative entries interpretable as weights and $\circ$ denotes the Hadamard or elementwise product. The resulting optimization problem

minimize \{log det Σ + tr (Σ^{- 1} S) + λ {‖A \circ Σ‖}_{1}\} subject to Σ ≻ 0

(4)

remains nontrivial. This nonconvex objective equals the difference of two convex functions. Exploiting this structure, Bien & Tibshirani (2011) proposed a majorization-minimization algorithm, which is described in the next section.

Including a lasso penalty as a proxy for the sparsity constraint entails shrinking the solution globally towards the origin. Such shrinkage biases parameter estimates towards zero and tends to produce false positives. Nonetheless, several advantages have made the approach popular. Lasso penalties are convex, and their inclusion for solving convex objectives not only admits unique minimizers, but also enables the use of existing fast algorithms to find their solutions (Friedman et al., 2008). Unfortunately, as the covariance likelihood is already nonconvex, adding an $ℓ_{1}$ penalty in covariance estimation does not yield a convex objective and does not enforce positive definiteness. The remedy of embedding an inner iterative algorithm, such as an alternating-directions method, within an outer gradient descent algorithm, is often slow and unstable. Failures of positive definiteness also beset simple thresholding approaches (Rothman et al., 2009; Rothman, 2012), and similar remedies in this context are subject to the same criticisms (Xue et al., 2012).

As an alternative to solving problem (2), we propose minimizing the penalized objective

h_{ρ} (Σ) = log det Σ + tr (Σ^{- 1} S) + \frac{ρ}{2} dist {(Σ, C)}^{2},

(5)

where $dist (Σ, C) = inf \{{‖Σ - A‖}_{F} : A \in C\}$ denotes the Euclidean distance from $Σ$ to $C$ , with ${‖\cdot‖}_{F}$ being the Frobenius norm. The distance penalty pulls the solution $\hat{Σ}$ towards $C$ and equals zero precisely when $Σ \in C$ . This novel formulation entails solving an unconstrained optimization problem, but coincides with the original objective (2) in the limit as $ρ \to \infty$ . This is summarized in the following restatement of the classical penalty method (Courant, 1943).

PROPOSITION 1. Suppose that both the loss $f (x)$ and the nonnegative penalty $p (x)$ are continuous on $ℝ^{p}$ and that the penalized objectives

F_{n} (x) = f (x) + ρ_{n} p (x)

are coercive on $ℝ^{p}$ . For any sequence $ρ_{n}$ increasing to $\infty$ , there is a corresponding sequence of minimizers $x_{n}$ with $f (x_{n}) ⩽ f (x_{n + 1})$ . Further, any cluster point of this sequence resides in the feasible region $S = \{x : p (x) = 0\}$ and attains the minimum value of $f (x)$ . Finally, if $f (x)$ is coercive and possesses a unique minimizer $x^{*}$ in $S$ , then the sequence $x_{n}$ converges to $x^{*}$ .

This result justifies improving the objective (5) while gradually increasing the penalty parameter $ρ$ instead of directly tackling the constrained problem (2). We show in the Appendix that positive definiteness of the sample covariance $S$ is sufficient to satisfy the technical requirement of coercivity. Notably, coercivity fails to hold when $p > n$ , but it can be reintroduced by adding a small multiple of the identity to $S$ . We later observe that this safeguard is numerically unnecessary in practice, with a negligible difference in performance. In its favour, the distance-penalized formulation circumvents explicit consideration of the constraints, evades global shrinkage, and lends itself to the derivation of a practical algorithm via majorization-minimization.
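A one-dimensional toy problem, chosen here purely for illustration and not taken from the paper, shows the behaviour described in Proposition 1. Take $f (x) = (x - 2)^{2}$ with feasible region $x ⩽ 1$ and penalty $p (x) = max (0, x - 1)^{2}$ , the squared distance to the feasible region. Setting the derivative of $f (x) + ρ p (x)$ to zero on $x > 1$ gives the minimizer $x_{ρ} = (2 + ρ) / (1 + ρ)$ . The geometric schedule for $ρ$ below is an assumption.

```python
import numpy as np

def f(x):
    return (x - 2.0) ** 2

def penalized_minimizer(rho):
    # Minimizer of f(x) + rho * max(0, x - 1)^2, from the first-order condition on x > 1.
    return (2.0 + rho) / (1.0 + rho)

rhos = 0.1 * 1.2 ** np.arange(80)          # increasing penalty constants
xs = np.array([penalized_minimizer(r) for r in rhos])
losses = f(xs)                              # nondecreasing, approaching f(1) = 1
```

The minimizers approach the constrained solution $x^{*} = 1$ from outside the feasible region, and the loss values increase monotonically towards $f (x^{*}) = 1$ .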

3. Majorization-minimization

Majorization-minimization algorithms are becoming increasingly popular for solving large-scale optimization problems in statistics and machine learning (Mairal, 2015; Lange, 2016; Xu & Lange, 2019). A majorization-minimization algorithm successively minimizes a sequence of surrogate functions $g (x | x_{k})$ that dominate an objective function $f (x)$ and are tangent to it at the current iterate $x_{k}$ . Decreasing $g (x | x_{k})$ automatically engenders a decrease in $f (x)$ , and a local optimum of $f (x)$ is found by successively minimizing the sequence of surrogates.

Majorization requires two conditions: tangency and domination. Formally, these amount to $g (x_{k} | x_{k}) = f (x_{k})$ and $g (x | x_{k}) ⩾ f (x)$ for every $x$ . The resulting update $x_{k + 1} = arg {min}_{x} g (x | x_{k})$ implies the string of inequalities

f (x_{k + 1}) ⩽ g (x_{k + 1} | x_{k}) ⩽ g (x_{k} | x_{k}) = f (x_{k}),

(6)

validating the descent property. Examination of the proof of the descent property (6) shows that exact minimization of $g (x | x_{k})$ is not strictly necessary, a practical advantage that we will utilize. The celebrated expectation-maximization principle (Dempster et al., 1977) for maximum likelihood estimation is a special case of this principle that relies on the notion of missing data. In this setting the surrogate $g (x | x_{k})$ is defined as the expected value of the complete data loglikelihood given the observed data and the current parameter estimate $x_{k}$ .

The majorization-minimization principle thus offers a general recipe for converting a hard optimization problem into a sequence of manageable subproblems. The distance-to-set penalty $dist {(x, C)}^{2}$ enters this framework through distance majorization (Chi et al., 2014). The key idea is to write the penalty in terms of the Euclidean norm as

dist (x, C) = min_{y \in C} {‖x - y‖}_{2} = {‖x - P_{C} (x)‖}_{2},

where $P_{C} (x)$ denotes the projection of $x$ onto the constraint set $C$ . Squaring the distance term is a practical manoeuvre that leads to differentiability and the simple gradient

\nabla \frac{1}{2} dist {(x, C)}^{2} = x - P_{C} (x)

when $P_{C} (x)$ is single-valued (Lange, 2016). Fortunately, the projection operator $P_{S} (x)$ onto any closed set $S$ is single-valued except for a set of Lebesgue measure zero (Keys et al., 2019). Hence, the technical possibility that $P_{C} (x)$ becomes multi-valued for $C$ nonconvex is vanishingly rare from a theoretical perspective. Indeed, this event is negligible in practice as well: if a multi-valued point is encountered, the user is shielded from this exception because the code automatically selects a point in $P_{C} (x_{k})$ and delivers a valid surrogate. The distance majorization

dist {(x, C)}^{2} = {‖x - P_{C} (x)‖}_{2}^{2} ⩽ {‖x - y_{k}‖}_{2}^{2}

for all $y_{k} \in P_{C} (x_{k})$ follows directly from the definition of the projection operator $P_{C} (x)$ . This majorization is useful in practice because it replaces the distance penalty by a spherically symmetric quadratic with the same gradient at $x_{k}$ .
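The tangency and domination conditions for distance majorization can be checked numerically. The sketch below uses, as an assumed stand-in for $C$ , the set of vectors with at most three nonzero entries; `project_k_sparse` is a hypothetical helper that returns one element of the projection, breaking ties by index.

```python
import numpy as np

def project_k_sparse(x, k):
    """One element of P_C(x) for C = {vectors with at most k nonzeros}: keep the k largest |x_i|."""
    y = np.zeros_like(x)
    if k > 0:
        keep = np.argsort(-np.abs(x), kind="stable")[:k]  # ties broken by index
        y[keep] = x[keep]
    return y

def sq_dist(x, k):
    """Squared Euclidean distance from x to C."""
    return np.sum((x - project_k_sparse(x, k)) ** 2)

rng = np.random.default_rng(1)
k = 3
x_k = rng.standard_normal(10)   # current iterate
y_k = project_k_sparse(x_k, k)  # anchor point of the surrogate

# Domination: dist(x, C)^2 <= ||x - y_k||^2 at arbitrary trial points x.
trial = rng.standard_normal((1000, 10))
dominates = all(sq_dist(x, k) <= np.sum((x - y_k) ** 2) + 1e-12 for x in trial)
# Tangency: equality holds at x = x_k.
tangent = bool(np.isclose(sq_dist(x_k, k), np.sum((x_k - y_k) ** 2)))
```

Domination holds simply because $y_{k}$ belongs to $C$ , so the distance from any $x$ to $C$ cannot exceed its distance to $y_{k}$ .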

Recall that in the present context the relevant constraint set $C$ defined in (3) consists of all symmetric matrices with at most $K$ nonzero off-diagonal entries in their upper triangle. The choice of $K$ determines the level of sparsity. Computing the projection of a symmetric matrix onto $C$ is accomplished by setting to zero every entry except the diagonal and the $K$ largest off-diagonal entries in absolute value of each triangle. Because we have not yet imposed the constraint of positive definiteness, this projection is computed simply by hard-thresholding entries in the upper triangle and propagating the results to the lower triangle symmetrically. The next section describes how distance majorization creates a sequence of unconstrained smooth problems, as well as how the positive-definiteness constraint can be enforced via simple backtracking.
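The projection onto the symmetric sparsity set can be sketched in a few lines. In the code below the function name `project_sparse_sym`, the symbol `K` for the number of retained off-diagonal entries per triangle, and the example matrix are illustrative assumptions; ties are broken by index, which selects one element of the projection when it is multi-valued.

```python
import numpy as np

def project_sparse_sym(Sigma, K):
    """Keep the diagonal and the K largest |entries| of the strict upper triangle; mirror them."""
    iu = np.triu_indices(Sigma.shape[0], k=1)   # indices of the strict upper triangle
    vals = Sigma[iu]
    keep = np.argsort(-np.abs(vals), kind="stable")[:K]
    P = np.diag(np.diag(Sigma)).astype(float)   # diagonal entries are unconstrained
    P[iu[0][keep], iu[1][keep]] = vals[keep]    # hard-threshold the upper triangle
    P[iu[1][keep], iu[0][keep]] = vals[keep]    # propagate to the lower triangle
    return P

A = np.array([[ 2.0,  0.9, -0.1,  0.3],
              [ 0.9,  1.5,  0.05, -0.7],
              [-0.1,  0.05, 1.0,  0.2],
              [ 0.3, -0.7,  0.2,  3.0]])
P = project_sparse_sym(A, K=2)  # retains the entries 0.9 and -0.7 in each triangle
```

No positive-definiteness check appears here, in line with the text: the cone constraint is handled separately by backtracking.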

4. Algorithm for sparse covariance estimation

4.1. A proximal distance algorithm

The recently introduced proximal distance principle (Keys et al., 2019) replaces the constrained problem ${min}_{x \in C} f (x)$ by unconstrained minimization of the penalized loss $f (x) + (ρ / 2) dist {(x, C)}^{2}$ . In our setting, the loss (2) plays the role of $f (x)$ under the sparsity constraint (3). The unconstrained reformulation (5) can then be solved using distance majorization. Proposition 1 implies that if $ρ$ is sufficiently large, then the solution of the penalized problem accurately approximates the solution of the constrained problem. For any given value of $ρ$ , applying the proximal distance principle requires majorizing the objective $h_{ρ} (Σ)$ by the function $g_{ρ} (x | x_{k}) = f (x) + (ρ / 2) {‖x - P_{C} (x_{k})‖}_{2}^{2}$ . Now, the minimizer of this surrogate function is given by a proximal operator. Recall that for any function $r (x)$ , the proximal operator is defined as

{prox}_{λ r} (y) \equiv \underset{x}{arg min} r (x) + \frac{1}{2 λ} {‖x - y‖}_{2}^{2}

with $λ = ρ^{- 1}$ . The operator ${prox}_{λ r} (y)$ represents a compromise between minimizing $r (x)$ and hewing towards $y$ , with the parameter $λ$ modulating the trade-off; Polson et al. (2015) provide an excellent overview of proximal methods in statistics.

Because it is not possible to find an analytical expression for the proximal operator of $f (x)$ , we cannot easily minimize the surrogate function $g_{ρ} (Σ | Σ_{k})$ generated by the covariance likelihood. Instead, we construct a more useful surrogate function $q_{ρ} (Σ | Σ_{k})$ using a local quadratic approximation tailored to $g_{ρ} (Σ | Σ_{k})$ . This local surrogate possesses three advantages over linear surrogates used in past approaches (Bien & Tibshirani, 2011; Azose & Raftery, 2018). First, a quadratic surrogate provides a tighter approximation than the linear surrogates previously applied in this problem, often translating to dramatically more efficient steps towards the optimum. Second, the proximal operator of our surrogate $q_{ρ} (Σ | Σ_{k})$ admits a closed-form solution; each subproblem can be minimized exactly, in contrast to gradient steps whose progress and stability depend heavily on the choice of step sizes. Finally, by exploiting a surprising connection to control theory, evaluation of these closed solutions becomes practical, effecting a reduction in computational complexity from $O (p^{6})$ using a naive evaluation to a more tractable $O (p^{3})$ .

4.2. Constructing and minimizing the surrogate

Recall that the relevant loss is $f (Σ) = log det Σ + tr (Σ^{- 1} S)$ . To define a sequence of quadratic approximations to $f (Σ)$ , we take matrix directional derivatives of $f (Σ)$ in directions $U$ and $V$ ,

d_{V} f (Σ) = tr (Σ^{- 1} V) - tr (Σ^{- 1} V Σ^{- 1} S),

d_{U} d_{V} f (Σ) = - tr (Σ^{- 1} U Σ^{- 1} V) + tr (Σ^{- 1} U Σ^{- 1} V Σ^{- 1} S) + tr (Σ^{- 1} V Σ^{- 1} U Σ^{- 1} S),

V^{T} d^{2} f (Σ) V = - tr (Σ^{- 1} V Σ^{- 1} V) + 2 tr (Σ^{- 1} V Σ^{- 1} V Σ^{- 1} S),

where the quadratic form in the last line is obtained by setting $U = V$ . The second differential simplifies considerably if we replace $S$ by its expected value $E (S) = Σ$ , a manoeuvre familiar from the derivation of Fisher's scoring algorithm. This substitution precipitates a cancellation of higher-order terms; the overall result

V^{T} d^{2} f (Σ) V \approx tr (Σ^{- 1} V Σ^{- 1} V)

is a positive-definite quadratic form. We may now define an approximate quadratic surrogate $q_{ρ} (Σ | Σ_{k})$ of the penalized objective $h_{ρ} (Σ)$ by taking a second-order Taylor expansion of the loss $f (Σ)$ about the current estimate:

\begin{array}{l} q_{ρ} (Σ | Σ_{k}) = f (Σ_{k}) + tr \{Σ_{k}^{- 1} (Σ - Σ_{k})\} - tr \{Σ_{k}^{- 1} S Σ_{k}^{- 1} (Σ - Σ_{k})\} \\ + \frac{1}{2} tr \{Σ_{k}^{- 1} (Σ - Σ_{k}) Σ_{k}^{- 1} (Σ - Σ_{k})\} + \frac{ρ}{2} {‖Σ - P_{C} (Σ_{k})‖}_{F}^{2} . \end{array}

Here the majorized distance penalty appears as the final term. In contrast to an $ℓ_{1}$ -penalized loss, this surrogate is differentiable with gradient expression

\frac{d}{dΣ} q_{ρ} (Σ | Σ_{k}) = Σ_{k}^{- 1} - Σ_{k}^{- 1} S Σ_{k}^{- 1} + Σ_{k}^{- 1} (Σ - Σ_{k}) Σ_{k}^{- 1} + ρ \{Σ - P_{C} (Σ_{k})\} .
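The differentials of $f (Σ)$ used to build the surrogate can be verified against central finite differences. The sketch below does so at a randomly generated positive-definite $Σ$ and symmetric direction $V$ ; the function names, the test matrices and the step `t` are assumptions made for illustration. It also confirms that substituting $Σ$ for $S$ collapses the second differential to the scoring form $tr (Σ^{- 1} V Σ^{- 1} V)$ .

```python
import numpy as np

def f(Sigma, S):
    return np.linalg.slogdet(Sigma)[1] + np.trace(np.linalg.solve(Sigma, S))

def d1(Sigma, S, V):
    """First directional derivative d_V f(Sigma)."""
    Si = np.linalg.inv(Sigma)
    return np.trace(Si @ V) - np.trace(Si @ V @ Si @ S)

def d2(Sigma, S, V):
    """Second differential V^T d^2 f(Sigma) V."""
    Si = np.linalg.inv(Sigma)
    A = Si @ V
    return -np.trace(A @ A) + 2.0 * np.trace(A @ A @ Si @ S)

rng = np.random.default_rng(2)
p = 5
B = rng.standard_normal((p, p))
Sigma = B @ B.T + p * np.eye(p)      # well-conditioned positive-definite point
X = rng.standard_normal((40, p))
S = X.T @ X / 40
V = rng.standard_normal((p, p))
V = (V + V.T) / 2                    # symmetric direction

t = 1e-3
fd1 = (f(Sigma + t * V, S) - f(Sigma - t * V, S)) / (2 * t)
fd2 = (f(Sigma + t * V, S) - 2 * f(Sigma, S) + f(Sigma - t * V, S)) / t ** 2

A = np.linalg.solve(Sigma, V)
scoring = np.trace(A @ A)            # second differential with S replaced by Sigma
```

The scoring form equals the squared Frobenius norm of $Σ^{- 1 / 2} V Σ^{- 1 / 2}$ , which is why it is positive for every nonzero symmetric $V$ .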

Equating the gradient to 0 and rearranging yields the stationarity equation

ρ P_{C} (Σ_{k}) + Σ_{k}^{- 1} S Σ_{k}^{- 1} = ρ Σ + Σ_{k}^{- 1} Σ Σ_{k}^{- 1} .

(7)

If we abbreviate the left-hand side by

C_{k} \equiv ρ P_{C} (Σ_{k}) + Σ_{k}^{- 1} S Σ_{k}^{- 1}

and stack matrices into vectorized notation, then (7) can be rewritten as

vec (C_{k}) = ρ vec (Σ) + (Σ_{k}^{- 1} \otimes Σ_{k}^{- 1}) vec (Σ),

where $\otimes$ denotes the Kronecker product. Upon inversion, the solution amounts to

vec (\hat{Σ}) = {\{ρ I_{p^{2}} + (Σ_{k}^{- 1} \otimes Σ_{k}^{- 1})\}}^{- 1} vec (C_{k}),

(8)

and we can recover the minimizer $\hat{Σ}$ by reshaping.

The analytical solution (8) involves the inverse of a $p^{2} \times p^{2}$ matrix and hence scales as $O (p^{6})$ . This computational load puts problems of even moderate dimension $p$ beyond reach. Upon multiplication of both sides on the left by the constant matrix $Σ_{k}$ and closer inspection, (7) takes the general form $A Σ + Σ B = C$ with $A = ρ Σ_{k}$ , $B = Σ_{k}^{- 1}$ and right-hand side $C = Σ_{k} C_{k}$ , which we recognize as a Sylvester equation in $Σ$ . Like the closely related and better-known Lyapunov equations arising in dynamical systems, Sylvester equations are well studied in control theory and eigenvalue problems (Higham, 2002). It is known that the equation has a unique solution if and only if $A$ and $- B$ have no eigenvalue in common; this condition is satisfied in the present case because $Σ_{k}$ and $Σ_{k}^{- 1}$ are positive definite, so that the eigenvalues of $A$ are positive and those of $- B$ are negative. More pertinently, we can borrow a numerical method from the control theory literature. An algorithm due to Bartels & Stewart (1972) provides a more efficient solution than direct evaluation of (8). The first step and crux of the procedure consists of transforming the problem into Schur form by computing decompositions $A = U R U^{T}$ and $B = V S V^{T}$ via the QR algorithm. Because $R$ and $S$ are upper triangular, the equivalent upper triangular system $R Y + Y S = U^{T} C V$ with $Y = U^{T} Σ V$ can be solved by simple back-substitution. Multiplication then recovers the original solution $Σ = U Y V^{T}$ . The computational complexity declines from $O (p^{6})$ operations required to compute formula (8) to $O (p^{3})$ . Current state-of-the-art implementations are variations on this theme and have the same overall complexity; see Simoncini (2016) for details.
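The gap between the two solution routes can be illustrated numerically. The sketch below solves the stationarity equation (7) once through the Kronecker system (8) and once through an $O (p^{3})$ shortcut. This shortcut is not the Bartels & Stewart algorithm: it is a substitution that exploits the symmetry of $Σ_{k}$ , diagonalizing $Σ_{k}^{- 1} = Q diag (μ) Q^{T}$ so that $Y = Q^{T} Σ Q$ satisfies $(ρ + μ_{i} μ_{j}) Y_{i j} = (Q^{T} C_{k} Q)_{i j}$ entrywise. All names, the value of $ρ$ and the diagonal stand-in for $P_{C} (Σ_{k})$ are assumptions.

```python
import numpy as np

def solve_kron(Sigma_k, C_k, rho):
    """Direct evaluation of (8); costs O(p^6) because of the p^2 x p^2 linear system."""
    p = Sigma_k.shape[0]
    Si = np.linalg.inv(Sigma_k)
    M = rho * np.eye(p * p) + np.kron(Si, Si)
    vec_C = C_k.reshape(-1, order="F")          # column-stacking vec operator
    return np.linalg.solve(M, vec_C).reshape(p, p, order="F")

def solve_eig(Sigma_k, C_k, rho):
    """O(p^3) alternative based on the eigendecomposition of the symmetric matrix Sigma_k."""
    lam, Q = np.linalg.eigh(Sigma_k)
    mu = 1.0 / lam                               # eigenvalues of Sigma_k^{-1}
    Y = (Q.T @ C_k @ Q) / (rho + np.outer(mu, mu))
    return Q @ Y @ Q.T

rng = np.random.default_rng(3)
p, rho = 6, 0.5
B = rng.standard_normal((p, p))
Sigma_k = B @ B.T + p * np.eye(p)
X = rng.standard_normal((30, p))
S = X.T @ X / 30
Si = np.linalg.inv(Sigma_k)
P = np.diag(np.diag(Sigma_k))                    # stand-in for P_C(Sigma_k)
C_k = rho * P + Si @ S @ Si

Sig_a = solve_kron(Sigma_k, C_k, rho)
Sig_b = solve_eig(Sigma_k, C_k, rho)
residual = rho * Sig_b + Si @ Sig_b @ Si - C_k   # should vanish up to rounding
```

For a general Sylvester equation without this symmetric structure, a library routine such as `scipy.linalg.solve_sylvester` would be the natural choice.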

Before proceeding, we briefly mention that sparse fitting of the sample correlation matrix can exploit the same algorithm with a slight modification that has appeared previously in the literature. Let $R = D^{- 1 / 2} S D^{- 1 / 2}$ denote the sample correlation matrix, where $D = diag (S)$ contains the observed variances. In estimation with $R$ replacing $S$ , we minimize the criterion

log det Θ + tr (Θ^{- 1} R) + \frac{ρ}{2} dist {(Θ, C_{R})}^{2}

over $Θ ≻ 0$ , where $C_{R}$ is the set of $K$ -sparse symmetric matrices with unit diagonal entries. Projection of $Θ_{k}$ onto this set maps the diagonal entries of $Θ_{k}$ to 1 and treats the off-diagonal entries as before.

4.3. Positive definiteness and gradient interpretation

So far, the penalty term in the objective (5) only accounts for the sparsity set constraint $C$ . Because the matrix $ρ P_{C} (Σ_{k}) + Σ_{k}^{- 1} S Σ_{k}^{- 1}$ is not guaranteed to be positive definite, neither is the solution $\hat{Σ}$ that minimizes the surrogate given in (7). Moreover, the approximate surrogate $q_{ρ} (Σ | Σ_{k})$ does not strictly majorize $h_{ρ} (Σ) = f (Σ) + (ρ / 2) dist {(Σ, C)}^{2}$ for all possible $Σ$ , and so naively minimizing $q_{ρ} (Σ | Σ_{k})$ does not necessarily decrease $h_{ρ} (Σ)$ . Both of these issues can be handled gracefully via backtracking. The next proposition ensures the success of step-halving, which amounts to defining

Σ_{k + 1} = Σ_{k} + \frac{1}{2^{s}} (\hat{Σ} - Σ_{k})

(9)

based on the smallest integer $s ⩾ 0$ that renders $Σ_{k + 1} ≻ 0$ and decreases the objective $h_{ρ} (Σ)$ . The result becomes clear after considering the representation

\hat{Σ} = Σ_{k} + v_{k} = Σ_{k} - H_{k}^{- 1} \nabla q_{ρ} (Σ_{k} | Σ_{k}),

(10)

where $H_{k}$ is the scoring approximation to the Hessian; a complete proof of the following proposition appears in the Appendix.

PROPOSITION 2. If $Σ_{k}$ is not a stationary point of $h_{ρ} (Σ)$ , then there exists an integer s such that $Σ_{k + 1}$ given in (9) satisfies $Σ_{k + 1} ≻ 0$ and $h_{ρ} (Σ_{k + 1}) < h_{ρ} (Σ_{k})$ .
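Step-halving as in (9) can be sketched as follows. The helper names, the cap on the number of halvings and the toy objective, which is $f (Σ)$ with $S = I$ and no penalty, are assumptions chosen so that both acceptance conditions come into play.

```python
import numpy as np

def is_pd(M):
    """Positive-definiteness test through a Cholesky factorization."""
    try:
        np.linalg.cholesky(M)
        return True
    except np.linalg.LinAlgError:
        return False

def step_halve(h, Sigma_k, Sigma_hat, max_halvings=50):
    """Return Sigma_k + 2^{-s} (Sigma_hat - Sigma_k) for the smallest s that
    gives a positive-definite iterate and a strict decrease in h."""
    h_k = h(Sigma_k)
    step = Sigma_hat - Sigma_k
    for s in range(max_halvings + 1):
        cand = Sigma_k + step / 2.0 ** s
        if is_pd(cand) and h(cand) < h_k:
            return cand, s
    return Sigma_k, None  # no acceptable step found within the cap

def h(Sigma):
    # Toy objective: log det Sigma + tr(Sigma^{-1}), minimized at the identity.
    return np.linalg.slogdet(Sigma)[1] + np.trace(np.linalg.inv(Sigma))

Sigma_k = 2.0 * np.eye(3)
Sigma_hat = -1.5 * np.eye(3)   # deliberately outside the positive-definite cone
Sigma_next, s = step_halve(h, Sigma_k, Sigma_hat)
```

In this example the full step leaves the cone, the half step is positive definite but increases the objective, and the quarter step is accepted, so `s` equals 2.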

The method is summarized in pseudocode in Algorithm 1, which reveals that the careful technical work behind the preceding analysis is largely hidden from the user's perspective. The algorithm is relatively transparent and easy to implement. Open-source Julia and R code for implementing the method and reproducing the experiments in this paper is available at https://github.com/jasonxu90/spcov.

Recall that the squared distance penalty does not yield an exact constrained solution for finite $ρ$ . The user may optionally perform one final projection of the estimate at convergence onto the sparsity set if exact sparsity is desired.

Before proceeding further, let us pause to compare our surrogate with the surrogate proposed in the sparse covariance method of Bien & Tibshirani (2011). Based on the concave-convex procedure of Yuille & Rangarajan (2003), they employ the tangent-plane majorizer

log det Σ_{k} + tr (Σ_{k}^{- 1} Σ) - p + tr (Σ^{- 1} S) + λ {‖A \circ Σ‖}_{1}

for the $ℓ_{1}$ -penalized objective (4). The resulting majorization-minimization iteration

Σ_{k + 1} = \underset{Σ ≻ 0}{arg min} \{tr (Σ_{k}^{- 1} Σ) + tr (Σ^{- 1} S) + λ {‖A \circ Σ‖}_{1}\}

is carried out via generalized gradient descent (Beck & Teboulle, 2009). The choice of a good step size is crucial for a reasonable rate of convergence in practice. Because the linear approximation only loosely models their objective function

log det Σ + tr (Σ^{- 1} S) + λ {‖A \circ Σ‖}_{1},

a given step size may be well suited at some points, but drastically overshoot the minimum or exit the positive-definite cone at others. Whenever the latter occurs, an additional subproblem must be solved by an alternating-directions method (Boyd et al., 2011). This inner optimization loop slows convergence and is decidedly more difficult to implement than backtracking. Stability can be enhanced by decreasing the initial step size at the expense of more outer iterations. Our quadratic expansion of the loglikelihood produces an approximate surrogate $q_{ρ} (Σ | Σ_{k})$ that hugs our objective $h_{ρ} (Σ)$ more closely. Substitution of a distance penalty for a lasso penalty also gives smoothness. In practice, the minimizer $\hat{Σ}$ of $q_{ρ} (Σ | Σ_{k})$ rarely fails to diminish $h_{ρ} (Σ)$ or leaves the positive-definite cone, so typically we update $Σ_{k + 1} = \hat{Σ}$ without backtracking. These differences translate to substantial performance advantages, as illustrated in § 5.

Finally, a reviewer raised the question of convergence to the global optimum, a valid concern that besets all nonconvex optimization problems. To our disappointment, we initially found that our algorithm was somewhat sensitive to initial guesses $Σ$ close to the sample covariance matrix $S$ . After some experimentation we discovered that the algorithm delivers remarkably stable performance when initiated instead as a diagonal matrix with sample variances appearing along the diagonal. It may be that perturbations of the full $S$ are more likely to lie close to the constraint boundary. This can impede progress and precipitate smaller gradient steps when more backtracking is necessary. Starting from $diag (S)$ or even an identity matrix, the algorithm tends to stay well within the interior of the positive-definite cone, and has more room to learn from the data and consistently reach the optimum.

4.4. Convergence

Recall that due to nonconvexity of the symmetric sparsity set $C$ , it is possible that there exist exceptional points at which the projection operator $P_{C} (Σ)$ is multi-valued. The penalty $dist {(Σ, C)}^{2}$ and in turn the objective $h_{ρ} (Σ)$ are differentiable where $P_{C} (Σ)$ is single-valued, but merely semidifferentiable elsewhere. In contrast, the surrogate $q_{ρ} (Σ | Σ_{k})$ is differentiable regardless of the projected point selected from $P_{C} (Σ_{k})$ .

Although standard convergence results for gradient methods and majorization-minimization algorithms do not immediately apply in proving convergence (Lange, 2016), theoretical guarantees can be established by appealing to the general theory of Zangwill (1969), which encompasses continuous objectives and multi-valued algorithm maps. One can represent our method as an algorithm map $A (Σ)$ taking the current iterate $Σ_{k}$ to the next iterate $Σ_{k + 1}$ . Our novel analysis below treats $A$ as a set-valued map to fully account for the technical possibility that the projection operator is multi-valued, even though the set of points where this can occur has measure zero. The following global convergence result is proved in the Appendix.

Theorem 1. Consider the sequence $Σ_{k + 1} = Σ_{k} + η_{k} v_{k} \in A (Σ_{k})$ generated by the search direction $v_{k}$ of (10) and the step length $η_{k} = arg {min}_{η \in [0, 1]} h_{ρ} (Σ_{k} + η v_{k})$ . If the initial point $Σ_{0}$ is positive definite and the sample covariance matrix $S$ is nonsingular, then the sequence $Σ_{k}$ is bounded and falls within the interior of the positive-definite cone. Furthermore, all of its limit points are stationary points of $h_{ρ} (Σ)$ .

Although the result suggests promising performance despite nonconvexity, we discuss two limitations. First, to simplify mathematical analysis, it supposes an exact line search. This assumption can be relaxed at the expense of a more complicated proof. Second, although the algorithm invariably converges, the theorem cannot guarantee convergence to a global minimizer. It simply says that a convergent subsequence exists whose limit is a stationary point $Σ$ , satisfying

\nabla g_{ρ} (Σ | Σ) = Σ^{- 1} - Σ^{- 1} S Σ^{- 1} + ρ (Σ - Θ) = 0

(11)

for some $Θ \in P_{C} (Σ)$ . As we expect, this stationarity condition is necessary for $Σ$ to furnish a global minimum. Indeed, if it fails, we take $Θ \in P_{C} (Σ)$ with $\nabla g_{ρ} (Σ | Σ) \neq 0$ . Then the negative gradient $- \nabla g_{ρ} (Σ | Σ)$ is a descent direction for $g_{ρ} (Σ | Σ)$ , which majorizes $h_{ρ} (Σ)$ . Hence, $- \nabla g_{ρ} (Σ | Σ)$ would also be a descent direction for $h_{ρ} (Σ)$ , contradicting even local optimality of $Σ$ . Leveraging majorizing surrogates in this fashion establishes directional stationarity, the strongest kind of stationarity in semidifferentiable optimization (Pang et al., 2017; Cui et al., 2018), while avoiding the complications that often come with checking the condition explicitly.

5. Empirical Results

5.1. Simulation study

We illustrate the practical merits of our method on a suite of simulated examples. An open-source Julia implementation of the algorithm is available from the first author's website. In all examples, we initialize our algorithm from the diagonal matrix of sample variances. In practice, taking $Σ_{0} = S$ leads to excessive backtracking in some runs. We initialize $ρ$ at 0.1 and increase it by a factor of 1.2 each iteration. Convergence is declared based on a relative tolerance of $10^{- 6}$ , and similarly when reporting the sparsity level of solutions $\hat{Σ}$ we count those entries greater than $10^{- 6}$ , regarding the others as being effectively zero up to numerical tolerance.
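The settings just described can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, the exact form of the relative-tolerance test is an assumption since the text does not specify the monitored quantity, and entries are compared in absolute value when counting nonzeros.

```python
def rho_schedule(rho0=0.1, factor=1.2, n_iter=10):
    """Penalty constants rho_0, rho_0 * factor, rho_0 * factor**2, ..."""
    return [rho0 * factor**k for k in range(n_iter)]

def has_converged(new_value, old_value, rtol=1e-6):
    """Assumed relative-change test on successive objective values."""
    return abs(new_value - old_value) <= rtol * (abs(old_value) + 1.0)

def count_nonzeros(matrix, tol=1e-6):
    """Count entries whose magnitude exceeds the numerical tolerance."""
    return sum(1 for row in matrix for x in row if abs(x) > tol)
```

With the default arguments the schedule begins 0.1, 0.12, 0.144, so $ρ$ grows geometrically and exceeds its initial value by roughly a factor of 38 after 20 iterations.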

Figure 1 summarizes results for a synthetic data design taken from a study by Bien & Tibshirani (2011); further details appear in the Appendix. Each of three variants of the underlying model, independent, moving average and cliques, exhibits 8% sparsity (nonzero entries) with $p > n$ . Following the analysis of Bien & Tibshirani (2011), we report performance as measured by the entropy loss $tr (Σ^{- 1} \hat{Σ}) - log det (Σ^{- 1} \hat{Σ}) - p$ in Fig. 1 and in terms of receiver operating characteristic curves in the Appendix. Previously entropy loss has been recommended as a measure when the covariance matrix is the primary object of interest (Huang et al., 2006; Levina et al., 2008); the role of $Σ$ in the entropy loss is analogous to how $Ω = Σ^{- 1}$ enters the Kullback-Leibler loss.
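As an illustration of the entropy loss formula above, the following Python sketch computes $tr (Σ^{- 1} \hat{Σ}) - log det (Σ^{- 1} \hat{Σ}) - p$ with NumPy. The function name `entropy_loss` is hypothetical, and both arguments are assumed to be positive-definite arrays of the same size.

```python
import numpy as np

def entropy_loss(sigma_true, sigma_hat):
    """Entropy loss tr(M) - log det(M) - p with M = inv(sigma_true) @ sigma_hat."""
    p = sigma_true.shape[0]
    m = np.linalg.solve(sigma_true, sigma_hat)  # avoids forming the inverse explicitly
    sign, logdet = np.linalg.slogdet(m)
    if sign <= 0:
        raise ValueError("inv(sigma_true) @ sigma_hat must have a positive determinant")
    return np.trace(m) - logdet - p
```

The loss is zero when the estimate equals the truth and positive otherwise, since $x - log x - 1 ⩾ 0$ for every eigenvalue $x > 0$ of $Σ^{- 1} \hat{Σ}$ .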

Fig. 1. — Entropy loss of estimates under each method plotted over five repeat trials for three variants of the underlying model: (a) independent or random, (b) moving average, and (c) cliques. The black dashed lines represent proximal distance results; the grey solid, dotted and dot-dash lines represent the thresholding, generalized gradient and adaptive generalized gradient methods, respectively. The vertical line in each panel marks the true number of nonzeros.

Figure 1 shows a clear performance advantage of the proximal distance algorithm that becomes more pronounced in high-dimensional settings. In reproducing the results of Bien & Tibshirani (2011), we confirm that calls to the alternating-directions method to enforce positive definiteness are relatively rare when $n > p$ . However, this is not the case in the high-dimensional regime where the sample covariance is not of full rank. In our experience, numerical errors arise in switching between generalized gradient steps and alternating-directions method corrections. The results depicted in Fig. 1 reveal some of this instability over five random replicate trials, most notably for the adaptive version of the generalized gradient method, which uses reciprocals of the entries in the sample covariance as weights in the penalty. At best, it is necessary to significantly reduce the step size, resulting in slower progress.

Next, we perform a detailed comparison with the soft- and hard-thresholding methods of Bickel & Levina (2008a) under the optimal thresholding suggested by Fang et al. (2016), as well as with the penalized log-barrier method of Rothman (2012). We omit generalized gradient descent (Bien & Tibshirani, 2011) in this second study because of its excessive runtimes under cross-validation, and we remark that the method of Xue et al. (2012) yields nearly identical performance to the penalized log-barrier method we consider. We evaluate the entropy loss and root mean squared error between $Σ$ and $\hat{Σ}$ , and report false positive and false negative rates for identifying the nonzero entries in $Σ$ . The results are presented in Tables 1–3.

Table 1.

False positive and false negative rates: percentages of false positives (left) and false negatives (right) over 50 replications; all methods are tuned using five-fold cross-validation. Standard errors are omitted; the largest standard error in the first column was 0.027

p	Proximal distance	Soft threshold	Hard threshold	Log barrier
20	0.1 / 0.0	18.9 / 0.0	0.1 / 0.3	7.8 / 0.0
30	0.2 / 0.6	12.4 / 0.0	0.2 / 2.7	5.8 / 0.0
50	0.4 / 1.9	8.4 / 0.0	0.1 / 7.2	3.9 / 0.3
100	0.5 / 17.8	4.4 / 8.5	0.1 / 42.0	3.2 / 9.7
200	1.0 / 42.4	4.3 / 33.2	0.0 / 79.7	1.1 / 50.1

Table 3.

Average root mean squared error: comparison of methods in terms of root mean squared error (with standard errors in parentheses) over 50 replications; all methods are tuned using five-fold cross-validation

p	Proximal distance	Soft threshold	Hard threshold	Log barrier
20	0.050 (0.011)	0.092 (0.012)	0.062 (0.015)	0.078 (0.011)
30	0.061 (0.012)	0.102 (0.009)	0.073 (0.016)	0.085 (0.008)
50	0.081 (0.011)	0.096 (0.006)	0.079 (0.010)	0.080 (0.008)
100	0.118 (0.008)	0.124 (0.005)	0.128 (0.007)	0.118 (0.004)
200	0.141 (0.005)	0.148 (0.002)	0.150 (0.001)	0.143 (0.002)

We vary the number $p$ of features from 20 to 200. Under each setting, 50 replicate datasets of size $n = 100$ are generated using a true covariance matrix with 2% sparsity. We remark that when $K$ is known, it can be directly specified in our method. In contrast, hyperparameter tuning remains necessary with known $K$ under shrinkage penalties. Nonetheless, we select $K$ and the tuning constants of the competing methods by five-fold cross-validation to allow a generous comparison; details are given in the Appendix. Table 2 shows that the proximal distance algorithm achieves lower average entropy loss than thresholding and the log-barrier penalized method, a trend that is also evident from the root mean squared error comparisons in Table 3.

Table 2.

Average entropy loss: comparison of methods in terms of entropy loss (with standard errors in parentheses) over 50 replications; all methods are tuned using five-fold cross-validation

p	Proximal distance	Soft threshold	Hard threshold	Log barrier
20	0.28 (0.09)	0.94 (0.20)	0.49 (0.43)	2.01 (0.6)
30	0.61 (0.27)	2.35 (0.63)	1.40 (0.98)	4.6 (0.9)
50	2.11 (0.81)	6.24 (1.16)	5.48 (2.59)	11.7 (1.2)
100	17.6 (3.3)	28.7 (2.6)	43.7 (10.2)	42.6 (4.1)
200	119.6 (6.3)	140.3 (4.4)	206.3 (6.6)	179.8 (5.1)

Table 1 shows that hard thresholding typically yields the lowest false positive rates, often at the expense of an alarmingly high false negative rate. In contrast, our method produces a comparable false positive rate while introducing strikingly fewer false negatives. As expected, soft thresholding introduces many false positives in all cases. The log-barrier penalized approach shows a qualitatively similar trend to soft thresholding, but tends to strike a better balance, exhibiting a noticeably lower false positive rate at the cost of a minor increase in false negatives. As we increase $p$ , both soft thresholding and the log-barrier penalized method begin to suffer a false negative rate comparable to that of the proximal distance algorithm. Even when $p$ is small, the existing methods introduce a nontrivial number of either false positives or false negatives, while the proximal distance algorithm can maintain a low rate on both fronts. Finally, although our proposed method is a nonconvex formulation, we did not observe the algorithm stopping short at local minima. For a fixed synthetic dataset, perturbing the initial guess over 20 trials consistently delivered the same optimum. In the results reported above, we run one instance of each algorithm per simulated dataset. If the proximal distance method did converge to inferior local optima in some trials, taking the best of several random restarts could only improve its reported performance.
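The error rates in Table 1 can be computed along the following lines. This Python sketch assumes that rates are taken over the off-diagonal pairs, that an entry counts as nonzero when its magnitude exceeds $10^{- 6}$ , and that a false positive is a true zero estimated as nonzero; the function name `support_error_rates` is hypothetical.

```python
import numpy as np

def support_error_rates(sigma_true, sigma_hat, tol=1e-6):
    """Return (false positive %, false negative %) over off-diagonal pairs."""
    p = sigma_true.shape[0]
    iu = np.triu_indices(p, k=1)
    true_nz = np.abs(sigma_true[iu]) > tol
    est_nz = np.abs(sigma_hat[iu]) > tol
    n_zero = np.count_nonzero(~true_nz)
    n_nonzero = np.count_nonzero(true_nz)
    # False positives are measured against true zeros, false negatives against true nonzeros.
    fp = np.count_nonzero(est_nz & ~true_nz) / n_zero if n_zero else 0.0
    fn = np.count_nonzero(~est_nz & true_nz) / n_nonzero if n_nonzero else 0.0
    return 100.0 * fp, 100.0 * fn
```

Because true nonzeros are rare under the sparsity levels considered, a handful of missed entries can produce a large false negative percentage even when the false positive percentage is small.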

While we omit a detailed runtime comparison because of differences in implementations across programming languages, we report average runtimes of our proposed method as $p$ increases beyond the scope of the previous simulations. Figure 2 reveals that for the largest case we consider with $p = 5000$ , in which there are tens of millions of free parameters under the patternless sparsity assumption, the problem remains tractable with a runtime of under 2.5 hours on a standard laptop computer. Most settings complete in seconds, and Fig. 2(b) shows that the runtime scales roughly as $p^{3}$ . In contrast, Xue et al. (2012) reported that the log-barrier method becomes unwieldy for $p > 200$ , while existing likelihood-based methods such as the generalized gradient method in the first simulation study are even slower by a large margin.

Fig. 2. — (a) Runtime of the proximal distance algorithm and (b) cube root of the runtime as functions of the dimension $p$ ; results are averaged over single runs without cross-validation.

5.2. International migration data

Projecting international migration at the country-specific scale is important in shaping policy decisions that arise in social welfare and economic planning. Probabilistic projections are desirable for quantifying uncertainty in these 'barely predictable' global processes (Bijak & Wiśniowski, 2010). Existing global models typically assume that forecast errors are uncorrelated across countries. Although modelling under the independence assumption may be well calibrated for individual countries, ignoring correlations will yield under- or overestimates in projections.

We consider international migration forecast data from the United Nations World Population Prospects division. The data consist of net migration estimates every five years in each country from 1950 to 2010. Following Azose & Raftery (2018), our goal is to estimate the correlation structure among forecast errors. The observations $ϵ_{t} (t = 1, \dots, 11)$ are residual vectors from an AR(1) model for net migration between all countries; the $ϵ_{t}$ are assumed to be independent and identically distributed according to a multivariate normal distribution. We base inference on a small available sample of $n = 11$ measurements, seeking to estimate a correlation matrix $R$ whose roughly 18000 distinct off-diagonal entries correspond to the pairs among $p = 191$ countries. The Pearson sample correlation is known to degrade in such settings and to suggest spurious correlations. Azose & Raftery (2018) considered a Bayesian model that shrinks a priori untrustworthy elements towards zero. This is achieved by penalizing country pairs that are far apart, do not share a colonial relationship, or are located in different regions. These penalties reflect the UN World Population Prospects partition of the globe into 22 regions based on geographical and cultural affinity. The authors employed a slight modification of the approach proposed by Bien & Tibshirani (2011) to extract maximum a posteriori estimates. Azose & Raftery (2018) observed that the method is slow on a problem of this size and renders cross-validation infeasible, instead choosing the $ℓ_{1}$ -penalty parameter $λ$ according to a manual heuristic.

In contrast, the analogous study using five-fold cross-validation with the proximal distance method completes in under a minute on a standard laptop computer. Our estimates deliver sparser solutions than the estimates of Azose & Raftery (2018) under an $ℓ_{1}$ penalty; we record 6145 zeros versus their 323. To illustrate the difference in estimates, we consider a random subset of five countries. Despite the absence of prior knowledge, the proximal distance method reveals relationships that are qualitatively consistent with the criteria used by Azose & Raftery (2018) in the design of their prior. As is apparent from Fig. 3, the zero entries estimated under the proximal distance method correspond to country pairs that occur in different blocks of the UN partition. In contrast, the method of Azose & Raftery (2018) produces small, but nonzero entries, for these pairs and, in general, does not yield a sparse solution.

Fig. 3. — Estimated correlations on a random subset of countries obtained by (a) the proximal distance method, (b) sample correlation, and (c) the method of Azose & Raftery (2018). The proximal distance algorithm results in a sensible sparsity pattern based on criteria suggested by Azose & Raftery (2018); though not visibly obvious, their maximum a posteriori estimate produces no sparse entries on this subset.

To compare the quality of the estimates quantitatively, we recommend using the extended Bayesian information criterion. Although this criterion tends to be more suitable in high-dimensional settings, for completeness we also report in Table 4 results in terms of the standard Akaike information criterion and Bayesian information criterion. As anticipated, the denser estimate of Azose & Raftery (2018) achieves a lower negative loglikelihood on the data, but the measures accounting for model complexity favour our sparse solution. Given the limited amount of data, we hesitate to conclude that our estimate is definitively preferable. Indeed, the use of sensible prior knowledge in such a setting is prudent. Despite ignoring a priori information, it is noteworthy that our method is competitive with an ostensibly more tailored approach to the data at hand.

Table 4.

Model selection criteria: comparison of the maximum a posteriori estimate obtained by the Bayesian shrinkage approach of Azose & Raftery (2018) and the proposed proximal distance estimate in terms of the negative loglikelihood, extended Bayesian information criterion, Bayesian information criterion and Akaike information criterion

	$\hat{L}$	EBIC	BIC	AIC
Bayesian shrinkage	−572.1	43208.1	41591.1	34499.8
Proximal distance	−341.7	39701.3	28091.3	23216.6

EBIC, extended Bayesian information criterion; BIC, Bayesian information criterion; AIC, Akaike information criterion.

5.3. Flow cytometry data

In our final case study we take a closer look at the marginal and conditional dependency structures in a classic cell signalling study. We revisit the experiment studied by Sachs et al. (2005) involving flow cytometry measurements on $p = 11$ proteins and $n = 7466$ cells. This dataset was previously analysed in the original graphical lasso paper (Friedman et al., 2008) and in a study of $ℓ_{1}$ -penalized covariance estimation (Bien & Tibshirani, 2011). We produce two estimates of the conditional dependency or Markov graph using the graphical lasso with $K = 9$ and $K = 16$ edges. Bien & Tibshirani (2011) used their method to estimate the marginal dependency graph, which does not coincide with estimates of the Markov graph at matched sparsity levels. This is no surprise since the underlying models offer distinct interpretations. A missing edge in the covariance graph tells us that the concentration of one protein gives no information about the concentration of the other, whereas a missing edge in the Markov graph indicates that the concentration of one protein gives no information about the concentration of the other conditional on all other concentrations. While this difference is crucial, our results suggest that the covariance graph may be more similar to the Markov graph than past studies based on $ℓ_{1}$ penalties suggest.

Figure 4 displays covariance graphs obtained by running the generalized gradient descent algorithm of Bien & Tibshirani (2011), and our proximal distance algorithm at sparsity levels matched to the Markov graphs. It is visually clear that our estimate of $\hat{Σ}$ shares more edges with the Markov graph. Although the true covariance graph and Markov graph do not necessarily coincide, these results suggest that the difference between the two in these data may be overstated because of $ℓ_{1}$ shrinkage or convergence to a poor local minimum under generalized gradient descent. It is again difficult to produce a complete range of sparsity levels under an $ℓ_{1}$ penalty. For instance, the generalized gradient estimate in the bottom row of Fig. 4 features one fewer edge than desired, though it yields the closest sparsity estimate before transitioning to $K = 11$ edges over a grid search of mesh size $10^{- 7}$ for the penalty constants. Even in the extreme case, not pictured, where penalty constants are chosen to yield only $K = 1$ edge, the proximal distance algorithm and graphical lasso agree in producing the edge Mek-Raf, while the generalized gradient algorithm selects the sole edge Erk-Akt. Once again we see that the proximal distance algorithm allows us to directly specify the sparsity level $K$ , whereas $ℓ_{1}$ penalization requires tedious calibration to match the penalty constant $λ$ to $K$ . This compact example emphasizes both the computational advantages of the proximal distance algorithm and its ability to deliver dependable solutions uncontaminated by excess shrinkage.

Fig. 4. — Estimated covariance graphs under generalized gradient descent (left) and the proximal distance algorithm (right), compared with the graphical lasso estimate of the Markov graph (middle). The top and bottom rows display two settings in which sparsity levels are matched between methods. In each case, the proximal distance algorithm produces more edges in common with the Markov graph.

6. Discussion

Although our proximal distance algorithm substantially improves the stability, speed and accuracy of sparse covariance matrix estimation, it is hard to avoid a computational complexity of $O (p^{3})$ . For instance, formation of the left-hand side of (7) requires dense matrix inversion and multiplication. One could possibly solve (7) by an iterative algorithm rather than the Bartels & Stewart (1972) algorithm. For instance, the iteration scheme

Θ_{j + 1} = \frac{1}{ρ} D_{k} - \frac{1}{ρ} Σ_{k}^{- 1} Θ_{j} Σ_{k}^{- 1}

converges to $Σ_{k + 1}$ provided that ${‖Σ_{k}^{- 1}‖}_{F}^{2} < ρ$ .
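The iteration above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: $D_{k}$ is defined alongside (7) earlier in the paper and is treated here as a given symmetric matrix `d_k`, the function name and stopping rule are hypothetical, and the explicit inverse is formed only for clarity. The fixed point of the iteration solves the linear equation $ρ Θ + Σ_{k}^{- 1} Θ Σ_{k}^{- 1} = D_{k}$ .

```python
import numpy as np

def solve_by_fixed_point(sigma_k, d_k, rho, tol=1e-12, max_iter=10_000):
    """Iterate Theta <- (D_k - inv(Sigma_k) Theta inv(Sigma_k)) / rho to a fixed point."""
    a = np.linalg.inv(sigma_k)
    if np.linalg.norm(a, "fro") ** 2 >= rho:
        raise ValueError("sufficient condition ||inv(Sigma_k)||_F^2 < rho is not met")
    theta = d_k / rho  # arbitrary starting point
    for _ in range(max_iter):
        theta_next = (d_k - a @ theta @ a) / rho
        if np.linalg.norm(theta_next - theta) <= tol * max(1.0, np.linalg.norm(theta)):
            return theta_next
        theta = theta_next
    raise RuntimeError("fixed-point iteration did not converge")
```

Under the stated condition the map is a contraction with factor at most ${‖Σ_{k}^{- 1}‖}_{F}^{2} / ρ$ , so the error shrinks geometrically; each sweep still costs two dense matrix multiplications.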

Previous convergence results for proximal distance algorithms do not address nonconvex sets. Although our theory handles the nonconvex sparsity set $C$ , we fix the penalty constant $ρ$ in our analysis. This simplification is justified if we gradually increase $ρ$ and then fix its value, though there remain gaps that warrant further theoretical development of proximal distance algorithms. For example, how large should one take the resting value of $ρ$ , and how quickly should one increment $ρ$ from its initial value? That naively using the same update schedule for $ρ$ works well empirically across the board should be considered an advantage. Nevertheless, a closer analysis of this behaviour would be fruitful, and potentially crucial in other applications.

Despite these gaps, the desirable theoretical properties and empirical prowess of the proposed proximal distance algorithm suggest that the ideas are applicable to a broad range of problems. Projection onto a closed set undergirds the principle. Fortunately, many projection operators are available in the literature, even for nonconvex sets (Bauschke & Combettes, 2011; Beck, 2017; Won et al., 2019). These successes encourage future work to extend penalized likelihood methods for covariance estimation in the patternless sparsity setting. For instance, in related problems, the local linear approximation algorithm succeeds in applying majorization-minimization for sparse estimation under alternative nonconvex penalties (Zou & Li, 2008). Exploring the extent to which our contributions can help tailor such approaches to sparse covariance estimation is nontrivial and provides a fruitful avenue for future work. We invite readers to help advance proximal distance theory and devise their own schemes of this valuable extension of the majorization-minimization principle.

Acknowledgement

We thank Jon Azose and Adrian Raftery for sharing the UN migration forecasting correlations and Jacob Bien for providing the cell signalling data. The second author is also affiliated with the Departments of Statistics and Human Genetics at the University of California, Los Angeles.

Appendix

Proof of Proposition 2

It suffices to show that a small enough step size $s$ decreases $g_{ρ} (Σ | Σ_{k})$ . Recall the form (10) which expresses $\hat{Σ}$ as $Σ_{k} - H_{k}^{- 1} \nabla q_{ρ} (Σ_{k} | Σ_{k})$ , where $H_{k}$ is the scoring approximation obtained by taking the expected value of the second differential $d^{2} g_{ρ} (Σ | Σ_{k})$ . Here we explicitly avoid writing $H_{k}$ as an unwieldy tensor, instead noting that it generates the positive-definite quadratic form $tr (Σ_{k}^{- 1} V Σ_{k}^{- 1} V)$ . In light of the identity $\nabla q_{ρ} (Σ_{k} | Σ_{k}) = \nabla g_{ρ} (Σ_{k} | Σ_{k})$ , the vector $v_{k}$ is a descent direction for $g_{ρ} (Σ | Σ_{k})$ at $Σ_{k}$ . Since the cone of positive-definite matrices is open, step-halving is also guaranteed to keep $Σ_{k + 1}$ positive definite.
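The step-halving argument can be illustrated with a short Python sketch. It assumes a generic objective passed as a callable, uses a Cholesky factorization as the positive-definiteness test, and caps the number of halvings; the names `step_halving`, `is_positive_definite` and `gaussian_loss` are hypothetical, and the Gaussian loss stands in for the surrogate $g_{ρ} (Σ | Σ_{k})$ purely for demonstration.

```python
import numpy as np

def is_positive_definite(m):
    """A symmetric matrix is positive definite iff its Cholesky factorization succeeds."""
    try:
        np.linalg.cholesky(m)
        return True
    except np.linalg.LinAlgError:
        return False

def gaussian_loss(sigma, s):
    """log det(Sigma) + tr(inv(Sigma) S), used here as an example objective."""
    return np.linalg.slogdet(sigma)[1] + np.trace(np.linalg.solve(sigma, s))

def step_halving(sigma_k, v_k, objective, max_halvings=50):
    """Halve the step until the candidate is positive definite and lowers the objective."""
    base = objective(sigma_k)
    eta = 1.0
    for _ in range(max_halvings):
        candidate = sigma_k + eta * v_k
        if is_positive_definite(candidate) and objective(candidate) < base:
            return candidate, eta
        eta *= 0.5
    return sigma_k, 0.0  # no acceptable step found within the cap
```

Because the positive-definite cone is open and $v_{k}$ is a descent direction, the loop terminates after finitely many halvings in exact arithmetic; the cap only guards against rounding effects.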

Proof of Theorem 1

To establish convergence, we invoke Zangwill's global convergence theorem for descent algorithms (Zangwill, 1969; Luenberger & Ye, 1984). Recall that our algorithm map $A (Σ)$ may be set-valued because the projection $P_{C} (Σ)$ onto the sparsity constraint set can be multi-valued. Denote the set of stationary points (11) of $A (Σ)$ by $Γ$ . For convenience we reproduce the theorem statement in our notation.

Theorem A1 (Global convergence theorem). Consider the algorithm $A : X \to P (X)$ defined by a point-to-set map and an initial point $Σ_{0}$ . Let $Γ \subset X$ be a solution set and $Σ_{k + 1} \in A (Σ_{k})$ a sequence generated by $A (Σ)$ . Finally, assume that the following conditions hold.

All iterates $Σ_{k}$ are contained in a compact set $S \subset X$ .
There is a continuous function $h (Σ)$ that satisfies the following:
1. if $Σ \notin Γ$ , then $h (Θ) < h (Σ)$ for all $Θ \in A (Σ)$ ;
2. if $Σ \in Γ$ , then $h (Θ) ⩽ h (Σ)$ for all $Θ \in A (Σ)$ .
The mapping $A$ is closed at points outside $Γ$ .

Then the sequence $Σ_{k}$ has convergent subsequences, and the corresponding limits belong to the solution set.

We begin by proving the coercivity of $h_{ρ} (Σ)$ , which will imply that the sequence $Σ_{k}$ is contained in a compact set. Even if the sample covariance $S$ is singular, running our method instead on $\tilde{S} = S + δ I$ for arbitrarily small $δ > 0$ suffices for the theory to hold. Doing so is reasonable as it is a strictly weaker assumption than replacing the entire constraint $Σ ≻ 0$ by the set $Σ ⪰ δ I$ (Bien & Tibshirani, 2011).

Lemma A1. The objective function $h_{ρ} (Σ)$ of our model is coercive whenever the sample covariance matrix $S$ is nonsingular.

Proof. Since the penalty is nonnegative, it suffices to prove that $f (Σ) = log det Σ + tr (Σ^{- 1} S)$ is coercive. Let the singular values of $Σ$ be denoted by $σ_{1} ⩾ σ_{2} ⩾ \dots > 0$ , and let the singular values of $S$ be denoted by $s_{1} ⩾ s_{2} ⩾ \dots > 0$ . It is clear that $‖Σ‖ \to \infty$ if and only if at least one $σ_{i} \to \infty$ and that $‖Σ^{- 1}‖ \to \infty$ if and only if at least one $σ_{i} \to 0$ . The matrix analogue of the Cauchy-Schwarz inequality due to von Neumann and Fan tells us that $tr (Σ^{- 1} S) ⩾ \sum_{i} s_{i} / σ_{i}$ . We also have $log det Σ = \sum_{i} log σ_{i}$ . Now consider the sum $r (σ) = \sum_{i} (log σ_{i} + s_{i} / σ_{i})$ , which bounds $f (Σ)$ from below. Since each summand satisfies

min_{σ_{i}} (log σ_{i} + \frac{s_{i}}{σ_{i}}) ⩾ log s_{i} + 1,

$r (σ)$ obviously tends to $\infty$ if and only if any $σ_{i}$ tends to 0 or $\infty$ . Equivalently, $f (Σ)$ tends to $\infty$ if and only if either $‖Σ‖$ or $‖Σ^{- 1}‖$ tends to $\infty$ . □
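For completeness, the scalar bound used in the proof follows from elementary calculus. Writing $φ (σ) = log σ + s / σ$ for $σ > 0$ and fixed $s > 0$ , where the symbol $φ$ is introduced here only for this remark, we have

φ' (σ) = \frac{1}{σ} - \frac{s}{σ^{2}} = \frac{σ - s}{σ^{2}},

so $φ$ decreases on $(0, s)$ , increases on $(s, \infty)$ , and attains its minimum value $φ (s) = log s + 1$ at $σ = s$ . Moreover, $φ (σ) \to \infty$ as $σ \to 0$ because $s / σ$ dominates $| log σ |$ , and $φ (σ) \to \infty$ as $σ \to \infty$ because $log σ \to \infty$ while $s / σ > 0$ .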

The proof above shows that if we set $h_{ρ} (Σ) = \infty$ where $Σ$ fails to be positive definite, then $h_{ρ} (Σ)$ is continuous. We will adopt this convention in defining the update $Σ_{k + 1} = Σ_{k} + η_{k} v_{k} \in A (Σ_{k})$ via the choice

η_{k} = \underset{η \in [0, 1]}{arg min} g_{ρ} (Σ_{k} + η v_{k} | Σ_{k}) .

Before proving the next lemma, recall that the surrogate $q_{ρ} (Σ | Σ_{k})$ is minimized by $\hat{Σ} = Σ_{k} + v_{k}$ , where $v_{k} = - H_{k}^{- 1} \nabla q_{ρ} (Σ_{k} | Σ_{k})$ and $H_{k}$ is the approximate second differential generating the quadratic form $V \mapsto tr (Σ_{k}^{- 1} V Σ_{k}^{- 1} V)$ . Elements of the solution set $Γ$ of Zangwill's theorem are characterized by the stationarity condition (11) for some $Θ \in P_{C} (Σ)$ .

Lemma A2. Some point $Θ \in A (Σ)$ decreases the objective $h_{ρ} (Σ)$ , and strictly so when $Σ \notin Γ$ . Furthermore, the algorithm map $A (Σ)$ remains within a compact set and is closed outside $Γ$ .

Proof. By definition the algorithm map decreases $h_{ρ} (Σ)$ . If $Σ$ falls outside $Γ$ , then any associated search direction $v$ can be expressed as $v = - H^{- 1} u$ , where $H$ is positive definite and $u = \nabla q_{ρ} (Σ ∣ Σ)$ is nontrivial for any choice of $Θ \in P_{C} (Σ)$ . Because

d_{v} g_{ρ} (Σ | Σ) = d_{v} q_{ρ} (Σ | Σ) = - u^{T} H^{- 1} u < 0,

it follows that $g_{ρ} (Σ | Σ)$ can be strictly decreased by moving in the direction $v$ . Hence, the objective $h_{ρ} (Σ)$ can be strictly decreased. To prove compactness, observe that $h_{ρ} (Σ)$ is both continuous and coercive. Hence, its sub-level sets $\{Σ : h_{ρ} (Σ) ⩽ c\}$ are compact. Given that the algorithm decreases $h_{ρ} (Σ)$ , all iterates remain within the compact set $\{Σ : h_{ρ} (Σ) ⩽ h_{ρ} (Σ_{0})\}$ .

To prove closedness, consider a sequence $Σ_{k}$ with limit $Σ$ and a corresponding sequence $Θ_{k} \in A (Σ_{k})$ with limit $Θ \notin Γ$ . If $f (Σ)$ is the loss function, then $u_{k} = \nabla f (Σ_{k}) + ρ (Σ_{k} - Θ_{k})$ , where $Θ_{k} \in P_{C} (Σ_{k})$ . The lack of continuity of the projection operator hinders taking limits. However, since there are only finitely many sparsity index sets, one of these sets must be chosen infinitely often along the sequence $Θ_{k}$ . Replace the sequences $Σ_{k}$ and $Θ_{k}$ by the subsequence where this occurs. One can now invoke the continuity of the projection operator and conclude that $Θ = {lim}_{k \to \infty} Θ_{k}$ exists. It follows that

v = lim_{k \to \infty} v_{k} = - H^{- 1} \{\nabla f (Σ) + ρ (Σ - Θ)\}

also exists with $Θ \in P_{C} (Σ)$ . Furthermore, $v \neq 0$ since $Σ \notin Γ$ . The step-length sequence $η_{k}$ also has a limit $η$ defined by

η = lim_{k \to \infty} \frac{{‖Θ_{k} - Σ_{k}‖}_{2}}{{‖H_{k}^{- 1} u_{k}‖}_{2}} = \frac{{‖Θ - Σ‖}_{2}}{{‖v‖}_{2}} .

It remains to prove that $Θ = Σ + η v$ is optimal. Fortunately, this follows by taking limits in the inequality $g (Σ_{k} + η_{k} v_{k}) ⩽ g (Σ_{k} + μ v_{k})$ valid for all $μ \in [0, 1]$ . □

Now we are ready to prove Theorem 1 by a direct application of Zangwill's theorem.

Proof of Theorem 1. The sub-level set $S_{h_{ρ}} (Σ_{0}) = \{Σ : h_{ρ} (Σ) ⩽ h_{ρ} (Σ_{0})\}$ is compact, and by Lemmas A1 and A2 all iterates $Σ_{k + 1} \in A (Σ_{k})$ lie in $S_{h_{ρ}} (Σ_{0})$ . These lemmas further show that (i) $Σ_{k} ≻ 0$ for every $k$ , (ii) $h_{ρ} (Σ)$ is continuous, (iii) $h_{ρ} (Θ) ⩽ h_{ρ} (Σ)$ for all $Θ \in A (Σ)$ , and (iv) the inequality is strict when $Σ \notin Γ$ . Furthermore, the algorithm map $A (Σ)$ is closed outside $Γ$ , the set of stationary points. Therefore, Theorem A1 applies, and the limit of every convergent subsequence of $Σ_{k}$ is a stationary point. □

Fig. A.1. — Receiver operating characteristic curves corresponding to the simulation study and display conventions of Fig. 1: (a) random, (b) moving average, and (c) cliques.

Additional simulation details

The experimental design in the first set of simulations is a direct reproduction of the design in Bien & Tibshirani (2011). The analogous results in terms of receiver operating characteristic curves are displayed in Fig. A.1. Any simulated datasets that fail to produce a positive-definite ground truth covariance matrix are resimulated. Next, all methods are seeded and run on the same synthetic datasets with matched relative tolerance. In all results, the penalty parameter $λ$ for competing methods and the sparsity level $K$ for the proposed method are selected via five-fold cross-validation with respect to Frobenius loss over a vector of 40 possible values, calibrated so that best values do not occur on either boundary of the vector. This follows the recommendations for the implementation of those methods in the R packages CVTuningCov and PDSCE (R Development Core Team, 2022). We remark that cross-validation with respect to entropy loss is more favourable to our proposed method, though reported results in Tables 1–3 are cross-validated under Frobenius loss to provide a conservative comparison against peer methods. The initial value of the parameter $ρ$ is set to 0.1 in all cases considered and is not treated as a tuning parameter.
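The tuning procedure can be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure of the cited R packages: the validation loss is taken to be the Frobenius distance between the estimate fitted on the training folds and the sample covariance of the held-out fold, hard thresholding stands in for a generic estimator, and the names `cv_select` and `hard_threshold` are hypothetical.

```python
import numpy as np

def hard_threshold(s, lam):
    """Example estimator: zero out off-diagonal entries smaller than lam in magnitude."""
    out = np.where(np.abs(s) >= lam, s, 0.0)
    np.fill_diagonal(out, np.diag(s))  # the diagonal is never thresholded
    return out

def cv_select(x, estimator, grid, n_folds=5, seed=0):
    """Pick the grid value minimizing the average held-out Frobenius loss."""
    n = x.shape[0]
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), n_folds)
    losses = np.zeros(len(grid))
    for held_out in folds:
        mask = np.ones(n, dtype=bool)
        mask[held_out] = False
        s_train = np.cov(x[mask], rowvar=False)
        s_valid = np.cov(x[held_out], rowvar=False)
        for j, value in enumerate(grid):
            losses[j] += np.linalg.norm(estimator(s_train, value) - s_valid, "fro")
    return grid[int(np.argmin(losses))], losses / n_folds
```

For the proposed method the grid would range over sparsity levels $K$ instead of thresholds; in either case one should check that the selected value does not fall on the boundary of the grid.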

Contributor Information

JASON XU, Department of Statistical Science, Duke University, Box 90251, Durham, North Carolina 27708, U.S.A.

KENNETH LANGE, Department of Computational Medicine, University of California, Los Angeles, Box 708822, Los Angeles, California 90095, U.S.A.

References

Azose JJ & Raftery AE (2018). Estimating large correlation matrices for international migration. Ann. Appl. Statist 12, 940–70. [DOI] [PMC free article] [PubMed] [Google Scholar]
Bartels RH & Stewart GW (1972). Solution of the matrix equation AX + XB = C. Commun. ACM 15, 820–6. [Google Scholar]
Bauschie HH & Combettes PL (2011). Convex Analysis and Monotone Operator Theory in Hilbert Spaces Cham, Switzerland: Springer. [Google Scholar]
BECK A (2017). First-order Methods in Optimization Philadelphia: Society for Industrial and Applied Mathematics. [Google Scholar]
Beck A & Teboulle M (2009). A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imag. Sci 2, 183–202. [Google Scholar]
BiCKel PJ & LeVINA E (2008a). Covariance regularization by thresholding. Ann. Statist 36, 2577–604. [Google Scholar]
Bickel PJ & LeVInA E (2008b). Regularized estimation of large covariance matrices. Ann. Statist 36, 199–227. [Google Scholar]
Bien J, Bunea F & XiaO L (2016). Convex banding of the covariance matrix. J. Am. Statist. Assoc 111, 834–45. [DOI] [PMC free article] [PubMed] [Google Scholar]
Bien J & TibShIRANI RJ (2011). Sparse estimation of a covariance matrix. Biometrika 98, 807–20. [DOI] [PMC free article] [PubMed] [Google Scholar]
BIJAK J & WIŚNIOWSKI A (2010). Bayesian forecasting of immigration to selected European countries by using expert knowledge. J. R. Statist. Soc. A 173, 775–96. [Google Scholar]
Boyd S, Parikh N, Chu E, Peleato B & Eckstein J (2011). Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundat. Trends Mach. Learn 3, 1–122. [Google Scholar]
CAI T & Liv W (2011). Adaptive thresholding for sparse covariance matrix estimation. J. Am. Statist. Assoc 106, 672–84. [Google Scholar]
CAI TT, ZhANG C-H & ZHOU HH (2010). Optimal rates of convergence for covariance matrix estimation. Ann. Statist 38, 2118–44. [Google Scholar]
Chaudhuri S, Drton M & Richardson TS (2007). Estimation of a covariance matrix with zeros. Biometrika 94, 199–216. [Google Scholar]
Chi EC & Lange K (2014). Stable estimation of a covariance matrix guided by nuclear norm penalties. Comp. Statist. Data Anal 80, 117–28. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chi EC, Zhou H & Lange K (2014). Distance majorization and its applications. Math. Program 146, 409–36. [DOI] [PMC free article] [PubMed] [Google Scholar]
Courant R (1943). Variational methods for the solution of problems of equilibrium and vibrations. Bull. Am. Math. Soc 49, 1–23. [Google Scholar]
Cui Y, Leng C & Sun D (2016). Sparse estimation of high-dimensional correlation matrices. Comp. Statist. Data Anal 93, 390–403. [Google Scholar]
Cui Y, Pang J-S & Sen B (2018). Composite difference-max programs for modern statistical estimation problems. SIAM J. Optimiz. 28, 3344–74.
Dempster AP, Laird NM & Rubin DB (1977). Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. B 39, 1–38.
Fan J, Liao Y & Liu H (2016). An overview of the estimation of large covariance and precision matrices. Economet. J. 19, C1–32.
Fang Y, Wang B & Feng Y (2016). Tuning-parameter selection in regularized estimations of large covariance matrices. J. Statist. Comp. Simul. 86, 494–509.
Friedman J, Hastie T & Tibshirani R (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 432–41.
Higham NJ (2002). Accuracy and Stability of Numerical Algorithms. Philadelphia: Society for Industrial and Applied Mathematics.
Huang JZ, Liu N, Pourahmadi M & Liu L (2006). Covariance matrix selection and estimation via penalised normal likelihood. Biometrika 93, 85–98.
Karoui NE (2008). Operator norm consistent estimation of large-dimensional sparse covariance matrices. Ann. Statist. 36, 2717–56.
Keys KL, Zhou H & Lange K (2019). Proximal distance algorithms: Theory and practice. J. Mach. Learn. Res. 20, 1–38.
Lam C & Fan J (2009). Sparsistency and rates of convergence in large covariance matrix estimation. Ann. Statist. 37, 4254–78.
Lange K (2016). MM Optimization Algorithms. Philadelphia: Society for Industrial and Applied Mathematics.
Levina E, Rothman A & Zhu J (2008). Sparse estimation of large covariance matrices via a nested lasso penalty. Ann. Appl. Statist. 2, 245–63.
Luenberger DG & Ye Y (1984). Linear and Nonlinear Programming. Cham, Switzerland: Springer.
Mairal J (2015). Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM J. Optimiz. 25, 829–55.
Molstad AJ & Rothman AJ (2018). Shrinking characteristics of precision matrix estimators. Biometrika 105, 563–74.
Pang J-S, Razaviyayn M & Alvarado A (2017). Computing B-stationary points of nonsmooth DC programs. Math. Oper. Res. 42, 95–118.
Polson NG, Scott JG & Willard BT (2015). Proximal algorithms in statistics and machine learning. Statist. Sci. 30, 559–81.
Pourahmadi M (2011). Covariance estimation: The GLM and regularization perspectives. Statist. Sci. 26, 369–87.
R Development Core Team (2022). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0, http://www.R-project.org.
Rothman AJ (2012). Positive definite estimators of large covariance matrices. Biometrika 99, 733–40.
Rothman AJ, Levina E & Zhu J (2009). Generalized thresholding of large covariance matrices. J. Am. Statist. Assoc. 104, 177–86.
Sachs K, Perez O, Pe'er D, Lauffenburger DA & Nolan GP (2005). Causal protein-signaling networks derived from multiparameter single-cell data. Science 308, 523–9.
Simoncini V (2016). Computational methods for linear matrix equations. SIAM Review 58, 377–441.
Stein C (1956). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proc. 3rd Berkeley Symp. Math. Statist. Prob., vol. 1. Berkeley, California: University of California Press, pp. 197–206.
Won J-H, Xu J & Lange K (2019). Projection onto Minkowski sums with application to constrained learning. Proc. Mach. Learn. Res. 97, 3642–51. Proc. 36th Int. Conf. Machine Learning.
Wu WB & Pourahmadi M (2003). Nonparametric estimation of large covariance matrices of longitudinal data. Biometrika 90, 831–44.
Xu J, Chi E & Lange K (2017). Generalized linear model regression under distance-to-set penalties. In Proc. 31st Int. Conf. Neural Information Processing Systems (NIPS'17). Red Hook, New York: Curran Associates, pp. 1385–94.
Xu J & Lange K (2019). Power k-means clustering. Proc. Mach. Learn. Res. 97, 6921–31. Proc. 36th Int. Conf. Machine Learning.
Xue L, Ma S & Zou H (2012). Positive-definite ℓ₁-penalized estimation of large covariance matrices. J. Am. Statist. Assoc. 107, 1480–91.
Yuan M & Lin Y (2007). Model selection and estimation in the Gaussian graphical model. Biometrika 94, 19–35.
Yuille AL & Rangarajan A (2003). The concave-convex procedure. Neural Comp. 15, 915–36.
Zangwill WI (1969). Nonlinear Programming: A Unified Approach. Englewood Cliffs, New Jersey: Prentice-Hall.
Zou H & Li R (2008). One-step sparse estimates in nonconcave penalized likelihood models. Ann. Statist. 36, 1509–33.

