Abstract
Estimation of high-dimensional covariance matrices is known to be a difficult problem, has many applications, and is of current interest to the larger statistics community. In many applications, including the so-called "large p, small n" setting, the estimate of the covariance matrix is required to be not only invertible but also well-conditioned. Although many regularization schemes attempt to do this, none of them address the ill-conditioning problem directly. In this paper, we propose a maximum likelihood approach, with the direct goal of obtaining a well-conditioned estimator. No sparsity assumption on either the covariance matrix or its inverse is imposed, thus making our procedure more widely applicable. We demonstrate that the proposed regularization scheme is computationally efficient, yields a type of Steinian shrinkage estimator, and has a natural Bayesian interpretation. We investigate the theoretical properties of the regularized covariance estimator comprehensively, including its regularization path, and proceed to develop an approach that adaptively determines the level of regularization that is required. Finally, we demonstrate the performance of the regularized estimator in decision-theoretic comparisons and in the financial portfolio optimization setting. The proposed approach has desirable properties and can serve as a competitive procedure, especially when the sample size is small and a well-conditioned estimator is required.
Keywords: covariance estimation, regularization, convex optimization, condition number, eigenvalue, shrinkage, cross-validation, risk comparisons, portfolio optimization
1 Introduction
We consider the problem of regularized covariance estimation in the Gaussian setting. It is well known that, given n independent samples x1, …, xn ∈ ℝp from a zero-mean p-variate Gaussian distribution, the sample covariance matrix

S = (1/n) Σ_{i=1}^{n} xi xiᵀ

maximizes the log-likelihood, which (up to an additive constant) is given by

ℓ(Σ; x1, …, xn) = −(n/2) log det Σ − (n/2) tr(Σ^{−1}S),   (1)
where det A and tr(A) denote the determinant and trace of a square matrix A, respectively. In recent years, the availability of high-throughput data from various applications has pushed this problem to an extreme in which the number of samples, n, is often much smaller than the dimension of the estimand, p. When n < p the sample covariance matrix S is singular, and hence cannot be inverted to compute the precision matrix (the inverse of the covariance matrix), which is also needed in many applications. Even when n > p, the eigenstructure of S tends to be systematically distorted unless p/n is extremely small, resulting in numerically ill-conditioned estimators of Σ; see Dempster (1972) and Stein (1975). For example, in mean-variance portfolio optimization (Markowitz, 1952), an ill-conditioned covariance matrix may amplify estimation error because the optimal portfolio involves matrix inversion (Ledoit and Wolf, 2003; Michaud, 1989). A common approach to mitigating the problem of numerical instability is regularization.
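The following minimal sketch (not from the paper; the dimensions and the identity covariance are arbitrary illustrative choices) demonstrates this behavior numerically: for n < p the smallest sample eigenvalue is zero and cond(S) is infinite, and even for n moderately larger than p the condition number remains far from that of the truth.

```python
# Illustration: the sample covariance matrix is singular when n < p and
# ill-conditioned unless p/n is small, even though the truth is the identity.
import numpy as np

rng = np.random.default_rng(0)
p = 50
Sigma = np.eye(p)                              # true covariance, cond(Sigma) = 1

for n in (25, 60, 500):
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    S = X.T @ X / n                            # MLE under the zero-mean model
    eigs = np.linalg.eigvalsh(S)               # ascending eigenvalues
    cond = eigs[-1] / eigs[0] if eigs[0] > 1e-12 else np.inf
    print(f"n = {n:4d}   smallest eigenvalue = {eigs[0]:9.3e}   cond(S) = {cond:9.3e}")
```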
In this paper, we propose regularizing the sample covariance matrix by explicitly imposing a constraint on the condition number.1 Instead of using the standard estimator S, we propose to solve the following constrained maximum likelihood (ML) estimation problem
maximize_Σ   ℓ(Σ; x1, …, xn)
subject to   cond(Σ) ≤ κmax,   (2)
where cond(M) stands for the condition number, a measure of numerical stability, of a matrix M (see Section 1.2 for details). The matrix M is invertible if cond(M) is finite, ill-conditioned if cond(M) is finite but high (say, greater than 10³ as a rule of thumb), and well-conditioned if cond(M) is moderate. By bounding the condition number of the estimate by a regularization parameter κmax, we directly address the problem of invertibility or ill-conditioning. This direct control is appealing because the true covariance matrix is in most situations unlikely to be ill-conditioned whereas its sample counterpart is often ill-conditioned. It turns out that the resulting regularized matrix falls into a broad family of Steinian-type shrinkage estimators that shrink the eigenvalues of the sample covariance matrix towards a given structure (James and Stein, 1961; Stein, 1956). Moreover, the regularization parameter κmax is adaptively selected from the data using cross validation.
Numerous authors have explored alternative estimators for Σ (or Σ−1) that perform better than the sample covariance matrix S from a decision-theoretic point of view. Many of these estimators give substantial risk reductions compared to S in small sample sizes, and often involve modifying the spectrum of the sample covariance matrix. A simple example is the family of linear shrinkage estimators, which take a convex combination of the sample covariance matrix and a suitably chosen target or regularization matrix. Notable in this area is the seminal work of Ledoit and Wolf (2004), who study linear shrinkage toward a specified target covariance matrix and choose the optimal shrinkage to minimize the Frobenius risk. Bayesian approaches often directly yield estimators which shrink toward a structure associated with a pre-specified prior. Standard Bayesian covariance estimators yield a posterior mean of Σ that is a linear combination of S and the prior mean. It is easy to show that the eigenvalues of such estimators are themselves linear shrinkage estimators of the eigenvalues of Σ; see, e.g., Haff (1991). Other nonlinear Steinian-type estimators have also been proposed in the literature. James and Stein (1961) study a constant risk minimax estimator and its modification in a class of orthogonally invariant estimators. Dey and Srinivasan (1985) provide another minimax estimator which dominates the James-Stein estimator. Yang and Berger (1994) and Daniels and Kass (2001) consider a reference prior and hierarchical priors, respectively, both of which yield posterior shrinkage.
Likelihood-based approaches using multivariate Gaussian models have provided different perspectives on the regularization problem. Warton (2008) derives a novel family of linear shrinkage estimators from a penalized maximum likelihood framework. This formulation enables cross-validation of the regularization parameter, which we discuss in Section 3 for the proposed method. Related work in the area includes Sheena and Gupta (2003), Pourahmadi et al. (2007), and Ledoit and Wolf (2012). An extensive literature review is not undertaken here, but we note that the approaches mentioned above (and the one proposed in this paper) fall in the class of covariance estimation and related problems which do not assume or impose sparsity on either the covariance matrix or its inverse (for such approaches in the frequentist, Bayesian, or testing frameworks, the reader is referred to Banerjee et al. (2008); Friedman et al. (2008); Hero and Rajaratnam (2011, 2012); Khare and Rajaratnam (2011); Letac and Massam (2007); Peng et al. (2009); Rajaratnam et al. (2008)).
1.1 Regularization by shrinking sample eigenvalues
We briefly review Steinian-type eigenvalue shrinkage estimators in this subsection. Dempster (1972) and Stein (1975) noted that the eigenstructure of the sample covariance matrix S tends to be systematically distorted unless p/n is extremely small. They observed that the larger eigenvalues of S are overestimated whereas the smaller ones are underestimated. This observation led to estimators which directly modify the spectrum of the sample covariance matrix and are designed to “shrink” the eigenvalues together. Let li, i = 1, …, p, denote the eigenvalues of the sample covariance matrix (sample eigenvalues) in nonincreasing order (l1 ≥ … ≥ lp ≥ 0). The spectral decomposition of the sample covariance matrix is given by
S = Q diag(l1, …, lp) Qᵀ,   (3)
where diag(l1, …, lp) is the diagonal matrix with diagonal entries li and Q ∈ ℝp×p is the orthogonal matrix whose i-th column is the eigenvector corresponding to the eigenvalue li. As discussed above, a large number of covariance estimators regularize S by modifying its eigenvalues, with the explicit goal of better estimating the eigenspectrum. In this light, Stein (1975) proposed the class of orthogonally invariant estimators of the following form:
Σ̂ = Q diag(λ̂1, …, λ̂p) Qᵀ.   (4)
Typically, these estimators shrink the sample eigenvalues so that the modified eigenspectrum is less spread out than that of the sample covariance matrix. In many estimators, the shrunk eigenvalues are required to maintain the same ordering as those of S: λ̂1 ≥ ··· ≥ λ̂p.
One well-known example of Steinian-type shrinkage estimators is the linear shrinkage estimator as given by
Σ̂LS = (1 − δ) S + δF,   δ ∈ [0, 1],   (5)
where the target matrix is F = cI for some c > 0 (Ledoit and Wolf, 2004; Warton, 2008). For the linear estimator, the relationship between the sample eigenvalues li and the modified eigenvalues λ̂i is affine: λ̂i = (1 − δ) li + δc, i = 1, …, p.
Another example is Stein’s estimator (Stein, 1975, 1986), denoted by Σ̂Stein, which is given by λ̂i = li/di, i = 1, …, p, with di = (n − p + 1 + 2li Σj≠i (li − lj)^{−1})/n. The original ordering of the estimates is preserved by applying isotonic regression (Lin and Perlman, 1985).
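As a concrete illustration of these two shrinkage rules, the sketch below computes the linearly shrunk eigenvalues and Stein's raw modified eigenvalues li/di. The function names are ours, the eigenvalues and the sample size n = 100 are illustrative, and the isotonic-regression step of Lin and Perlman (1985) is omitted.

```python
# Two Steinian eigenvalue modifications, directly from the formulas above:
# the affine map of the linear shrinkage estimator and Stein's l_i / d_i
# (shown without the isotonization that restores the ordering).
import numpy as np

def linear_shrinkage_eigs(l, delta, c):
    """Affine shrinkage of the sample eigenvalues: (1 - delta) * l_i + delta * c."""
    return (1.0 - delta) * np.asarray(l, dtype=float) + delta * c

def stein_raw_eigs(l, n):
    """Stein's modified eigenvalues l_i / d_i (may be unordered or negative
    in small samples; isotonic regression is applied in practice)."""
    l = np.asarray(l, dtype=float)
    p = l.size
    lam = np.empty(p)
    for i in range(p):
        d_i = (n - p + 1 + 2.0 * l[i] * np.sum(1.0 / (l[i] - np.delete(l, i)))) / n
        lam[i] = l[i] / d_i
    return lam

l = np.array([21.0, 7.0, 5.25, 3.5, 3.0])      # illustrative sample eigenvalues
print(linear_shrinkage_eigs(l, delta=0.3, c=l.mean()))
print(stein_raw_eigs(l, n=100))
```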
1.2 Regularization by imposing a condition number constraint
Now we proceed to introduce the condition number-regularized covariance estimator proposed in this paper. Recall that the condition number of a positive definite matrix Σ is defined as

cond(Σ) = λmax(Σ)/λmin(Σ),

where λmax(Σ) and λmin(Σ) are the maximum and the minimum eigenvalues of Σ, respectively (with the convention that cond(Σ) = ∞ if λmin(Σ) = 0). The condition number-regularized covariance estimation problem (2) can therefore be formulated as
maximize_Σ   ℓ(Σ; x1, …, xn)
subject to   λmax(Σ)/λmin(Σ) ≤ κmax.   (6)
An implicit condition is that Σ be symmetric and positive definite2.
The estimation problem (6) can be reformulated as a convex optimization problem in the matrix variable Σ−1 (see Section 2). Standard methods such as interior-point methods can solve the convex problem when the number of variables (i.e., entries in the matrix) is modest, say, under 1000. Since the number of variables is about p(p + 1)/2, the limit is around p = 45. Such a naive approach is not adequate for moderate to high dimensional problems. One of the main contributions of this paper is a significant improvement on the solution method for (6) so that it scales well to much larger sizes. In particular, we show that (6) reduces to an unconstrained univariate optimization problem. Furthermore, the solution to (6), denoted by Σ̂cond, has a Steinian-type shrinkage of the form as in (4) with eigenvalues given by
λ̂i = min{ max(li, τ★), κmaxτ★ },   i = 1, …, p,   (7)
for some τ★ > 0. Note that the modified eigenvalues are nonlinear functions of the sample eigenvalues, meet the order constraint of Section 1.1, and even when n < p, the nonlinear shrinkage estimator Σ̂cond is well-conditioned. The quantity τ★ is determined adaptively from the data and the choice of the regularization parameter κmax. This solution method was first considered by two of the authors of this paper in a previous conference proceeding (Won and Kim, 2006). In this paper, we give a formal proof of the assertion that the matrix optimization problem (6) reduces to an equivalent unconstrained univariate minimization problem. We further elaborate on the proposed method by showing rigorously that τ★ can be found exactly and easily with computational effort of order O(p) (Section 2.1).
The nonlinear shrinkage in (7) has a simple interpretation: the eigenvalues of the estimator Σ̂cond are obtained by truncating the sample eigenvalues larger than κmaxτ★ to κmaxτ★ and those smaller than τ★ to τ★. Figure 1 illustrates the functional form of (7) in comparison with that of the linear shrinkage estimator. This novel shrinkage form is rather surprising, because the original motivation of the regularization in (6) is to numerically stabilize the covariance estimate. Note that this type of shrinkage better preserves the eccentricity of the sample estimate than the linear shrinkage which shrinks it toward a spherical covariance matrix.
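The sketch below makes this truncation concrete (the value of τ★ used here is an arbitrary illustrative choice; in the actual procedure τ★ is computed from the data as described in Section 2.1).

```python
# The nonlinear shrinkage of (7): clip the sample eigenvalues to the
# interval [tau_star, kappa_max * tau_star].
import numpy as np

def cond_reg_eigs(l, tau_star, kappa_max):
    return np.clip(np.asarray(l, dtype=float), tau_star, kappa_max * tau_star)

l = np.array([21.0, 7.0, 5.25, 3.5, 3.0])
print(cond_reg_eigs(l, tau_star=4.0, kappa_max=3.0))   # approx. [12. 7. 5.25 4. 4.]
```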
Other major contributions of the paper include a detailed analysis of the regularization path of the shrinkage enabled by a geometric perspective of the condition number-regularized estimator (Section 2.2), and an adaptive selection procedure for choosing the regularization parameter κmax under the maximum predictive likelihood criterion (Section 3) and properties thereof.
We also undertake a Bayesian analysis of the condition number constrained estimator (Section 4). A detailed application of the methodology to real data, which demonstrates the usefulness of the method in a practical setting, is also provided (Section 6). In particular, Section 6 studies the use of the proposed estimator in the mean-variance portfolio optimization setting. Section 5 undertakes a simulation study in order to compare the risk of the condition number-regularized estimator to those of others. Risk comparisons in higher dimensional settings (as compared to those given in the preliminary conference paper) are provided in this Section. Asymptotic properties of the estimator are also investigated in Section 5.
2 Condition number-regularized covariance estimation
2.1 Derivation and characterization of solution
This section gives the derivation of the solution (7) of the condition number-regularized covariance estimation problem as given in (6) and shows how to compute τ★ given κmax. Note that it suffices to consider the case κmax < l1/lp = cond(S), since otherwise, the condition number constraint is already satisfied and the solution to (6) simply reduces to the sample covariance matrix S.
It is well known that the negative of the log-likelihood (1) of a multivariate Gaussian covariance matrix is a convex function of Ω = Σ−1. Note that Ω is the canonical parameter for the (Wishart) natural exponential family associated with the likelihood in (1). Since cond(Σ) = cond(Ω), it can be shown that the condition number constraint on Ω is equivalent to the existence of u > 0 such that uI ⪯ Ω ⪯ κmaxuI, where A ⪯ B denotes that B − A is positive semidefinite (Boyd and Vandenberghe, 2004, Chap. 7). Therefore the covariance estimation problem (6) is equivalent to
minimize_{Ω, u}   tr(SΩ) − log det Ω
subject to   uI ⪯ Ω ⪯ κmaxuI,   (8)
where the variables are a symmetric positive definite p×p matrix Ω and a scalar u > 0. The above problem in (8) is a convex optimization problem with p(p + 1)/2 + 1 variables, i.e., O(p²).
The following lemma shows that, by exploiting the structure of the problem, it can be reduced to a univariate convex problem, i.e., the dimension of the system is only O(1) as compared to O(p²).
Lemma 1
The optimization problem (8) is equivalent to the unconstrained univariate convex optimization problem

minimize_{u > 0}   Jκmax(u),   (9)

where

Jκmax(u) = Σ_{i=1}^{p} Jκmax^(i)(u),   with
Jκmax^(i)(u) = κmax li u − log(κmax u)   if u < 1/(κmax li),
             = 1 + log li               if 1/(κmax li) ≤ u ≤ 1/li,
             = li u − log u             if u > 1/li,

and

λ̂i(u) = min{ max(li, 1/(κmax u)), 1/u },   i = 1, …, p,   (10)

in the sense that the solution to (8) is a function of the solution u★ to (9), as follows:

Σ̂cond = Q diag( λ̂1(u★), …, λ̂p(u★) ) Qᵀ,

with Q defined as in (3).
Proof
The proof is given in the Appendix.
Characterization of the solution to (9) is given by the following theorem.
Theorem 1
Given κmax ≤ cond(S), the optimization problem (9) has a unique solution given by
u★ = (α★ + p − β★ + 1) / ( Σ_{i=1}^{α★} li + κmax Σ_{i=β★}^{p} li ),   (11)
where α★ ∈ {1, …, p} is the largest index such that 1/lα★ < u★ and β★ ∈ {1, …, p} is the smallest index such that 1/lβ★ > κmaxu★. The quantities α★ and β★ are not determined a priori but can be found in O(p) operations on the sample eigenvalues l1 ≥ … ≥ lp. If κmax > cond(S), the minimizer u★ is not unique, but Σ̂cond = S for all the minimizing values of u.
Proof
The proof is given in the Appendix.
Comparing (10) to (7), it is immediately seen that
τ★ = 1/(κmaxu★) = ( κmax^{−1} Σ_{i=1}^{α★} li + Σ_{i=β★}^{p} li ) / ( α★ + p − β★ + 1 ).   (12)
Note that the lower cutoff level τ★ is an average of the (scaled and) truncated eigenvalues, in which the eigenvalues above the upper cutoff level κmaxτ★ are shrunk by a factor of 1/κmax.
We highlight the fact that the reformulation of the original matrix optimization problem as a univariate optimization problem makes the proposed methodology very attractive in high dimensional settings. The method is limited only by the complexity of the spectral decomposition of the sample covariance matrix (or the singular value decomposition of the data matrix). The approach proposed here is therefore much faster than using interior point methods. We also note that the condition number-regularized covariance estimator is orthogonally invariant: if Σ̂cond is the estimator of the true covariance matrix Σ, then UΣ̂condUᵀ is the estimator of the true covariance matrix UΣUᵀ, for U orthogonal.
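The estimator is therefore straightforward to compute. The sketch below (the function name is ours, and SciPy is assumed available) follows this recipe — eigendecomposition of S, solution of the univariate problem, and clipping of the eigenvalues as in (7) — except that, for simplicity, the convex objective Jκmax of (9) is minimized numerically instead of using the closed form of Theorem 1 and its O(p) search.

```python
# Condition-number-regularized covariance estimate: minimize the univariate
# convex objective J_kappa(u) of (9) numerically, set tau* = 1/(kappa_max u*),
# and clip the sample eigenvalues to [tau*, kappa_max * tau*] as in (7).
import numpy as np
from scipy.optimize import minimize_scalar

def cond_reg_cov(S, kappa_max):
    l, Q = np.linalg.eigh(S)
    l, Q = np.maximum(l[::-1], 0.0), Q[:, ::-1]        # l_1 >= ... >= l_p >= 0
    if l[-1] > 0 and l[0] / l[-1] <= kappa_max:
        return S                                        # constraint inactive

    def J(u):
        mu = np.clip(1.0 / np.maximum(l, 1e-300), u, kappa_max * u)
        return float(np.sum(l * mu - np.log(mu)))

    lo, hi = 1.0 / (kappa_max * l[0]), l.size / l[0]    # interval containing u*
    u_star = minimize_scalar(J, bounds=(lo, hi), method="bounded").x
    tau = 1.0 / (kappa_max * u_star)
    lam = np.clip(l, tau, kappa_max * tau)              # regularized eigenvalues
    return (Q * lam) @ Q.T
```

For example, cond_reg_cov(S, 10.0) returns an estimate whose condition number is at most 10, up to the tolerance of the one-dimensional search.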
2.2 A geometric perspective on the regularization path
In this section, we shall show that a simple relaxation of the optimization problem (8) provides an intuitive geometric perspective on the condition number-regularized estimator. Consider the function
J(u, v) ≜ min{ tr(SΩ) − log det Ω : uI ⪯ Ω ⪯ vI },   (13)
defined as the minimum of the objective of (8) over the range uI ⪯ Ω ⪯ vI, where 0 < u ≤ v. Note that the relaxation in (13) differs from the original problem in the sense that the optimization is no longer with respect to the variable u; in particular, u and v are fixed in (13). By fixing u, the problem has been significantly simplified. Paralleling Lemma 1, it is easily shown that

J(u, v) = Σ_{i=1}^{p} min_{u ≤ μ ≤ v} ( li μ − log μ ),

with the minimum attained at Ω★ = Q diag(μ★1, …, μ★p) Qᵀ, where μ★i = min{ max(1/li, u), v }.
Recall that the truncation range of the sample eigenvalues is therefore given by (1/v, 1/u). Now for given u, let α ∈ {1, …, p} be the largest index such that lα > 1/u and β ∈ {1,…, p} be the smallest index such that lβ < 1/v, i.e., the set of indexes where truncation at either end of the spectrum first starts to become a binding constraint. With this convention it is easily shown that J(u, v) can now be expressed in simpler form:
J(u, v) = Σ_{i=1}^{α} ( li u − log u ) + Σ_{i=α+1}^{β−1} ( 1 + log li ) + Σ_{i=β}^{p} ( li v − log v )   (14)

        = Σ_{i=1}^{p} ( li/λ★i + log λ★i ),   (15)

where

λ★i = min{ max(li, 1/v), 1/u },   i = 1, …, p,   so that   (Ω★)^{−1} = Q diag(λ★1, …, λ★p) Qᵀ.   (16)
Comparing (16) to (10), we observe that (Ω*)−1, the covariance estimate whose inverse achieves the minimum in (13), is obtained by truncating the eigenvalues of S greater than 1/u to 1/u, and less than 1/v to 1/v.
Furthermore, note that the function J(u, v) has the following properties:
J does not increase as u decreases and v increases simultaneously. This follows from noting that simultaneously decreasing u and increasing v expands the domain of the minimization in (13).
J(u, v) = J(1/l1, 1/lp) for u ≤ 1/l1 and v ≥ 1/lp. Hence J(u, v) is constant on this part of the domain. For these values of u and v, (Ω★)−1 = S.
J(u, v) is almost everywhere differentiable in the interior of the domain {(u, v) : 0 < u ≤ v}, except on the lines u = 1/l1, …, 1/lp and v = 1/l1, …, 1/lp. This follows from noting that the indexes α and β change their values only on these lines. Hence the contributions to the three summands in (14) change at these values.
We can now see the following obvious relation between the function J(u, v) and the original problem (8): the solution to (8) as given by u★ is the minimizer of J(u, v) on the line v = κmaxu, i.e., the original univariate optimization problem (9) is equivalent to minimizing J(u, κmaxu). We denote this minimizer by u★(κmax) and investigate how u★(κmax) behaves as κmax varies. The following proposition proves that u★(κmax) is monotone in κmax. This result sheds light on the regularization path, i.e., the solution path of u★ as κmax varies.
Proposition 1
u★(κmax) is nonincreasing in κmax and v★(κmax) ≜ κmaxu★(κmax) is nondecreasing, both almost surely.
Proof
The proof is given in the Appendix.
Remark
The above proposition has a very natural and intuitive interpretation: when the constraint on the condition number is relaxed to allow a higher value of κmax, the gap between u★ and v★ widens so that the ratio of v★ to u★ remains at κmax. This implies that as κmax is increased, the lower limit u★ decreases and the upper limit v★ increases. Proposition 1 can be equivalently interpreted by noting that the optimal truncation ranges (τ★(κmax), κmaxτ★(κmax)) of the sample eigenvalues are nested as κmax varies.
In light of the solution to the condition number-regularized covariance estimation problems in (7), Proposition 1 also implies that once an eigenvalue li is truncated for κmax = ν0, then it remains truncated for all κmax < ν0. Hence the regularized eigenvalue estimates are not only continuous, but they are also monotonic in the sense that they approach either end of the truncation spectrum as the regularization parameter κmax is decreased to 1. This monotonicity of the condition number-regularized estimates gives a desirable property that is not always enjoyed by other regularized estimators, such as the lasso for instance (Personal communications: Jerome Friedman, Department of Statistics, Stanford University).
With the above theoretical knowledge of the regularization path of the sample eigenvalues, we now provide an illustration of the regularization path; see Figure 2. More specifically, consider the path of the optimal point (u★(κmax), v★(κmax)) in the u–v plane from (u★(1), u★(1)) to (1/l1, 1/lp) as κmax varies from 1 to cond(S). The left panel shows the path of (u★(κmax), v★(κmax)) in the u–v plane for the case where the sample eigenvalues are (21, 7, 5.25, 3.5, 3). Here a point on the path represents the minimizer of J(u, v) on a line v = κmaxu (hollow circle). The path starts from a point on the solid line v = u (κmax = 1, square) and ends at (1/l1, 1/lp), which lies on the dashed line v = cond(S) · u (κmax = cond(S), solid circle). Note that the starting point (κmax = 1) corresponds to Σ̂cond = γI for some data-dependent γ > 0 and the end point (κmax = cond(S)) to Σ̂cond = S. When κmax > cond(S), the minimum is attained at multiple values of u★ in the shaded region above the dashed line, all of which nevertheless yield the same estimator S. The right panel of Figure 2 shows how the eigenvalues of the estimated covariance matrix vary as a function of κmax. Here we see that as the constraint is made stricter the eigenvalue estimates move monotonically toward one another. Furthermore, the truncation range of the eigenvalues simultaneously shrinks and remains nested.
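The following sketch reproduces this behavior numerically for the example eigenvalues (21, 7, 5.25, 3.5, 3) used above; a crude grid search over u stands in for the closed-form solution of Theorem 1. As κmax grows, the truncation intervals widen and remain nested.

```python
# Regularization path of the truncation range [tau*, kappa_max * tau*] for
# the example sample eigenvalues of Figure 2.
import numpy as np

l = np.array([21.0, 7.0, 5.25, 3.5, 3.0])

def tau_star(l, kappa, grid_size=20000):
    u = np.linspace(1.0 / (kappa * l[0]), 1.0 / l[-1], grid_size)   # grid for u
    mu = np.clip(1.0 / l[None, :], u[:, None], kappa * u[:, None])
    J = np.sum(l[None, :] * mu - np.log(mu), axis=1)                # objective of (9)
    return 1.0 / (kappa * u[np.argmin(J)])

for kappa in (1.0, 2.0, 4.0, 7.0):
    t = tau_star(l, kappa)
    print(f"kappa_max = {kappa:3.1f}   range = [{t:6.3f}, {kappa * t:6.3f}]  ",
          np.clip(l, t, kappa * t))
```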
3 Selection of regularization parameter κmax
Sections 2.1 and 2.2 have already discussed how the optimal truncation range (τ★, κmaxτ★) is determined for a given regularization parameter κmax, and how it varies with the value of κmax. This section proposes a data-driven (or adaptive) criterion for selecting the optimal κmax and undertakes an analysis of this approach.
3.1 Predictive risk selection procedure
A natural approach is to select κmax that minimizes the predictive risk, or the expected negative predictive likelihood as given by
PR(ν) = E[ −log f( X̃ ; Σ̂ν ) ],   (17)
where Σ̂ν is the condition number-regularized covariance matrix estimated from the independent samples x1, …, xn with the value of the regularization parameter κmax set to ν, X̃ ∈ ℝp is a random vector from the same distribution and independent of the given samples, and f(·; Σ) denotes the zero-mean p-variate Gaussian density with covariance Σ. We approximate the predictive risk using K-fold cross validation. The K-fold cross validation approach divides the data into K groups X1, …, XK, with nk observations in the k-th group, k = 1, …, K. For the k-th iteration, each observation in the k-th group Xk plays the role of X̃ in (17), and the remaining K − 1 groups are used together to estimate the covariance matrix, denoted by Σ̂ν[−k]. The approximation of the predictive risk using the k-th group then reduces to the (negative) predictive log-likelihood, up to an additive constant,

PR̂k(ν) = (1/nk) Σ_{xi ∈ Xk} [ ½ log det Σ̂ν[−k] + ½ xiᵀ ( Σ̂ν[−k] )^{−1} xi ].
The estimate of the predictive risk is then defined as
PR̂(ν) = (1/K) Σ_{k=1}^{K} PR̂k(ν).   (18)
The optimal value for the regularization parameter κmax is selected as the value of ν that minimizes (18), i.e.,

κ̂max = inf argmin_{ν ≥ 1} PR̂(ν).

Note that the outer infimum is taken since Σ̂ν[−k] is constant for ν ≥ cond(S[−k]), where S[−k] is the sample covariance matrix based on the K − 1 groups remaining in the k-th fold.
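A minimal sketch of this selection procedure follows. The helper names are ours, the constant term of the Gaussian log-likelihood is dropped, and the estimator argument may be any map (S, ν) ↦ Σ̂ν, for instance the cond_reg_cov sketch of Section 2.1. Because argmin returns the first minimizer of an ascending grid, the smallest minimizing ν is selected, mirroring the infimum above.

```python
# K-fold cross-validation choice of kappa_max: score each candidate by the
# average held-out (negative) Gaussian predictive log-likelihood of (18).
import numpy as np

def neg_predictive_loglik(Sigma_hat, X_test):
    sign, logdet = np.linalg.slogdet(Sigma_hat)
    quad = np.einsum("ij,jk,ik->i", X_test, np.linalg.inv(Sigma_hat), X_test)
    return 0.5 * (logdet + quad.mean())           # constant term omitted

def select_kappa(X, kappa_grid, estimator, K=5):
    folds = np.array_split(np.arange(X.shape[0]), K)
    scores = np.zeros(len(kappa_grid))
    for k in range(K):
        test = folds[k]
        train = np.hstack([folds[j] for j in range(K) if j != k])
        S_train = X[train].T @ X[train] / train.size
        for j, kappa in enumerate(kappa_grid):
            scores[j] += neg_predictive_loglik(estimator(S_train, kappa), X[test]) / K
    return kappa_grid[int(np.argmin(scores))]
```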
3.2 Properties of the optimal regularization parameter
We proceed to investigate the behavior of the selected regularization parameter κ̂max, both theoretically and in simulations. We first note that κ̂max is a consistent estimator for the true condition number κ. This fact is expressed below as one of the several properties of κ̂max:
-
(P1)
For a fixed p, as n increases, κ̂max approaches κ in probability, where κ is the condition number of the true covariance matrix Σ.
This is stated formally below:
Theorem 2
For a given p, κ̂max → κ in probability as n → ∞.
Proof
The proof is given in Supplemental Section A.
We further investigate the behavior of κ̂max in simulations. To this end, consider iid samples from a zero-mean p-variate Gaussian distribution with the following covariance matrices:
Identity matrix in ℝp.
diag(1, r, r², …, r^p), with condition number 1/r^p = 5.
diag(1, r, r², …, r^p), with condition number 1/r^p = 400.
Toeplitz matrix whose (i, j)th element is 0.3^|i−j| for i, j = 1, …, p.
We consider different combinations of sample sizes and dimensions of the problem as given by n = 20, 80, 320 and p = 5, 20, 80. For each of these cases 100 replicates are generated and κ̂max computed with 5-fold cross validation. The behavior of κ̂max is plotted in Figure 3 and leads to insightful observations. A summary of the properties of the κ̂max is given in (P2)–(P4) below:
-
(P2)
If the condition number of the true covariance matrix remains finite as p increases, then for a fixed n, κ̂max decreases.
-
(P3)
If the condition number of the true covariance matrix remains finite as p increases, then for a fixed n, κ̂max converges to 1.
-
(P4)
The variance of κ̂max decreases as either n or p increases.
These properties are analogous to those of the optimal regularization parameter δ̂ for the linear shrinkage estimator (5), found using a similar predictive risk criterion (Warton, 2008).
4 Bayesian interpretation
In the same spirit as the Bayesian posterior mode interpretation of the LASSO (Tibshirani, 1996), we can draw parallels for the condition number regularized covariance estimator. The condition number constraint given by λ1(Σ)/λp(Σ) ≤ κmax is similar to adding a penalty term gκmax(λ1(Σ)/λp(Σ)) to the likelihood equation for the eigenvalues:

minimize_{λ1 ≥ ··· ≥ λp > 0}   Σ_{i=1}^{p} ( log λi + li/λi ) + gκmax(λ1/λp).
The above expression allows us to qualitatively interpret the condition number-regularized estimator as the Bayes posterior mode under the following prior
(19) |
for the eigenvalues, and an independent Haar measure on the Stiefel manifold, as the prior for the eigenvectors. The aforementioned prior on the eigenvalues has useful interesting properties which help to explain the type of eigenvalue truncation described in previous sections. We note that the prior is improper but the posterior is always proper.
Proposition 2
The prior on the eigenvalues in (19) is improper, whereas the posterior yields a proper distribution. More formally,

∫_C π(λ) dλ = ∞,

and

∫_C π(λ | x1, …, xn) dλ < ∞,
where λ = (λ1,…, λp) and C = {λ : λ1 ≥ ··· ≥ λp > 0}.
Proof
The proof is given in Supplemental Section A.
The prior above puts the greatest mass around the region {λ ∈ ℝp : λ1 = ··· = λp} which consequently encourages “shrinking” or “pulling” the eigenvalues closer together. Note that the support of both the prior and the posterior is the entire space of ordered eigenvalues. Hence the prior simply by itself does not immediately yield a hard constraint on the condition number. Evaluating the posterior mode yields an estimator that satisfies the condition number constraint.
A clear picture of the regularization achieved by the prior above and its potential for “eigenvalue shrinkage” emerges when it is compared to other priors suggested in the literature and the corresponding Bayes estimators. The standard MLE corresponds, of course, to a completely flat prior on the constrained space C. A commonly used inverse Wishart conjugate prior, Σ−1 ~ Wishart(m, cI), yields a posterior mode which is a linear shrinkage estimator (5) with δ = m/(n + m). Note however that the coefficients of the combination do not depend on the data X, and are a function only of the sample size n and the degrees of freedom or shape parameter of the prior, m. A useful prior for covariance matrices that yields a data-dependent posterior mode is the reference prior proposed by Yang and Berger (1994), whose density is inversely proportional to the determinant of the covariance matrix and to the spread of its eigenvalues, as given by πR(Σ) ∝ { λ1 ⋯ λp Π_{i<j} (λi − λj) }^{−1}; this prior also encourages shrinkage of the eigenvalues. The posterior mode under the reference prior can likewise be formulated as a penalized likelihood problem for the eigenvalues, analogous to that of condition number regularization.
An examination of the penalty implied by the reference prior suggests that there is no direct penalty on the condition number. In Supplemental Section B we illustrate the density of the priors discussed above in the two-dimensional case. In particular, the density of the condition number regularization prior places more emphasis on the line λ1 = λ2 thus “squeezing” the eigenvalues together. This is in direct contrast with the inverse Wishart or reference priors where this shrinkage effect is not as pronounced.
5 Decision-theoretic risk properties
5.1 Asymptotic properties
We now show that the condition number-regularized estimator Σ̂cond has asymptotically lower risk than the sample covariance matrix S with respect to entropy loss. Recall that the entropy loss, also known as Stein’s loss, is defined as follows:
L_ent(Σ̂, Σ) = tr( Σ̂Σ^{−1} ) − log det( Σ̂Σ^{−1} ) − p.   (20)
Let λ1, …, λp, with λ1 ≥ ··· ≥ λp, denote the eigenvalues of the true covariance matrix Σ and let Λ = diag(λ1, …, λp). Define λ = (λ1, …, λp) and κ = λ1/λp.
First consider the trivial case when p > n. In this case, the sample covariance matrix S is singular regardless of Σ, and L_ent(S, Σ) = ∞, whereas the loss and therefore the risk of Σ̂cond are always finite. Thus, Σ̂cond has smaller entropy risk than S.
For p ≤ n, if the true covariance matrix has a finite condition number, it can be shown that for a properly chosen κmax, the condition number-regularized estimator asymptotically dominates the sample covariance matrix. This assertion is formalized below.
Theorem 3
Consider a class of covariance matrices 𝒞(κ, ω) whose condition numbers are bounded above by κ and whose minimum eigenvalues are bounded below by ω > 0, i.e.,

𝒞(κ, ω) = { Σ ≻ 0 : λmax(Σ)/λmin(Σ) ≤ κ,  λmin(Σ) ≥ ω }.
Then, the following results hold.
-
Consider the quantity Σ̃(κmax, ω) = Q diag(λ̃1, …, λ̃p) Qᵀ, where

λ̃i = min{ max(li, ω), κmaxω },   i = 1, …, p,

and the sample covariance matrix is given as S = Q diag(l1, …, lp) Qᵀ, with Q orthogonal and l1 ≥ … ≥ lp. If the true covariance matrix Σ ∈ 𝒞(κmax, ω), then for all n, Σ̃(κmax, ω) has a smaller entropy risk than S.
-
Consider a true covariance matrix Σ whose condition number is bounded above by κ, i.e., Σ ∈ ∪ω>0 𝒞(κ, ω). If , then as p/n → γ ∈ (0, 1),
where τ★ = τ★(κmax) is given in (12).
Proof
The proof is given in Supplemental Section A.
Combining the two results above, we conclude that the estimator Σ̂cond = Σ̃(κmax, τ★(κmax)) asymptotically has a lower entropy risk than the sample covariance matrix.
5.2 Finite sample performance
This section undertakes a simulation study in order to compare the finite-sample risks of the condition number-regularized estimator Σ̂cond with those of the sample covariance matrix (S) and the linear shrinkage estimator (Σ̂LS) in the “large p, small n” setting. The regularization parameter δ for Σ̂LS is chosen as prescribed in Warton (2008). The optimal Σ̂cond is calculated using the adaptive parameter selection method outlined in Section 3. Since Σ̂cond and Σ̂LS both select their regularization parameters similarly, i.e., by minimizing the empirical predictive risk (18), a meaningful comparison between the two estimators can be made. We consider two loss functions traditionally used in covariance estimation risk comparisons: (a) the entropy loss L_ent as given in (20), and (b) the quadratic loss L_Q(Σ̂, Σ) = tr( (Σ̂Σ^{−1} − I)² ).
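For reference, the two losses are straightforward to compute; the sketch below is a direct transcription of these definitions (the function names are ours).

```python
# Entropy (Stein's) loss from (20) and the quadratic loss used in Section 5.2.
import numpy as np

def entropy_loss(Sigma_hat, Sigma):
    A = Sigma_hat @ np.linalg.inv(Sigma)
    sign, logdet = np.linalg.slogdet(A)
    return float(np.trace(A) - logdet - A.shape[0])

def quadratic_loss(Sigma_hat, Sigma):
    A = Sigma_hat @ np.linalg.inv(Sigma) - np.eye(Sigma.shape[0])
    return float(np.trace(A @ A))
```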
Condition number regularization applies shrinkage to both ends of the sample eigenvalue spectrum and does not affect the middle part, whereas linear shrinkage acts on the entire spectrum uniformly. Therefore, it is expected that Σ̂cond works well when a small proportion of eigenvalues lie at the extremes. Such situations arise very naturally when only a few eigenvalues explain most of the variation in the data. To understand the performance of the estimators in this context, the following scenarios were investigated. We consider diagonal matrices of dimensions p = 120, 250, 500 as true covariance matrices. The eigenvalues (diagonal elements) are dichotomous: the “high” values are (1 − ρ) + ρp and the “low” values are 1 − ρ. For each p, we vary the composition of the diagonal elements such that the high values account for only one (a singleton), 10%, 20%, 30%, or 40% of the p eigenvalues. The sample size n is chosen so that γ = p/n is approximately 1.25, 2, or 4. Note that for a given p, the condition number of the true covariance matrix is held fixed at 1 + pρ/(1 − ρ) regardless of the composition of eigenvalues. For each simulation scenario, we generate 1000 data sets and compute 1000 estimates of the true covariance matrix. The risks are calculated by averaging the losses over these 1000 estimates.
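The true covariance matrices of this design can be generated as in the following sketch (the helper name and the frac_high parameterization are ours).

```python
# Diagonal true covariance with a dichotomous spectrum: "high" eigenvalues
# (1 - rho) + rho * p and "low" eigenvalues 1 - rho, giving
# cond(Sigma) = 1 + p * rho / (1 - rho) regardless of the composition.
import numpy as np

def dichotomous_cov(p, rho, frac_high):
    n_high = max(1, int(round(frac_high * p)))   # frac_high = 0 gives the singleton case
    eigs = np.r_[np.full(n_high, (1 - rho) + rho * p),
                 np.full(p - n_high, 1 - rho)]
    return np.diag(eigs)

Sigma = dichotomous_cov(p=120, rho=0.1, frac_high=0.1)
```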
Figure 4 presents the results for ρ = 0.1, which corresponds to a large condition number. It is observed that in general Σ̂cond has lower risk than Σ̂LS, which in turn has lower risk than the sample covariance matrix (the entropy loss is not defined for the sample covariance matrix when it is singular). This phenomenon is clearest when the eigenvalue spectrum has a singleton at the extreme. In this case, Σ̂cond gives a risk reduction between 27% and 67% for entropy loss, and between 67% and 91% for quadratic loss. The risk reduction tends to be more pronounced in high dimensional scenarios, i.e., for p large and n small. The advantage of Σ̂cond over Σ̂LS is maintained until the “high” eigenvalues make up about 30% of the eigenvalue spectrum. Comparing the two loss functions, the risk reduction of Σ̂cond is more pronounced under quadratic loss. We note that under quadratic loss, with large p and a large proportion of “high” eigenvalues, there are cases in which the sample covariance matrix can perform well.
As an example of a moderate condition number, results for the ρ = 0.5 case are given in Supplemental Section C. The general trends are the same as in the ρ = 0.1 case.
In summary, the risk comparison study provides numerical evidence that condition number regularization has merit when the true covariance matrix has a bimodal eigenvalue distribution and/or the true condition number is large.
6 Application to portfolio selection
This section illustrates the merits of condition number regularization in the context of financial portfolio optimization, where a well-conditioned covariance estimator is necessary. A portfolio refers to a collection of risky assets held by an institution or an individual. Over the holding period, the return on the portfolio is the weighted average of the returns on the individual assets that constitute the portfolio, where the weight associated with each asset corresponds to its proportion in monetary terms. The objective of portfolio optimization is to determine the weights that maximize the return on the portfolio. Since asset returns are stochastic, a portfolio always carries a risk of loss; hence the objective is to maximize the overall return subject to a given level of risk, or equivalently to minimize risk for a given level of return. Mean-variance portfolio (MVP) theory (Markowitz, 1952) uses the standard deviation of portfolio returns to quantify the risk. Estimation of the covariance matrix of asset returns thus becomes critical in the MVP setting. An important and difficult component of MVP theory is estimating the expected return on the portfolio (Luenberger, 1998; Merton, 1980). Since the focus of this paper lies in estimating covariance matrices and not expected returns, we shall focus on determining the minimum variance portfolio, which only requires an estimate of the covariance matrix; see, e.g., Chan et al. (1999). For this we use condition number regularization, linear shrinkage, and the sample covariance matrix to construct minimum variance portfolios, and compare their respective performance over a period of more than 14 years.
6.1 Minimum variance portfolio rebalancing
We begin with a formal description of the minimum variance portfolio selection problem. The universe of assets consists of p risky assets, denoted 1,…, p. We use ri to denote the return of asset i over one period; that is, its change in price over one time period divided by its price at the beginning of the period. Let Σ denote the covariance matrix of r = (r1,…, rp). We employ wi to denote the weight of asset i in the portfolio held throughout the period. A long position in asset i corresponds to wi > 0, and a short position corresponds to wi < 0. The portfolio is therefore unambiguously represented by the vector of weights w = (w1,…, wp). Without loss of generality, the budget constraint can be written as 1Tw = 1, where 1 is the vector of all ones. The risk of a portfolio is measured by the standard deviation (wTΣw)1/2 of its return.
Now the minimum variance portfolio selection problem can be formulated as
minimize_w   wᵀΣw
subject to   1ᵀw = 1.   (21)
This is a simple quadratic program with the analytic solution w★ = (1ᵀΣ^{−1}1)^{−1} Σ^{−1}1. In practice, the parameter Σ has to be estimated.
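A minimal sketch of this computation is given below (the function name is ours); a linear solve is used instead of forming the inverse explicitly. Combined with the estimator of Section 2.1, for example, min_var_weights(cond_reg_cov(S, kappa)) gives the regularized minimum variance portfolio.

```python
# Analytic minimum variance weights from (21): w* = Sigma^{-1} 1 / (1' Sigma^{-1} 1).
import numpy as np

def min_var_weights(Sigma_hat):
    ones = np.ones(Sigma_hat.shape[0])
    w = np.linalg.solve(Sigma_hat, ones)   # Sigma_hat^{-1} 1 via a linear solve
    return w / (ones @ w)                  # budget constraint: weights sum to one
```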
The standard portfolio selection problem described above assumes that the returns are stationary, which is of course not realistic. As a way of dealing with the nonstationarity of returns, we employ a minimum variance portfolio rebalancing (MVR) strategy as follows. Let rt ∈ ℝp, t = 1, …, Ntot, denote the realized returns of the assets at time t (the time unit under consideration can be a day, a week, or a month). The periodic minimum variance rebalancing strategy is implemented by updating the portfolio weights every L time units, i.e., the entire trading horizon is subdivided into blocks each consisting of L time units. At the start of each block, we determine the minimum variance portfolio weights based on the past Nestim observations of returns. We shall refer to Nestim as the estimation horizon size. The portfolio weights are then held constant for the L time units of the block, i.e., during each of these “holding” periods, and subsequently updated at the beginning of the following one. For simplicity, we shall assume the entire trading horizon consists of Ntot = Nestim + KL time units, for some positive integer K, i.e., there will be K updates. (The last rebalancing is done at the end of the entire period, and so the out-of-sample performance of the rebalanced portfolio for this holding period is not taken into account.) We therefore have a series of portfolios w^(j) = (1ᵀ(Σ̂^(j))^{−1}1)^{−1}(Σ̂^(j))^{−1}1 over the holding periods [Nestim + 1 + (j − 1)L, Nestim + jL], j = 1, …, K. Here Σ̂^(j) is the covariance matrix of the asset returns estimated from the Nestim observations preceding the jth holding period.
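The rebalancing scheme just described can be sketched as follows. The names are ours, the returns are treated as zero-mean when forming the sample covariance, and transaction costs are ignored here; cov_estimator could be, for instance, a wrapper around the cond_reg_cov sketch of Section 2.1.

```python
# Minimum variance rebalancing (MVR): every L periods, re-estimate the
# covariance from the past N_estim returns and recompute the weights.
import numpy as np

def mvr_backtest(R, N_estim, L, cov_estimator):
    """R: (T, p) matrix of asset returns, one row per time period."""
    T, p = R.shape
    weights, realized = [], []
    for start in range(N_estim, T - L + 1, L):
        window = R[start - N_estim:start]            # estimation window
        S = window.T @ window / N_estim              # sample covariance (zero-mean)
        Sigma_hat = cov_estimator(S)
        ones = np.ones(p)
        w = np.linalg.solve(Sigma_hat, ones)
        w /= ones @ w                                # minimum variance weights
        weights.append(w)
        realized.append(R[start:start + L] @ w)      # out-of-sample returns of the block
    return np.vstack(weights), np.concatenate(realized)
```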
6.2 Empirical out-of-sample performance
In this empirical study, we use the 30 stocks that constituted the Dow Jones Industrial Average as of July 2008 (Supplemental Section D.1 lists these 30 stocks). We used the closing prices adjusted daily for all applicable splits and dividend distributions downloaded from Yahoo! Finance (http://finance.yahoo.com/). The whole period considered in our numerical study is from the trading date of December 14, 1992 to June 6, 2008 (this period consists of 4100 trading days). We consider weekly returns: the time unit is 5 consecutive trading days. We take
To estimate the covariance matrices, we use the last Nestim weekly returns of the constituents of the Dow Jones Industrial Average.3 The entire trading horizon corresponds to K = 48 holding periods, which span the dates from February 18, 1994 to June 6, 2008. In what follows, we compare the MVR strategy where the covariance matrices are estimated using the condition number regularization with those that use either the sample covariance matrix or linear shrinkage. We employ two linear shrinkage schemes: that of Warton (2008) of Section 5.2 and that of Ledoit and Wolf (2004). The latter is widely accepted as a well-conditioned estimator in the financial literature.
Performance metrics
We use the following quantities in assessing the performance of the MVR strategies. For precise formulae of these metrics, refer to Supplemental Section D.3.
Realized return. The realized return of the portfolio over the trading period.
Realized risk. The realized risk (return standard deviation) of the portfolio over the trading period.
Realized Sharpe ratio. The realized excess return, with respect to the risk-free rate, per unit risk of the portfolio.
Turnover. Amount of new portfolio assets purchased or sold over the trading period.
Normalized wealth growth. Accumulated wealth yielded by the portfolio over the trading period when the initial budget is normalized to one, taking the transaction cost into account.
Size of the short side. The proportion of the short side (negative) weights to the sum of the absolute weights of the portfolio.
We assume that the transaction costs are the same for the 30 stocks and set them to 30 basis points. The risk-free rate is set at 5% per annum.
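As an illustration, two of the listed metrics might be computed as in the sketch below; the exact formulae used in the paper are given in Supplemental Section D.3, so the annualization convention and the turnover definition here are assumptions made for the sketch.

```python
# Illustrative computations of the realized Sharpe ratio and turnover from
# the per-period portfolio returns and the sequence of rebalanced weights.
import numpy as np

def realized_sharpe(port_returns, rf_per_period, periods_per_year=52):
    excess = np.asarray(port_returns) - rf_per_period      # excess over risk-free rate
    return np.sqrt(periods_per_year) * excess.mean() / excess.std(ddof=1)

def total_turnover(weights):
    """Sum over rebalancing dates of the absolute weight changes, over all assets."""
    return float(np.sum(np.abs(np.diff(np.asarray(weights), axis=0))))
```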
Comparison results
Figure 5 shows the normalized wealth growth over the trading horizon for four different values of Nestim. The sample covariance matrix could not be used to solve (21) for Nestim = 15 because it is singular, and hence is omitted from this figure. The MVR strategy using the condition number-regularized covariance matrix delivers higher growth in this performance metric than strategies using the sample covariance matrix, linear shrinkage, or index tracking. The higher growth is realized consistently across the 14-year trading period and regardless of the estimation horizon. A trading strategy based on the condition number-regularized covariance matrix consistently performs better than the S&P 500 index and can lead to profits as much as 175% higher than its closest competitor.
Further analysis reveals a useful result: there is no significant difference between the condition number regularization approach and the two linear shrinkage schemes in terms of the realized return, risk, and Sharpe ratio. Supplemental Section D.4 summarizes these metrics for each estimator, averaged over the trading period. For all values of Nestim, the average differences in the metrics between the regularization schemes are within two standard errors. Hence the condition number regularized estimator delivers better normalized wealth growth than the other estimators without compromising on other measures such as volatility.
The turnover of the portfolio appears to be one of the major driving factors of the difference in wealth growth. In particular, the MVR strategy using the condition number-regularized covariance matrix gives far lower turnover, and thus more stable weights, than when using the linear shrinkage estimator or the sample covariance matrix. (See Supplemental Section D.5 for plots.) A lower turnover also implies lower transaction costs, thereby partially contributing to the higher wealth growth. Note that there is no explicit limit on turnover. The stability of the MVR portfolio using condition number regularization appears to be related to the small size of its short side (reported in Supplemental Section D.6). Because stock borrowing is expensive, the condition number regularization based strategy can be advantageous in practice.
Acknowledgments
We thank the editor and the associate editor for useful comments that improved the presentation of the paper. J. Won was partially supported by the US National Institutes of Health (NIH) (MERIT Award R37EB02784) and by the US National Science Foundation (NSF) grant CCR 0309701. J. Lim’s research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (grant number: 2010-0011448). B. Rajaratnam was supported in part by NSF under grant nos. DMS-09-06392, DMS-CMG 1025465, AGS-1003823, DMS-1106642 and grants NSA H98230-11-1-0194, DARPA-YFA N66001-11-1-4131, and SUWIEVP10-SUFSC10-SMSCVISG0906.
Appendix
Proof of Lemma 1
Recall the spectral decomposition of the sample covariance matrix S = QLQᵀ, with L = diag(l1,…, lp) and l1 ≥ … ≥ lp ≥ 0. From the objective function in (8), suppose the variable Ω has the spectral decomposition RMRᵀ, with R orthogonal and M = diag(μ1,…, μp), μ1 ≤ … ≤ μp. Then the objective function in (8) can be written as

tr(SΩ) − log det Ω = tr( QLQᵀ RMRᵀ ) − Σ_{i=1}^{p} log μi ≥ Σ_{i=1}^{p} ( li μi − log μi ),

with equality in the last line when R = Q (Farrell, 1985, Ch. 14). Hence (8) amounts to

minimize   Σ_{i=1}^{p} ( li μi − log μi )
subject to   u ≤ μ1 ≤ ··· ≤ μp ≤ κmaxu,   (22)
with the optimization variables μ1, …, μp and u.
For the moment we shall ignore the order constraints among the eigenvalues. Then problem (22) becomes separable in μ1,…, μp. Call this related problem (22*). For a fixed u, the minimizer of each summand of the objective function in (22) without the order constraints is given as
μi^*(u) = min{ max(1/li, u), κmaxu },   i = 1, …, p.   (23)
Note that (23) however satisfies the order constraints. In other words, μ1^*(u) ≤ ··· ≤ μp^*(u) for all u. Therefore (22*) is equivalent to (22). Plugging (23) into (22) removes the constraints and the objective function reduces to a univariate one:
Jκmax(u) ≜ Σ_{i=1}^{p} Jκmax^(i)(u),   (24)

where

Jκmax^(i)(u) = li μi^*(u) − log μi^*(u) = κmax li u − log(κmax u)   if u < 1/(κmax li),
             = 1 + log li                                          if 1/(κmax li) ≤ u ≤ 1/li,
             = li u − log u                                         if u > 1/li.

The function (24) is convex, since each Jκmax^(i) is convex in u.
Proof of Theorem 1
The function Jκmax^(i) is convex and is constant on the interval [1/(κmaxli), 1/li]. Thus, the function Jκmax has a region on which it is constant if and only if the intervals [1/(κmaxli), 1/li], i = 1, …, p, have a common interior point, i.e.,

1/(κmaxlp) < 1/l1,

or equivalently, κmax > cond(S). Therefore, provided that κmax ≤ cond(S), the convex function Jκmax(u) does not have a region on which it is constant. Since Jκmax(u) is strictly decreasing for 0 < u < 1/(κmaxl1) and strictly increasing for u > 1/lp, it has a unique minimizer u★. If κmax > cond(S), the minimizer u★ may not be unique because Jκmax(u) has a plateau. However, since the condition number constraint then becomes inactive, Σ̂cond = S for all the minimizers.
Now assume that κmax ≤ cond(S). For α ∈ {1,…, p − 1} and β ∈ {2, …, p}, define the following two quantities:

uα,β = (α + p − β + 1) / ( Σ_{i=1}^{α} li + κmax Σ_{i=β}^{p} li )   and   vα,β = κmax uα,β.
By construction, uα,β coincides with u★ if and only if
1/lα < uα,β ≤ 1/lα+1   and   1/lβ−1 ≤ vα,β < 1/lβ.   (25)
Consider a set of rectangles {Rα,β} in the uv-plane such that

Rα,β = { (u, v) : 1/lα < u ≤ 1/lα+1,  1/lβ−1 ≤ v < 1/lβ }.
Then condition (25) is equivalent to
(uα,β, vα,β) ∈ Rα,β   (26)
in the uv-plane. Since {Rα,β} partitions {(u, v) : 1/l1 < u ≤ 1/lp and 1/l1 ≤ v < 1/lp} and (uα,β, vα,β) is on the line v = κmaxu, an obvious algorithm to find the pair (α, β) that satisfies condition (26) is to keep track of the rectangles Rα,β that intersect this line. To see that this algorithm takes O(p) operations, start from the origin of the uv-plane and increase u and v along the line v = κmaxu. Since κmax > 1, if the line intersects Rα,β, then the next intersection occurs in one of the three rectangles Rα+1,β, Rα,β+1, and Rα+1,β+1. Therefore, after finding the first intersection (which is on the line u = 1/l1), the search requires at most 2p tests to satisfy condition (26). Finding the first intersection takes at most p tests.
Proof of Proposition 1
Recall that, for κmax = ν0,
and
where α = α(ν0) ∈ {1,…, p} is the largest index such that 1/lα < u★(ν0) and β = β(ν0) ∈ {1,…, p} is the smallest index such that 1/lβ > ν0u★(ν0). Then
and
The positions of the lower and upper limits u★(ν0) and v★(ν0) relative to the reciprocal sample eigenvalues can be divided into four cases:
-
1/lα < u★(ν0) < 1/lα+1 and 1/lβ−1 < v★(ν0) < 1/lβ.
We can find ν > ν0 such thatandTherefore,and -
u★(ν0) = 1/lα+1 and 1/lβ−1 < v★(ν0) < 1/lβ.
Suppose u★(ν) > u★(ν0). Then we can find ν > ν0 such that α(ν) = α(ν0) + 1 = α + 1 and β(ν) = β(ν0) = β. Then,Therefore,orwhich is a contradiction. Therefore, u★(ν) ≤ u★(ν0).
Now, we can find ν > ν0 such that α(ν) = α(ν0) = α and β(ν) = β(ν0) = β. This reduces to case 1.
-
1/lα < u★(ν0) < 1/lα+1 and v★(ν0) = 1/lβ−1.
Suppose v★(ν) < v★(ν0). Then we can find ν > ν0 such that α(ν) = α(ν0) = α and β(ν) = β(ν0) − 1 = β − 1. Then,Therefore,orwhich is a contradiction. Therefore, v★(ν) ≥ v★(ν0).
Now, we can find ν > ν0 such that α(ν) = α(ν0) = α and β(ν) = β(ν0) = β. This reduces to case 1.
u★(ν0) = 1/lα+1 and v★(ν0) = 1/lβ−1. 1/lα+1 = u★(ν0) = v★(ν0)/ν0 = 1/(ν0lβ−1). This is a measure zero event and does not affect the conclusion.
Footnotes
A preliminary version of the paper has appeared in an unrefereed conference proceedings previously.
This procedure was first considered by two of the authors of this paper in a previous conference paper and is further elaborated in this paper (see Won and Kim (2006)).
This problem can be considered a generalization of the problem studied by Sheena and Gupta (2003), who consider imposing either a fixed lower bound or a fixed upper bound on the eigenvalues. Their approach is however different from ours in a fundamental sense in that it is not designed to control the condition number. Hence such a solution does not correct for the overestimation of the largest eigenvalues and underestimation of the small eigenvalues simultaneously.
Supplemental Section D.2 shows the periods determined by the choice of the parameters.
Supplemental materials: The accompanying supplemental materials contain the proofs of Theorems 2 and 3 and Proposition 2, additional figures illustrating the Bayesian prior densities, an additional figure illustrating the risk simulations, and details of the empirical minimum variance rebalancing study.
Contributor Information
Joong-Ho Won, School of Industrial Management Engineering, Korea University, Seoul, Korea.
Johan Lim, Department of Statistics, Seoul National University, Seoul, Korea.
Seung-Jean Kim, Citi Capital Advisors, New York, NY, USA.
Bala Rajaratnam, Department of Statistics, Stanford University, Stanford, CA, USA.
References
- Banerjee O, El Ghaoui L, d'Aspremont A. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research. 2008;9:485–516.
- Boyd S, Vandenberghe L. Convex Optimization. Cambridge University Press; 2004.
- Chan N, Karceski N, Lakonishok J. On portfolio optimization: Forecasting covariances and choosing the risk model. Review of Financial Studies. 1999;12(5):937–974.
- Daniels M, Kass R. Shrinkage estimators for covariance matrices. Biometrics. 2001;57:1173–1184. doi: 10.1111/j.0006-341x.2001.01173.x.
- Dempster AP. Covariance selection. Biometrics. 1972;28(1):157–175.
- Dey DK, Srinivasan C. Estimation of a covariance matrix under Stein's loss. The Annals of Statistics. 1985;13(4):1581–1591.
- Farrell RH. Multivariate Calculation. Springer-Verlag; New York: 1985.
- Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9(3):432–441. doi: 10.1093/biostatistics/kxm045.
- Haff LR. The variational form of certain Bayes estimators. The Annals of Statistics. 1991;19(3):1163–1190.
- Hero A, Rajaratnam B. Large-scale correlation screening. Journal of the American Statistical Association. 2011;106(496):1540–1552.
- Hero A, Rajaratnam B. Hub discovery in partial correlation graphs. IEEE Transactions on Information Theory. 2012;58(9):6064–6078.
- James W, Stein C. Estimation with quadratic loss. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability; Stanford, California. 1961. pp. 361–379.
- Khare K, Rajaratnam B. Wishart distributions for decomposable covariance graph models. The Annals of Statistics. 2011;39(1):514–555.
- Ledoit O, Wolf M. Improved estimation of the covariance matrix of stock returns with an application to portfolio selection. Journal of Empirical Finance. 2003;10(5):603–621.
- Ledoit O, Wolf M. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis. 2004;88:365–411.
- Ledoit O, Wolf M. Nonlinear shrinkage estimation of large-dimensional covariance matrices. The Annals of Statistics. 2012;40(2):1024–1060.
- Letac G, Massam H. Wishart distributions for decomposable graphs. The Annals of Statistics. 2007;35(3):1278–1323.
- Lin S, Perlman M. A Monte-Carlo comparison of four estimators of a covariance matrix. Multivariate Analysis. 1985;6:411–429.
- Luenberger DG. Investment Science. Oxford University Press; New York: 1998.
- Markowitz H. Portfolio selection. Journal of Finance. 1952;7(1):77–91.
- Merton R. On estimating expected returns on the market: An exploratory investigation. Journal of Financial Economics. 1980;8:323–361.
- Michaud RO. The Markowitz optimization enigma: Is optimized optimal? Financial Analysts Journal. 1989;45(1):31–42.
- Peng J, Wang P, Zhou N, Zhu J. Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association. 2009;104(486):735–746. doi: 10.1198/jasa.2009.0126.
- Pourahmadi M, Daniels MJ, Park T. Simultaneous modelling of the Cholesky decomposition of several covariance matrices. Journal of Multivariate Analysis. 2007;98(3):568–587.
- Rajaratnam B, Massam H, Carvalho C. Flexible covariance estimation in graphical Gaussian models. The Annals of Statistics. 2008;36(6):2818–2849.
- Sheena Y, Gupta A. Estimation of the multivariate normal covariance matrix under some restrictions. Statistics & Decisions. 2003;21:327–342.
- Stein C. Some problems in multivariate analysis, Part I. Technical Report 6, Department of Statistics, Stanford University; 1956.
- Stein C. Estimation of a covariance matrix. Rietz Lecture, IMS-ASA Annual Meeting; 1975.
- Stein C. Lectures on the theory of estimation of many parameters (English translation). Journal of Mathematical Sciences. 1986;34(1):1373–1403.
- Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological). 1996;58(1):267–288.
- Warton DI. Penalized normal likelihood and ridge regularization of correlation and covariance matrices. Journal of the American Statistical Association. 2008;103(481):340–349.
- Won JH, Kim S-J. Maximum likelihood covariance estimation with a condition number constraint. Proceedings of the Fortieth Asilomar Conference on Signals, Systems and Computers; 2006. pp. 1445–1449.
- Yang R, Berger JO. Estimation of a covariance matrix using the reference prior. The Annals of Statistics. 1994;22(3):1195–1211.