Published in final edited form as: J R Stat Soc Series B Stat Methodol. 2009 Feb 20;71(3):615–636. doi: 10.1111/j.1467-9868.2009.00699.x

Covariance-regularized regression and classification for high-dimensional problems

Daniela M. Witten and Robert Tibshirani

Summary

In recent years, many methods have been developed for regression in high-dimensional settings. We propose covariance-regularized regression, a family of methods that use a shrunken estimate of the inverse covariance matrix of the features in order to achieve superior prediction. An estimate of the inverse covariance matrix is obtained by maximizing its log likelihood, under a multivariate normal model, subject to a constraint on its elements; this estimate is then used to estimate coefficients for the regression of the response onto the features. We show that ridge regression, the lasso, and the elastic net are special cases of covariance-regularized regression, and we demonstrate that certain previously unexplored forms of covariance-regularized regression can outperform existing methods in a range of situations. The covariance-regularized regression framework is extended to generalized linear models and linear discriminant analysis, and is used to analyze gene expression data sets with multiple class and survival outcomes.

Keywords: regression, classification, n ≪ p, covariance regularization

1. Introduction

In high-dimensional regression problems, where p, the number of features, is nearly as large as, or larger than, n, the number of observations, ordinary least squares regression does not provide a satisfactory solution. A remedy for the shortcomings of least squares is to modify the sum of squared errors criterion used to estimate the regression coefficients, using penalties that are based on the magnitudes of the coefficients:

$\hat{\beta} = \arg\min_\beta\ \|y - X\beta\|^2 + \lambda_1\|\beta\|^{p_1} + \lambda_2\|\beta\|^{p_2}$ (1)

(Here, the notation $\|\beta\|^s$ is used to indicate $\sum_{i=1}^p |\beta_i|^s$.) Many popular regularization methods fall into this framework. For instance, when λ2 = 0, p1 = 0 gives best subset selection, p1 = 2 gives ridge regression (Hoerl & Kennard 1970), and p1 = 1 gives the lasso (Tibshirani 1996). More generally, for λ2 = 0 and p1 ≥ 0, the above equation defines the bridge estimators (Frank & Friedman 1993). Equation 1 defines the elastic net (up to a scaling) in the case that p1 = 1 and p2 = 2 (Zou & Hastie 2005). In this paper, we present a new approach to regularizing linear regression that involves applying a penalty not to the sum of squared errors, but rather to the log likelihood of the inverse covariance matrix under a multivariate normal model.

The least squares solution is β̂ = (XTX)−1XTy. In multivariate normal theory, the entries of (XTX)−1 that equal zero correspond to pairs of variables that have no partial correlation; in other words, pairs of variables that are conditionally independent, given all of the other features in the data. Non-zero entries of (XTX)−1 correspond to nonzero partial correlations. One way to perform regularization of least squares regression is to shrink the matrix (XTX)−1; in fact, this is done by ridge regression, since the ridge solution can be written as β̂ridge = (XTX + λI)−1XTy. Here, we propose a more general approach to shrinkage of the inverse covariance matrix. Our method involves estimating a regularized inverse covariance matrix by maximizing its log likelihood under a multivariate normal model, subject to a constraint on its elements. In doing this, we attempt to distinguish between variables that truly are partially correlated with each other and variables that in fact have zero partial correlation. We then use this regularized inverse covariance matrix in order to obtain regularized regression coefficients. We call the class of regression methods defined by this procedure the scout.

In Section 2, we present the scout criteria and explain the method in greater detail. We also discuss connections between the scout and pre-existing regression methods. In particular, we show that ridge regression, the lasso, and the elastic net are special cases of the scout. In addition, we present some specific members of the scout class that perform well relative to pre-existing methods in a variety of situations. In Sections 3, 4, and 5, we demonstrate the use of these methods in regression, classification, and generalized linear model settings on simulated data and on a number of gene expression data sets.

2. The Scout Method

2.1. The General Scout Family

Let X = (x1,…, xp) denote an n × p matrix of data, where n is the number of observations and p the number of features. Let y denote a vector of length n, containing a response value for each observation. Assume that the columns of X are standardized, and that y is centered. We can create a matrix X̃ = (X y), which has dimension n × (p + 1). If we assume that X̃ is generated from the model X̃ ~ N(0, Σ), then we can find the maximum likelihood estimator of the population inverse covariance matrix Σ−1 by maximizing

$\log\left(\det\Sigma^{-1}\right) - \mathrm{tr}\left(S\Sigma^{-1}\right)$ (2)

where $S = \begin{pmatrix} S_{xx} & S_{xy} \\ S_{xy}^T & S_{yy} \end{pmatrix}$ is the empirical covariance matrix of X̃. Assume for a moment that S is invertible. Then, the maximum likelihood estimator for Σ−1 is S−1 (we use the fact that $\frac{d}{dW}\log\det W = W^{-1}$ for a symmetric positive definite matrix W). Let $\Theta = \begin{pmatrix} \Theta_{xx} & \Theta_{xy} \\ \Theta_{xy}^T & \Theta_{yy} \end{pmatrix}$ denote a symmetric estimate of Σ−1. The problem of regressing y onto X is closely related to the problem of estimating Σ−1, since the least squares coefficients for the regression equal $-\Theta_{xy}/\Theta_{yy}$ for Θ = S−1 (this follows from the partitioned inverse formula). If p > n, then some type of regularization is needed in order to estimate the regression coefficients, since S is not invertible. Even if p < n, we may want to shrink the least squares coefficients in some way in order to achieve superior prediction. The connection between estimation of Θ and estimation of the least squares coefficients suggests the possibility that rather than shrinking the coefficients β by applying a penalty to the sum of squared errors for the regression of y onto X, as is done in e.g. ridge regression or the lasso, we can obtain shrunken β estimates through maximization of the penalized log likelihood of the inverse covariance matrix Σ−1.

To do this, one could estimate Σ−1 as Θ that maximizes

$\log(\det\Theta) - \mathrm{tr}(S\Theta) - J(\Theta)$ (3)

where J(Θ) is a penalty function. For example, J(Θ) = ||Θ||p denotes the sum of absolute values of the elements of Θ if p = 1, and it denotes the sum of squared elements of Θ if p = 2. Our regression coefficients would then be given by the formula $\beta = -\Theta_{xy}/\Theta_{yy}$. However, recall that if X̃ ~ N(0, Σ), then the ij element of Θ gives the correlation of X̃i with X̃j, conditional on all of the other variables in X̃. Note that y is included in X̃. So it does not make sense to regularize the elements of Θ as presented above, because we really care about the partial correlations of pairs of variables given the other variables, as opposed to the partial correlations of pairs of variables given the other variables and the response.

For these reasons, rather than obtaining an estimate of Σ−1 by maximizing the penalized log likelihood in Equation 3, we estimate it via a two-stage maximization, given in the following algorithm:

The Scout Procedure for General Penalty Functions

  1. Compute Θ̂xx, which maximizes

    $\log(\det\Theta_{xx}) - \mathrm{tr}(S_{xx}\Theta_{xx}) - J_1(\Theta_{xx})$ (4)
  2. Compute Θ̂, which maximizes

    $\log(\det\Theta) - \mathrm{tr}(S\Theta) - J_2(\Theta)$ (5)

    where the top left p × p submatrix of Θ̂ is constrained to equal Θ̂xx, the solution to Step 1.

  3. Compute β̂, defined by $\hat\beta = -\hat\Theta_{xy}/\hat\Theta_{yy}$.

  4. Compute β̂* = cβ̂, where c is the coefficient for the regression of y onto Xβ̂.

β̂* denotes the regularized coefficients obtained using this new method. Step 1 of the Scout Procedure involves obtaining shrunken estimates of (Σxx)−1 in order to smooth our estimates of which variables are conditionally independent. Step 2 involves obtaining shrunken estimates of Σ−1, conditional on (Σ−1)xx = Θ̂xx, the estimate obtained in Step 1. Thus, we obtain regularized estimates of which predictors are dependent on y, given all of the other predictors. The scaling in the last step is performed because it has been found, empirically, to improve performance.

By penalizing the entries of the inverse covariance matrix of the predictors in Step 1 of the Scout Procedure, we are attempting to distinguish between pairs of variables that truly are conditionally dependent, and pairs of variables that appear to be conditionally dependent due only to chance. We are searching, or scouting, for variables that truly are correlated with each other, conditional on all of the other variables. Our hope is that sets of variables that truly are conditionally dependent will also be related to the response. In the context of a microarray experiment, where the variables are genes and the response is some clinical outcome, this assumption is reasonable: we seek genes that are part of a pathway related to the response. One expects that such genes will also be conditionally dependent. In Step 2, we shrink our estimates of the partial correlation between each predictor and the response, given the shrunken partial correlations between the predictors that we estimated in Step 1. In contrast to ordinary least squares regression, which uses the inverse of the empirical covariance matrix to compute regression coefficients, we jointly model the relationship that the p predictors have with each other and with the response in order to obtain shrunken regression coefficients.

We define the scout family of estimated coefficients for the regression of y onto X as the solutions β̂* obtained in Step 4 of the Scout Procedure. We refer to the penalized log likelihoods in Steps 1 and 2 of the Scout Procedure as the first and second scout criteria.

In the rest of the paper, when we discuss properties of the scout, for ease of notation we will ignore the scale factor in Step 4 of the Scout Procedure. For instance, if we claim that two procedures yield the same regression coefficients, we more specifically mean that the regression coefficients are the same up to scaling by a constant factor.

Least squares, the elastic net, the lasso, and ridge regression result from the scout procedure with appropriate choices of J1 and J2 (up to a scaling by a constant). Details are in Table 1. The first two results can be shown directly by differentiating the scout criteria, and the others follow from Equation 11 in Section 2.4.

Table 1.

Special cases of the scout.

J1(Θxx) J2(Θ) Method
0 0 Least Squares
tr(Θxx) 0 Ridge Regression
tr(Θxx) ||Θ||1 Elastic Net
0 ||Θ||1 Lasso
0 ||Θ||2 Ridge Regression

2.2. Lp Penalties

Throughout the remainder of this paper, we will exclusively be interested in the case that $J_1(\Theta) = \lambda_1\|\Theta\|^{p_1}$ and $J_2(\Theta) = \frac{\lambda_2}{2}\|\Theta\|^{p_2}$, where the norm is taken elementwise over the entries of Θ, and where λ1, λ2 ≥ 0. For ease of notation, Scout(p1, p2) will refer to the solution to the scout criterion with J1 and J2 as just mentioned. If λ2 = 0, then this will be indicated by Scout(p1, ·), and if λ1 = 0, then this will be indicated by Scout(·, p2). Therefore, in the rest of this paper, the Scout Procedure will be as follows:

The Scout Procedure with Lp Penalties

  1. Compute Θ̂xx, which maximizes

    $\log(\det\Theta_{xx}) - \mathrm{tr}(S_{xx}\Theta_{xx}) - \lambda_1\|\Theta_{xx}\|^{p_1}$ (6)
  2. Compute Θ̂, which maximizes

    $\log(\det\Theta) - \mathrm{tr}(S\Theta) - \frac{\lambda_2}{2}\|\Theta\|^{p_2}$ (7)

    where the top left p × p submatrix of Θ̂ is constrained to equal Θ̂xx, the solution to Step 1. Note that because of this constraint, the penalty really is only being applied to the last row and column of Θ̂.

  3. Compute β̂, defined by $\hat\beta = -\hat\Theta_{xy}/\hat\Theta_{yy}$.

  4. Compute β̂* = cβ̂, where c is the coefficient for the regression of y onto Xβ̂.

(Note that in Step 2, because the top left p × p submatrix of Θ̂ is fixed, the penalty on the top left p × p elements has no effect).

2.3. Simple Example

Here, we present a toy example in which n = 20 observations on p = 19 variables are generated under the model y = Xβ + ε, where βj = j for j ≤ 10 and βj = 0 for j > 10, and where ε ~ N (0, 25). In addition, the first 10 variables have correlation 0.5 with each other; the rest are uncorrelated.

Figure 1 shows the average over 500 simulations of the following four quantities: the true value of β, the least squares regression coefficients, the estimate Σ−1Cov(X, y) where Σ is the true covariance matrix of X in the underlying model, and the Scout(1, ·) regression estimate. It is not surprising that least squares performs poorly in this situation, since n is barely larger than p. Scout(1, ·) performs extremely well; though it results in coefficient estimates that are slightly biased, they have much less variance than the estimates obtained using the true covariance matrix. This simple example demonstrates that benefits can result from the use of a shrunken estimate of the inverse covariance matrix.

Fig. 1.

Data were generated under a simple model; results shown are averaged over 500 simulations. Standard error bars and coefficient estimates are shown. Clockwise from the top left, panels show the true value of β, Σ−1Cov(X, y), (XTX)−1XTy, and Scout(1, ·).
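To make the comparison in Figure 1 concrete, here is a rough sketch in Python of a single replicate (our illustration, not the authors' code): it draws data from the model just described, fits least squares, and computes Scout(1, ·) using the shortcut β̂ = Θ̂xx Sxy that Section 2.4 justifies when λ2 = 0. Note that scikit-learn's graphical lasso penalizes only the off-diagonal entries of Θxx, a small departure from the penalty described here, and the value alpha = 0.2 is an arbitrary choice.

```python
# Sketch of one replicate of the toy example (our illustration, not the authors' code).
import numpy as np
from sklearn.covariance import graphical_lasso  # note: penalizes off-diagonal entries only

rng = np.random.default_rng(0)
n, p = 20, 19
Sigma = np.zeros((p, p))
Sigma[:10, :10] = 0.5                      # first 10 variables pairwise correlated at 0.5
np.fill_diagonal(Sigma, 1.0)
beta_true = np.concatenate([np.arange(1.0, 11.0), np.zeros(9)])

X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
X = (X - X.mean(0)) / X.std(0)             # standardize the columns
y = X @ beta_true + rng.normal(scale=5.0, size=n)   # epsilon ~ N(0, 25)
y = y - y.mean()

S_xx = X.T @ X / n
S_xy = X.T @ y / n

beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]      # unstable: n is barely larger than p
_, Theta_xx = graphical_lasso(S_xx, alpha=0.2)      # Step 1 with p1 = 1
beta_scout = Theta_xx @ S_xy                        # Steps 2-3 with lambda2 = 0
z = X @ beta_scout
beta_scout *= (z @ y) / (z @ z)                     # Step 4: rescale by regressing y on X beta

print(np.round(beta_ls, 2))
print(np.round(beta_scout, 2))
```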

2.4. Maximization of the Scout Criteria with Lp Penalties

If λ1 = 0, then the maximum of the first scout criterion is attained at (Sxx)−1 (if Sxx is invertible). In the case that λ1 > 0 and p1 = 1, maximization of the first scout criterion has been studied extensively; see e.g. Meinshausen & Bühlmann (2006). The solution can be found via the “graphical lasso”, an efficient algorithm given by Banerjee et al. (2008) and Friedman et al. (2008b) that involves iteratively regressing one row of the estimated covariance matrix onto the others, subject to an L1 constraint, in order to update the estimate for that row.

If λ1 > 0 and p1 = 2, the solution to Step 1 of the Scout Procedure is even easier. We want to find Θxx that maximizes

$\log(\det\Theta_{xx}) - \mathrm{tr}(S_{xx}\Theta_{xx}) - \lambda\|\Theta_{xx}\|^2$ (8)

Differentiating with respect to Θxx, we see that the maximum solves

$\Theta_{xx}^{-1} - 2\lambda\Theta_{xx} = S_{xx}$ (9)

This equation implies that Θxx and Sxx share the same eigenvectors. Letting θi denote the ith eigenvalue of Θxx and letting si denote the ith eigenvalue of Sxx, it is clear that

$\frac{1}{\theta_i} - 2\lambda\theta_i = s_i$ (10)

We can easily solve for θi, and can therefore solve the first scout criterion exactly in the case p1 = 2, in essentially just the computational cost of obtaining the eigenvalues of Sxx.
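Solving Equation 10 for its positive root gives $\theta_i = \frac{-s_i + \sqrt{s_i^2 + 8\lambda}}{4\lambda}$, so the eigenvalues of the implied covariance estimate $\hat\Sigma_{xx} = \hat\Theta_{xx}^{-1}$ are $\frac{1}{2}\left(s_i + \sqrt{s_i^2 + 8\lambda}\right)$. The following sketch (our illustration) computes the closed-form Step 1 solution and checks the stationarity condition of Equation 9:

```python
# Sketch: exact Step-1 solution when p1 = 2 (our illustration).
import numpy as np

def scout_step1_l2(S_xx, lam):
    """Maximize log det(Theta) - tr(S_xx Theta) - lam * ||Theta||_F^2 in closed form."""
    s, V = np.linalg.eigh(S_xx)                      # S_xx = V diag(s) V^T
    theta = (-s + np.sqrt(s**2 + 8.0 * lam)) / (4.0 * lam)
    Theta_xx = (V * theta) @ V.T                     # shares eigenvectors with S_xx
    Sigma_xx = (V * (1.0 / theta)) @ V.T             # eigenvalues (s_i + sqrt(s_i^2 + 8 lam)) / 2
    return Theta_xx, Sigma_xx

# quick check on a random covariance matrix
rng = np.random.default_rng(1)
A = rng.normal(size=(50, 10))
S_xx = A.T @ A / 50
Theta_xx, Sigma_xx = scout_step1_l2(S_xx, lam=0.1)
# stationarity condition (Equation 9): Theta^{-1} - 2*lam*Theta = S_xx
assert np.allclose(np.linalg.inv(Theta_xx) - 0.2 * Theta_xx, S_xx, atol=1e-8)
```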

It turns out that if p2 = 1 or p2 = 2, then it is not necessary to maximize the second scout criterion directly, as there is an easier alternative:

Claim 1

For p2 ∈ {1, 2}, the solution to Step 3 of the Scout Procedure is equal to the solution to the following, up to scaling by a constant:

$\hat\beta = \arg\min_\beta\left\{\beta^T\hat\Sigma_{xx}\beta - 2S_{xy}^T\beta + \lambda_2\|\beta\|^{p_2}\right\}$ (11)

where Σ̂xx is the inverse of the solution to Step 1 of the Scout Procedure.

(The proof of Claim 1 is in Section 8.1.1 in the Appendix.) Therefore, we can replace Steps 2 and 3 of the Scout Procedure with an Lp2 regression. It is trivial to show that if λ2 = 0 in the Scout Procedure, then the scout solution is given by β̂ = (Σ̂xx)−1Sxy. It also follows that if λ1 = 0, then the cases λ2 = 0, p2 = 1, and p2 = 2 correspond to ordinary least squares regression (if the empirical covariance matrix is invertible), the lasso, and ridge regression, respectively.
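In code, Claim 1 lets Steps 2 and 3 be carried out with standard penalized regression once Σ̂xx is available from Step 1. The sketch below (our illustration, with function and parameter names of our own choosing) solves Equation 11 in closed form for p2 = 2 and, for p2 = 1, converts it into an ordinary lasso by factorizing Σ̂xx; the scaling of scikit-learn's lasso parameter relative to λ2 depends on that solver's objective and should be treated as indicative.

```python
# Sketch (our illustration): Steps 2-3 via Equation 11, given Sigma_xx_hat from Step 1.
import numpy as np
from sklearn.linear_model import Lasso

def scout_coefficients(Sigma_xx_hat, S_xy, lam2=0.0, p2=2):
    """Solve Equation 11: argmin_b  b' Sigma_hat b - 2 S_xy' b + lam2 * ||b||^p2."""
    p = Sigma_xx_hat.shape[0]
    if lam2 == 0.0:
        return np.linalg.solve(Sigma_xx_hat, S_xy)                 # beta = Sigma_hat^{-1} S_xy
    if p2 == 2:
        return np.linalg.solve(Sigma_xx_hat + lam2 * np.eye(p), S_xy)
    if p2 == 1:
        # Reduce to a standard lasso: with Sigma_hat = L L', set X* = L' and
        # y* = L^{-1} S_xy, so that b' Sigma_hat b - 2 S_xy' b = ||y* - X* b||^2 - ||y*||^2.
        L = np.linalg.cholesky(Sigma_xx_hat)
        X_star, y_star = L.T, np.linalg.solve(L, S_xy)
        # sklearn's Lasso minimizes ||y - Xb||^2 / (2 n_samples) + alpha ||b||_1,
        # so alpha = lam2 / (2 p) matches the objective above up to a constant factor.
        fit = Lasso(alpha=lam2 / (2.0 * p), fit_intercept=False, max_iter=10000)
        return fit.fit(X_star, y_star).coef_
    raise ValueError("only p2 in {1, 2} (or lam2 = 0) is handled in this sketch")

# usage (hypothetical inputs): beta = scout_coefficients(Sigma_xx, X.T @ y / n, lam2=0.1, p2=1)
# Step 4 of the Scout Procedure would still rescale beta by regressing y on X @ beta.
```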

In addition, we will show in Section 2.5.1 that if p1 = 2 and p2 = 1, then the scout can be re-written as an elastic net problem with slightly different data; therefore, fast algorithms for solving the elastic net (Friedman et al. 2008a) can be used to solve Scout(2, 1). The methods for maximizing the scout criteria are summarized in Table 2.

Table 2.

Maximization of the scout criteria: special cases. The scout criteria can be easily maximized if λ1 = 0 or p1 ∈ {1, 2}, and if λ2 = 0 or p2 ∈ {1, 2}.

λ2 = 0 p2 = 1 p2 = 2
λ1 = 0 Least Squares L1 Regression L2 Regression
p1 = 1 Graphical Lasso Graphical Lasso + L1 Reg. Graphical Lasso + L2 Reg.
p1 = 2 Eigenvalue Problem Elastic Net Eigenvalue Problem + L2 Reg.

We compared computation times for Scout(2, ·), Scout(1, ·), Scout(2, 1), and Scout(1, 1) on an example with n = 100, λ2 = λ1 = 0.2, and X dense. All timings were carried out on an Intel Xeon 2.80 GHz processor. Table 3 shows the number of CPU seconds required for each of these methods for a range of values of p. For all methods, after the scout coefficients have been estimated for a given set of parameter values, estimation for different parameter values is faster due to a warm start (when p1 = 1) or because the eigendecomposition has already been computed (when p1 = 2).

Table 3.

Timing comparisons for minimization of the scout criteria. The numbers of CPU seconds required to run four versions of the scout are shown, for λ1 = λ2 = 0.2, n = 100, X dense, and various values of p.

p Scout(1, ·) Scout(1, 1) Scout(2, ·) Scout(2, 1)
500 1.685 1.700 0.034 0.072
1000 22.432 22.504 0.083 0.239
2000 241.289 241.483 0.260 0.466

2.5. Properties of the Scout

In this section, for ease of notation, we will consider an equivalent form of the Scout Procedure obtained by replacing Sxx with XT X and Sxy with XT y.

2.5.1. Similarities between Scout, Ridge Regression, and the Elastic Net

Let $U_{n\times p}D_{p\times p}V^T_{p\times p}$ denote the singular value decomposition of X, with di the ith diagonal element of D and $d_1 \ge d_2 \ge \cdots \ge d_r > d_{r+1} = \cdots = d_p = 0$, where r = rank(X) ≤ min(n, p). Consider Scout(2, p2). As previously discussed, the first step in the Scout Procedure corresponds to finding Θ that solves

$\Theta^{-1} - 2\lambda_1\Theta = X^TX$ (12)

Since Θ and XTX therefore share the same eigenvectors, it follows that $\Theta^{-1} = V(D^2 + \tilde D^2)V^T$, where $\tilde D^2$ is a p × p diagonal matrix with ith diagonal entry equal to $\frac{1}{2}\left(-d_i^2 + \sqrt{d_i^4 + 8\lambda_1}\right)$. It is not difficult to see that ridge regression, Scout(2, ·), and Scout(2, 2) result in similar regression coefficients:

$\hat\beta_{rr} = \left(V(D^2 + cI)V^T\right)^{-1}X^Ty$
$\hat\beta_{scout(2,\cdot)} = \left(V(D^2 + \tilde D^2)V^T\right)^{-1}X^Ty$
$\hat\beta_{scout(2,2)} = \left(V(D^2 + \tilde D^2 + \lambda_2 I)V^T\right)^{-1}X^Ty$ (13)

Therefore, while ridge regression simply adds a constant to the diagonal elements of D in the least squares solution, Scout(2, ·) instead adds a function that is monotone decreasing in the value of the diagonal element. (The consequences of this alternative shrinkage are explored under a latent variable model in Section 2.6). Scout(2, 2) is a compromise between Scout(2, ·) and ridge regression.
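Numerically, the contrast between the two shrinkage rules is easy to see: ridge adds the same constant c to every $d_i^2$, whereas Scout(2, ·) adds $\frac{1}{2}\left(-d_i^2 + \sqrt{d_i^4 + 8\lambda_1}\right)$, which shrinks low-variance directions more heavily. A small sketch with arbitrary parameter values (our illustration):

```python
# Sketch: amount added to each squared singular value d_i^2 by ridge vs Scout(2, .).
import numpy as np

d2 = np.linspace(0.0, 10.0, 6)          # squared singular values d_i^2
lam1, c = 2.0, 2.0                      # example tuning parameters (arbitrary)

ridge_add = np.full_like(d2, c)                          # constant shift
scout_add = 0.5 * (-d2 + np.sqrt(d2**2 + 8.0 * lam1))    # monotone decreasing in d_i^2

for a, r, s in zip(d2, ridge_add, scout_add):
    print(f"d_i^2 = {a:5.2f}:  ridge adds {r:.3f},  Scout(2, .) adds {s:.3f}")
```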

In addition, we note that the solutions to the naive elastic net and Scout(2, 1) are quite similar:

$\hat\beta_{enet} = \arg\min_\beta\ \beta^TV(D^2 + cI)V^T\beta - 2\beta^TX^Ty + \lambda_2\|\beta\|_1$
$\hat\beta_{scout(2,1)} = \arg\min_\beta\ \beta^TV(D^2 + \tilde D^2)V^T\beta - 2\beta^TX^Ty + \lambda_2\|\beta\|_1$ (14)

In fact, both solutions can be re-written:

$\hat\beta_{enet} = \arg\min_\beta\ \beta^TX^TX\beta - 2\beta^TX^Ty + c\|\beta\|^2 + \lambda_2\|\beta\|_1$
$\hat\beta_{scout(2,1)} = \arg\min_\beta\ \beta^TV(\hat D^2 + 2\lambda_1 I)V^T\beta - 2\beta^TX^Ty + \lambda_2\|\beta\|_1$
$\qquad = \arg\min_\beta\ \beta^TV\hat D^2V^T\beta - 2\beta^TX^Ty + 2\lambda_1\|\beta\|^2 + \lambda_2\|\beta\|_1$
$\qquad = \arg\min_\beta\ \beta^T{X^*}^TX^*\beta - 2\beta^T{X^*}^Ty^* + 2\lambda_1\|\beta\|^2 + \lambda_2\|\beta\|_1$ (15)

where $\hat D^2$ is a diagonal matrix with diagonal elements $\frac{1}{2}\left(d_i^2 + \sqrt{d_i^4 + 8\lambda_1}\right) - 2\lambda_1$, $X^* = \hat D_{(r)}V_{(r)}^T$, and $y^* = \hat D_{(r)}^{-1}D_{(r)}U_{(r)}^Ty$. Here $D_{(r)}$ and $\hat D_{(r)}$ are the r × r submatrices of D and D̂ corresponding to non-zero diagonal elements, and $U_{(r)}$ and $V_{(r)}$ correspond to the first r columns of U and V. From Equation 15, it is clear that Scout(2, 1) solutions can be obtained using software for the elastic net on data X* (which has dimension no greater than the original data X) and y*. In addition, given the similarity between the elastic net and Scout(2, 1) solutions, it is not surprising that Scout(2, 1) shares some of the elastic net's desirable properties, as is shown in Section 2.5.2.

2.5.2. Variable Grouping Effect

Zou & Hastie (2005) show that unlike the lasso, the elastic net and ridge regression have a variable grouping effect: correlated variables result in similar coefficients. The same is true of Scout(2, 1):

Claim 2

Assume that the predictors are standardized and that y is centered. Let ρ denote the correlation between xi and xj, and let β̂ denote the solution to Scout(2, 1). If sgn(β̂i) = sgn(β̂j), then the following holds:

$|\hat\beta_i - \hat\beta_j| \le \frac{\sqrt{2(1-\rho)}}{\lambda_1}\|y\|$ (16)

The proof of Claim 2 is in Section 8.1.2 in the Appendix. Similar results hold for Scout(2, ·) and Scout(2, 2), without the assumptions about the signs of β̂i and β̂j.

2.5.3. Connections to Regression with Orthogonal Features

Assume that the features are standardized, and consider the scout criterion with p1 = 1. For λ1 sufficiently large, the solution Θ̂xx to the first scout criterion (Equation 6) is a diagonal matrix with diagonal elements $\frac{1}{\lambda_1 + x_i^Tx_i}$. (More specifically, if $\lambda_1 \ge |x_i^Tx_j|$ for all i ≠ j, then the scout criterion with p1 = 1 results in a diagonal matrix; see Banerjee et al. (2008), Theorem 4.) Thus, if $\hat\beta_i$ is the ith component of the Scout(1, ·) solution, then $\hat\beta_i = \frac{x_i^Ty}{\lambda_1 + 1}$. If λ2 > 0, then the resulting scout solutions with p2 = 1 are given by a variation of the univariate soft thresholding formula for L1 regression:

$\hat\beta_i = \frac{1}{\lambda_1 + 1}\,\mathrm{sgn}(x_i^Ty)\,\max\left(0,\ |x_i^Ty| - \frac{\lambda_2}{2}\right)$ (17)

Similarly, if p2 = 2, the resulting scout solutions are given by the following formula:

$\hat\beta = (1 + \lambda_1 + \lambda_2)^{-1}X^Ty$ (18)

Therefore, as the parameter λ1 is increased, the solutions that are obtained range (up to a scaling) from the ordinary Lp2 multivariate regression solution to the regularized regression solution for orthonormal features.
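These limiting formulas amount to univariate soft thresholding (p2 = 1) or uniform scaling (p2 = 2) of the marginal inner products; a minimal sketch (our illustration, assuming standardized features):

```python
# Sketch: Scout(1, p2) in the large-lambda1 regime where the Step-1 estimate is diagonal.
import numpy as np

def scout_orthogonal_limit(X, y, lam1, lam2, p2=1):
    """Equations 17-18: coefficients when Theta_xx is diagonal (features standardized)."""
    xty = X.T @ y                                    # marginal inner products
    if p2 == 1:
        return np.sign(xty) * np.maximum(0.0, np.abs(xty) - lam2 / 2.0) / (lam1 + 1.0)
    return xty / (1.0 + lam1 + lam2)                 # p2 = 2
```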

2.6. An Underlying Latent Variable Model

Let X be an n × p matrix of n observations on p variables, and y an n × 1 vector of response values. Suppose that X and y are generated under the following latent variable model:

$X = d_1u_1v_1^T + d_2u_2v_2^T,\qquad d_1, d_2 > 0$
$y = u_1 + \varepsilon$ (19)

where the ui and vi are the singular vectors of X, and ε is an n × 1 vector of noise.

Claim 3

Under this model, if d1 > d2, then Scout(2, ·) results in estimates of the regression coefficients that have lower variance than those obtained via ridge regression.

A more technical explanation of Claim 3, as well as a proof, are given in Section 8.1.3 of the Appendix. Note that a simple example of the above model would be the case of a block diagonal covariance matrix with two blocks, where one of the blocks of correlated features is associated with the outcome. In the case of gene expression data, these blocks could represent gene pathways, one of which is responsible for, and has expression that is correlated with, the outcome. Claim 3 shows that if the signal associated with the relevant gene pathway is sufficiently large, then Scout(2, ·) will provide a benefit over ridge.
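Claim 3 can be checked by simulation: draw data from model (19), match the two tuning parameters through Equation 45 so that the signal parts agree, and compare the noise in the two estimators. The sketch below is our illustration; the dimensions, singular values, and tuning value are arbitrary choices.

```python
# Sketch: Monte Carlo check of Claim 3 under the latent-variable model (19).
import numpy as np

rng = np.random.default_rng(2)
n, p, d1, d2, lam_sc, reps = 50, 30, 10.0, 4.0, 5.0, 500
lam_rr = (-d1**2 + np.sqrt(d1**4 + 8.0 * lam_sc)) / 2.0   # Equation 45: matched signal parts

# fixed orthonormal singular vectors, then X = d1 u1 v1' + d2 u2 v2'
U, _ = np.linalg.qr(rng.normal(size=(n, 2)))
V, _ = np.linalg.qr(rng.normal(size=(p, 2)))
u1, u2, v1, v2 = U[:, 0], U[:, 1], V[:, 0], V[:, 1]
X = d1 * np.outer(u1, v1) + d2 * np.outer(u2, v2)

def scout2_coef(X, y, lam):
    # Scout(2, .) replaces d_j^2 with (d_j^2 + sqrt(d_j^4 + 8*lam)) / 2 in the LS solution
    Ux, dx, Vtx = np.linalg.svd(X, full_matrices=False)
    shrunk = (dx**2 + np.sqrt(dx**4 + 8.0 * lam)) / 2.0
    return Vtx.T @ ((dx * (Ux.T @ y)) / shrunk)

def ridge_coef(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

signal = d1 / (d1**2 + lam_rr) * v1          # common signal part of both estimators
err_rr, err_sc = [], []
for _ in range(reps):
    y = u1 + rng.normal(size=n)
    err_rr.append(np.sum((ridge_coef(X, y, lam_rr) - signal)**2))
    err_sc.append(np.sum((scout2_coef(X, y, lam_sc) - signal)**2))

print("ridge noise:", np.mean(err_rr), " scout noise:", np.mean(err_sc))  # scout smaller when d1 > d2
```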

3. Numerical Studies: Regression via the Scout

3.1. Simulated Data

We compare the performance of ordinary least squares, the lasso, the elastic net, Scout(2, 1), and Scout(1, 1) on a suite of five simulated examples. The first four simulations are based on those used in the original elastic net paper (Zou & Hastie 2005) and the original lasso paper (Tibshirani 1996). The fifth is of our own invention. All five simulations are based on the model y = Xβ + σε where ε ~ N(0, 1). For each simulation, each data set consists of a small training set, a small validation set (used to select the values of the various parameters) and a large test set. We indicate the size of the training, validation, and test sets using the notation ·/·/·. The five simulations are as follows:

  1. Each data set consists of 20/20/200 observations, 8 predictors with coefficients β = (3, 1.5, 0, 0, 2, 0, 0, 0), and σ = 3. The pairwise correlation between xi and xj is $0.5^{|i-j|}$.

  2. This simulation is as in Simulation 1, except that βi = 0.85 for all i.

  3. Each data set consists of 100/100/400 observations and 40 predictors. βi = 0 for i ∈ 1, …, 10 and for i ∈ 21, …, 30; for all other i, βi = 2. We also set σ = 15 and the correlation between all pairs of predictors was 0.5.

  4. Each data set consists of 50/50/400 observations and 40 predictors. βi = 3 for i ∈ 1, …, 15 and βi = 0 for i ∈ 16, …, 40, and σ = 15. The predictors are generated as follows:

    $x_i = z_1 + \epsilon_i^x,\quad z_1 \sim N(0,1),\quad i = 1,\ldots,5$
    $x_i = z_2 + \epsilon_i^x,\quad z_2 \sim N(0,1),\quad i = 6,\ldots,10$
    $x_i = z_3 + \epsilon_i^x,\quad z_3 \sim N(0,1),\quad i = 11,\ldots,15$ (20)

    Also, $x_i \sim N(0, 1)$ are independent and identically distributed for i = 16, …, 40, and $\epsilon_i^x \sim N(0, 0.01)$ are independent and identically distributed for i = 1, …, 15. (A sketch generating data from this design follows the list.)

  5. Each data set consists of 50/50/400 observations and 50 predictors; βi = 2 for i < 9 and βi = 0 for i ≥ 9. σ = 2 and $\mathrm{Cor}(x_i, x_j) = 0.5\cdot 1_{\{i, j \le 9\}}$.
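As referenced in Simulation 4, here is a data-generating sketch for its grouped-predictor design (our illustration, not the authors' simulation code):

```python
# Sketch (our illustration): one training draw from Simulation 4's grouped design.
import numpy as np

def simulate4(n, sigma=15.0, rng=np.random.default_rng()):
    beta = np.concatenate([np.full(15, 3.0), np.zeros(25)])
    z = rng.normal(size=(n, 3))                     # three latent group factors z1, z2, z3
    X = rng.normal(size=(n, 40))                    # predictors 16-40: independent N(0, 1)
    for g in range(3):                              # predictors 1-15: three groups of five
        X[:, 5 * g:5 * (g + 1)] = z[:, [g]] + rng.normal(scale=0.1, size=(n, 5))
    y = X @ beta + sigma * rng.normal(size=n)
    return X, y

X_train, y_train = simulate4(50)
```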

For each simulation, 200 data sets were generated, and the median mean squared errors (with standard errors given in parentheses) are given in Table 4. For each simulation, the two methods resulting in lowest median mean squared error are shown in bold. The scout provides an improvement over the lasso in all simulations. Both scout methods result in lower mean squared error than the elastic net in Simulations 2, 3, and 5; in Simulations 1 and 4, the scout methods are quite competitive. Table 5 shows median L2 distances between the true and estimated coefficients for each of the five models.

Table 4.

Median mean squared error over 200 simulated data sets is shown for each simulation. Standard errors are given in parentheses. For each simulation, the two methods with lowest median mean squared errors are shown in bold. Least squares was not performed for Simulation 5, because p = n.

Simulation Least Squares Lasso ENet Scout(1, 1) Scout(2, 1)
Sim 1 5.83(0.43) 2.30(0.16) 1.77(0.20) 1.71(0.13) 1.85(0.14)
Sim 2 5.83(0.43) 2.84(0.10) 1.90(0.10) 0.89(0.08) 1.15(0.10)
Sim 3 147.14(3.63) 42.03(0.91) 30.79(0.61) 20.11(0.16) 18.22(0.27)
Sim 4 961.57(42.82) 46.44(2.14) 20.49(1.97) 23.15(1.62) 23.70(1.89)
Sim 5 NA 1.32(0.06) 0.55(0.02) 0.27(0.02) 0.52(0.04)

Table 5.

Median L2 distance over 200 simulated data sets is shown for each simulation; details are as in Table 4.

Simulation Least Squares Lasso ENet Scout(1, 1) Scout(2, 1)
Sim 1 3.05(0.10) 1.74(0.05) 1.65(0.08) 1.58(0.05) 1.62(0.06)
Sim 2 3.05(0.10) 1.95(0.02) 1.62(0.03) 0.90(0.03) 1.04(0.04)
Sim 3 17.03(0.22) 8.91(0.09) 7.70(0.06) 6.15(0.01) 5.83(0.03)
Sim 4 168.40(5.13) 17.40(0.16) 3.85(0.13) 5.19(2.3) 3.80(0.14)
Sim 5 NA 1.23(0.04) 1.03(0.03) 0.62(0.03) 0.89(0.02)

3.2. Making Use of Observations without Response Values

In Step 1 of the Scout Procedure, we estimate the inverse covariance matrix based on the training set X data, and in Steps 2–4, we compute a penalized least squares solution based on that estimated inverse covariance matrix and Cov(X, y). Step 1 of this procedure does not involve the response y at all.

Now, consider a situation in which one has access to a large amount of X data, but responses are known for only some of the observations. (For instance, this could be the case for a medical researcher who has clinical measurements on hundreds of cancer patients, but survival times for only dozens of patients.) More specifically, let X1 denote the observations for which there is an associated response y, and let X2 denote the observations for which no response data is available. Then, one could estimate the inverse covariance matrix in Step 1 of the Scout Procedure using both X1 and X2, and perform Step 2 using Cov(X1, y). By also using X2 in Step 1, we achieve a more accurate estimate of the inverse covariance matrix than would have been possible using only X1.

Such an approach will not provide an improvement in all cases. For instance, consider the trivial case in which the response is a linear function of the predictors, p < n, and there is no noise: y = X1β. Then, the least squares solution, using only X1 and not X2, is β̂ = (X1TX1)−1X1Ty = (X1TX1)−1X1TX1β = β. In this case, it clearly is best to only use X1 in estimating the inverse covariance matrix. However, one can imagine situations in which one can use X2 to obtain a more accurate estimate of the inverse covariance matrix.

Consider a model in which a latent variable has generated some of the features, as well as the response. In particular, suppose that the data are generated as follows:

$x_{ij} = 2u_i + \epsilon_{ij},\quad j = 1,\ldots,5,\quad i = 1,\ldots,n$
$x_{ij} = \epsilon_{ij},\quad j = 6,\ldots,10,\quad i = 1,\ldots,n$
$y_i = 8u_i + 4\epsilon_i,\quad i = 1,\ldots,n$ (21)

In addition, we let εij, εi, ui ~ N (0, 1) i.i.d. The first five variables are “signal” variables, and the rest are “noise” variables. Suppose that we have three sets of observations: a training set of size n = 12, for which the y values are known, a test set of size n = 200, for which we wish to predict the y values, and an additional set of size n = 36 observations for which we do not know the y values and do not wish to predict them. This layout is shown in Table 6.

Table 6.

Making use of observations w/o response values: Set-up. The training set consists of 12 observations and associated responses. We wish to predict responses on a test set of 200 observations. We have access to 36 observations for which responses are not available and not of interest.

Sample Size Response Description
Training Set 12 Available
Test Set 200 Unavailable - Must be predicted
Additional Obs. 36 Unavailable - Not of interest
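A sketch of the semi-supervised variant described above (our illustration), using the p1 = 1 penalty via scikit-learn's graphical lasso, which penalizes only off-diagonal entries; X2 holds the rows without responses:

```python
# Sketch: Scout(1, .) where unlabeled rows enter only through the Step-1 covariance estimate.
import numpy as np
from sklearn.covariance import graphical_lasso

def scout1_semisupervised(X1, y, X2=None, alpha=0.1):
    # Step 1: estimate the inverse covariance of the predictors from all available rows.
    X_all = X1 if X2 is None else np.vstack([X1, X2])
    Xc_all = X_all - X_all.mean(0)
    S_xx = Xc_all.T @ Xc_all / Xc_all.shape[0]
    _, Theta_xx = graphical_lasso(S_xx, alpha=alpha)
    # Steps 2-3 with lambda2 = 0: Cov(X1, y) needs the labeled rows only.
    Xc1, yc = X1 - X1.mean(0), y - y.mean()
    S_xy = Xc1.T @ yc / X1.shape[0]
    beta = Theta_xx @ S_xy
    # Step 4: rescale by the coefficient of y regressed on X1 beta.
    z = Xc1 @ beta
    return beta * (z @ yc) / (z @ z)
```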

We compare the performances of the scout and other regression methods. The scout method is applied in two ways: using only the training set X values to estimate the inverse covariance matrix, and using also the observations without response values. All tuning parameter values are chosen by 5-fold cross-validation. The results in Table 7 are the average mean squared prediction errors obtained over 500 simulations. From the table, it is clear that both versions of scout outperform all of the other methods. In addition, using observations that do not have response values does result in a significant improvement.

Table 7.

Making use of observations w/o response values: Results. Mean squared prediction errors are averaged over 500 simulations; standard errors are shown in parentheses.

Method Mean Squared Prediction Error
Scout(1, ·) w/Additional Obs. 25.65 (0.38)
Scout(1, ·) w/o Additional Obs. 29.92 (0.62)
ENet 32.38 (1.04)
Lasso 47.24 (3.58)
Σ−1Cov(X, y) 86.66 (2.07)
Least Squares 1104.9 (428.84)
Null Model 79.24 (0.3)

In this example, twelve labeled observations on ten variables do not suffice to reliably estimate the inverse covariance matrix. The scout can make use of the observations that lack response values in order to improve the estimate of the inverse covariance matrix, thereby yielding superior predictions. It is worth noting that in this example, the formula β̂ = Σ−1Cov(X1, y) (where Σ is the true covariance matrix) yields an average prediction error that is higher than that of the null model. Therefore, it is clear that in this example, shrinkage is necessary to achieve good predictions.

4. Classification via the Scout

In classification problems, linear discriminant analysis (LDA) can be used if n > p, but when p > n, regularization of the within-class covariance matrix is necessary. Regularized linear discriminant analysis is discussed in Friedman (1989) and Guo et al. (2007). In Guo et al. (2007), the within-class covariance matrix is shrunken, as in ridge regression, by adding a multiple of the identity matrix to the empirical covariance matrix. Here, we instead estimate a shrunken within-class inverse covariance matrix by maximizing its log likelihood, under a multivariate normal model, subject to an Lp penalty on its elements.

4.1. Details of Extension of Scout to Classification

Consider a classification problem with K classes; each observation belongs to some class k ∈ 1, …, K. Let C(i) denote the class of training set observation i, which is denoted Xi. Our goal is to classify observations in an independent test set.

Let μ̂k denote the p × 1 vector that contains the mean of observations in class k, and let $S_{wc} = \frac{1}{n-K}\sum_{k=1}^K\sum_{i:C(i)=k}(X_i - \hat\mu_k)(X_i - \hat\mu_k)^T$ denote the estimated within-class covariance matrix (based on the training set) that is used for ordinary LDA. Then, the scout procedure for classification is as follows:

The Scout Procedure for Classification

  1. Compute the shrunken within-class inverse covariance matrix $\hat\Sigma_{wc,\lambda}^{-1}$ as follows:

    $\hat\Sigma_{wc,\lambda}^{-1} = \arg\max_{\Sigma^{-1}}\left\{\log\det\Sigma^{-1} - \mathrm{tr}\left(S_{wc}\Sigma^{-1}\right) - \lambda\left\|\Sigma^{-1}\right\|^{s}\right\}$ (22)

    where λ is a shrinkage parameter.

  2. Classify test set observation X to class k′ if $k' = \arg\max_k \delta_k^\lambda(X)$, where

    $\delta_k^\lambda(X) = X^T\hat\Sigma_{wc,\lambda}^{-1}\hat\mu_k - \frac{1}{2}\hat\mu_k^T\hat\Sigma_{wc,\lambda}^{-1}\hat\mu_k + \log\pi_k$ (23)

    and πk is the frequency of class k in the training set.

This procedure is analogous to LDA, but we have replaced Swc with a shrunken estimate.

This classification rule performs quite well on real microarray data (as is shown below), but has the drawback that it results in a classification rule that makes use of all of the genes. We can remedy this in one of two ways. We can apply the method described above to only the genes with highest univariate rankings on the training data; this is done in the next section. Alternatively, we can apply an L1 penalty in estimating the quantity $\hat\Sigma_{wc,\lambda}^{-1}\hat\mu_k$; note (from Equation 23) that sparsity in this quantity will result in a classification rule that is sparse in the features. Details of this second method, which is not implemented here, are given in Section 8.2 of the Appendix. We will refer to the method detailed in Equation 23 as Scout(s, ·) because the penalized log likelihood that is maximized in Equation 22 is analogous to the first scout criterion in the regression case. The tuning parameter λ in Equations 22 and 23 can be chosen via cross-validation.
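A sketch of the classifier (our illustration), using the closed-form L2-penalized estimate of the within-class precision matrix from Section 2.4 in place of a graphical-lasso fit; the function names are our own:

```python
# Sketch: scout-style LDA with an L2-shrunken within-class precision matrix.
import numpy as np

def fit_scout_lda(X, labels, lam):
    classes = np.unique(labels)
    n, p = X.shape
    mus = np.array([X[labels == k].mean(0) for k in classes])
    S_wc = np.zeros((p, p))
    for k, mu in zip(classes, mus):
        R = X[labels == k] - mu
        S_wc += R.T @ R
    S_wc /= (n - len(classes))
    # closed-form maximizer of log det(Th) - tr(S_wc Th) - lam * ||Th||_F^2 (Section 2.4)
    s, V = np.linalg.eigh(S_wc)
    theta = (-s + np.sqrt(s**2 + 8.0 * lam)) / (4.0 * lam)
    Theta = (V * theta) @ V.T
    priors = np.array([np.mean(labels == k) for k in classes])
    return classes, mus, Theta, priors

def predict_scout_lda(model, X_new):
    classes, mus, Theta, priors = model
    # Equation 23: delta_k(x) = x' Theta mu_k - mu_k' Theta mu_k / 2 + log pi_k
    scores = X_new @ Theta @ mus.T - 0.5 * np.sum(mus @ Theta * mus, axis=1) + np.log(priors)
    return classes[np.argmax(scores, axis=1)]
```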

4.2. Ramaswamy Data

We assess the performance of this method on the Ramaswamy microarray data set, which is discussed in detail in Ramaswamy et al. (2002) and explored further in Zhu & Hastie (2004) and Guo et al. (2007). It consists of a training set of 144 samples and a test set of 54 samples, each of which contains measurements on 16063 genes. The samples are classified into 14 distinct cancer types. We compare the performance of Scout(2, ·) to nearest shrunken centroids (NSC) (Tibshirani et al. (2002) and Tibshirani et al. (2003)), L2 penalized multinomial (Zhu & Hastie 2004), the support vector machine (SVM) with one-versus-all classification (Ramaswamy et al. 2002), and regularized discriminant analysis (RDA) (Guo et al. 2007). For each method, tuning parameter values were chosen by cross-validation. In addition to running Scout(2, ·) on all 16063 genes, we also ran it on only the genes with highest univariate t-statistics (Tibshirani et al. 2002) in the training set. In the latter case, cross-validation was performed in order to select the number of genes to include in the model. (The model with 4000 genes had lowest cross-validation error). Note that the selection of genes with highest t-statistics was performed separately in each training fold during cross-validation.

The results can be seen in Table 8. Scout(2, ·) performed on the 4000 genes with highest training set t-statistics had the lowest test error rate out of all of the methods that we considered.

Table 8.

Cross-validation and test set errors for the following methods are compared on the Ramaswamy Data: Scout(2, ·), Scout(2, ·) using the genes with highest t-statistics on the training set, regularized discriminant analysis, nearest shrunken centroids, support vector machine using one-versus-all classification, and the L2 penalized multinomial. With the exception of RDA, all methods were performed after cube roots of the data were taken and the patients were standardized. RDA was run on the standardized patients, without taking cube roots, as this led to much better performance.

Method CV Err. (of 144) Test Err. (of 54) No. Genes
NSC 35 17 5217
L2 Penalized Mult. 29 15 16063
SVM 33 14 16063
RDA 34 10 16063
Scout(2, ·) 38 11 16063
Scout(2, ·) High T-stat. 21 8 4000

5. Extension to Generalized Linear Models and the Cox Model

We have discussed the application of the scout to classification and regression problems, and we have shown examples in which these methods perform well. In fact, the scout can also be used in fitting generalized linear models, by replacing the iteratively reweighted least squares step with a covariance-regularized regression. In particular, we discuss the use of the scout in the context of fitting a Cox proportional hazards model for survival data. We present an example involving four lymphoma microarray datasets in which the scout results in improved performance relative to other methods.

5.1. Details of Extension of Scout to the Cox Model

Consider survival data of the form (yi, xi, δi) for i ∈ 1, …, n, where δi is an indicator variable that equals 1 if observation i is complete and 0 if censored, and xi is a vector of predictors $(x_{1i}, \ldots, x_{pi})$ for individual i. Failure times are t1 < t2 < … < tk; there are di failures at time ti. We wish to estimate the parameter β = (β1, …, βp)T in the proportional hazards model $\lambda(t|x) = \lambda_0(t)\exp\left(\sum_j x_j\beta_j\right)$. We assume that censoring is noninformative. Letting η = Xβ, D the set of indices of the failures, Rr the set of indices of the individuals at risk at time tr, and Dr the set of indices of the failures at tr, the partial likelihood is given as follows (see e.g. Kalbfleisch & Prentice (1980)):

$L(\beta) = \prod_{r\in D}\frac{\exp\left(\sum_{j\in D_r}\eta_j\right)}{\left(\sum_{j\in R_r}\exp(\eta_j)\right)^{d_r}}$ (24)

In order to fit the proportional hazards model, we must find the β that maximizes the likelihood above. Note that $\frac{\partial l}{\partial\beta} = \left(\frac{\partial\eta}{\partial\beta}\right)^T\frac{\partial l}{\partial\eta} = X^T\frac{\partial l}{\partial\eta}$ and $\frac{\partial^2 l}{\partial\beta\,\partial\beta^T} = X^T\frac{\partial^2 l}{\partial\eta\,\partial\eta^T}X$. Let $u = \frac{\partial l}{\partial\eta}$ and $A = -\frac{\partial^2 l}{\partial\eta\,\partial\eta^T}$. The iteratively reweighted least squares algorithm that implements the Newton-Raphson method, for β0 the value of β from the previous step, involves finding β that solves

$X^TAX(\beta - \beta_0) = X^Tu$ (25)

This is equivalent to finding β that minimizes

$\|y^* - X^*\beta^*\|^2$ (26)

where $X^* = A^{1/2}X$, $y^* = A^{-1/2}u$, and $\beta^* = \beta - \beta_0$ (Green 1984).

The traditional iterative reweighted least squares algorithm involves solving the above least squares problem repeatedly, recomputing y* and X* at each step and setting β0 equal to the solution β attained at the previous iteration. We propose to solve the above equation using the scout, rather than by a simple linear regression. We have found empirically that good results are obtained if we initially set β0 = 0, and then perform just one Newton-Raphson step (using the scout). This is convenient, since for data sets with many features, solving a scout regression can be time-consuming. Therefore, our implementation of the scout method for survival data involves simply performing one Newton-Raphson step, beginning with β0 = 0.

Using the notation $\Theta = \begin{pmatrix}\Theta_{xx} & \Theta_{xy} \\ \Theta_{xy}^T & \Theta_{yy}\end{pmatrix}$ and $S = \begin{pmatrix}X^TAX & X^Tu \\ u^TX & u^TA^{-1}u\end{pmatrix}$, the Scout Procedure for survival data is almost identical to the regression case, as follows:

The Scout Procedure for the Cox Model

  1. Let Θ̂xx maximize

    $\log\det\Theta_{xx} - \mathrm{tr}(S_{xx}\Theta_{xx}) - \lambda_1\|\Theta_{xx}\|^{p_1}$ (27)
  2. Let Θ̂ maximize

    $\log\det\Theta - \mathrm{tr}(S\Theta) - \lambda_2\|\Theta\|^{p_2}$ (28)

    where the top p × p submatrix of Θ is constrained to equal Θ̂xx, obtained in the previous step.

  3. Compute $\hat\beta = -\hat\Theta_{xy}/\hat\Theta_{yy}$.

  4. Let β̂* = cβ̂, where c is the coefficient of a Cox proportional hazards model fit to y using Xβ̂ as a predictor.

β̂* obtained in Step 4 is the vector of estimated coefficients for the Cox proportional hazards model. In the procedure above, λ1, λ2 > 0 are tuning parameters. In keeping with the notation of previous sections, we will refer to the resulting coefficient estimates as Scout(p1, p2).
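A sketch of the one-step procedure (our illustration): given the score u and the information matrix A of the partial likelihood at β0 = 0, as supplied by any Cox routine (their computation is not shown here), it forms the blocks of S and reuses the Section 2.4 machinery with p1 = 2 and λ2 = 0.

```python
# Sketch: one scout Newton-Raphson step for the Cox model, given the score u and the
# (negative Hessian) information matrix A of the partial likelihood at beta0 = 0.
import numpy as np

def scout_cox_step(X, u, A, lam1):
    S_xx = X.T @ A @ X
    S_xy = X.T @ u
    # Step 1 with p1 = 2: closed-form shrunken estimate of (Sigma_xx)^{-1}
    s, V = np.linalg.eigh(S_xx)
    theta = (-s + np.sqrt(s**2 + 8.0 * lam1)) / (4.0 * lam1)
    Theta_xx = (V * theta) @ V.T
    beta = Theta_xx @ S_xy        # Steps 2-3 with lambda2 = 0 (as in Section 2.4)
    return beta                   # Step 4 would rescale beta via a Cox fit on X @ beta
```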

5.2. Lymphoma Data

We illustrate the effectiveness of the scout method on survival data using four different data sets, all involving survival times and gene expression measurements for patients with diffuse large B-cell lymphoma. The four data sets are as follows: Rosenwald et al. (2002) (“Rosenwald”), which consists of 240 patients, Shipp et al. (2002) (“Shipp”), which consists of 58 patients, Hummel et al. (2006) (“Hummel”), which consists of 81 patients, and Monti et al. (2005) (“Monti”), which consists of 129 patients. For consistency and ease of comparison, we considered only a subset of around 1482 genes that were present in all four data sets.

We randomly split each of the data sets into a training set, a validation set, and a test set of equal sizes. For each data set, we fit four models to the training set: the L1 penalized Cox proportional hazards (“L1 Cox”) method of Park & Hastie (2007), the supervised principal components (SPC) method of Bair & Tibshirani (2004), Scout(2, 1), and Scout(1, 1). For each data set, we chose the tuning parameter values that resulted in the predictor that gave the highest log likelihood when used to fit a Cox proportional hazards model on the validation set (this predictor was Xvalβtrain for L1 Cox and scout, and it was the first supervised principal component for SPC). We tested the resulting models on the test set. The mean value of 2(log(L) − log(Lo)) over ten separate training/test/validation set splits is given in Table 9, where L denotes the likelihood of the Cox proportional hazards model fit on the test set using the predictor obtained from the training set (for L1 Cox and scout, this was Xtestβtrain, and for SPC, this was the first supervised principal component), and Lo denotes the likelihood of the null model. From Tables 9 and 10, it is clear that the scout results in predictors that are quite competitive with, if not better than, the competing methods on all four data sets.

Table 9.

Mean of 2(log(L) − log(Lo)) on Survival Data. L1 Cox, supervised principal components, Scout(1, 1), and Scout(2, 1) are compared on the Hummel, Monti, Rosenwald, and Shipp data sets over ten random training/validation/test set splits. The predictor obtained from the training set is fit on the test set using a Cox proportional hazards model, and the mean value of 2(log(L) − log(Lo)) over the ten repetitions is reported. For each data set, the two highest mean values of 2(log(L) − log(Lo)) are shown in bold.

Data Set L1 Cox SPC Scout(1, 1) Scout(2, 1)
Hummel 2.640 3.823 4.245 3.293
Monti 1.647 1.231 2.149 2.606
Rosenwald 4.129 3.542 3.987 4.930
Shipp 1.903 1.004 2.807 2.627

Table 10.

Median Number of Genes Used for Survival Data. L1 Cox, supervised principal components, Scout(1, 1) and Scout(2, 1) are compared on the Hummel, Monti, Rosenwald, and Shipp data sets over ten random training/validation/test set splits; the median number of genes used in each of the resulting models is reported.

Data Set L1Cox SPC Scout(1, 1) Scout(2, 1)
Hummel 14 33 78 13
Monti 18.5 17 801.5 144.5
Rosenwald 37.5 32 294 85
Shipp 5.5 10 4.5 5

6. Discussion

We have presented covariance-regularized regression, a class of regression procedures (the “scout” family) obtained by maximizing the log likelihood of the inverse covariance matrix of the data, rather than by minimizing the sum of squared errors, subject to a penalty. We have shown that three well-known regression methods - ridge, the lasso, and the elastic net - fall into the covariance-regularized regression framework. In addition, we have explored some new methods within this framework. We have extended the covariance-regularized regression framework to classification and generalized linear model settings, and we have demonstrated the performance of the resulting methods on a number of gene expression data sets.

A drawback of the scout method is that when p1 = 1 and the number of features is large, then maximizing the first scout criterion can be quite slow. When more than a few thousand features are present, the scout with p1 = 1 is not a viable option at present. However, scout with p1 = 2 is very fast, and we are confident that computational and algorithmic improvements will lead to increases in the number of features for which the scout criteria can be maximized with p1 = 1.

Covariance-regularized regression represents a new way to understand existing regularization methods for regression, as well as an approach to develop new regularization methods that appear to perform better in practice in many examples.

Acknowledgments

We thank Trevor Hastie for showing us the solution to the penalized log likelihood with an L2 penalty. We thank both Trevor Hastie and Jerome Friedman for valuable discussions and for providing the code for the L2 penalized multinomial and the elastic net. Daniela Witten was supported by a National Defense Science and Engineering Graduate Fellowship. Robert Tibshirani was partially supported by National Science Foundation Grant DMS-9971405 and National Institutes of Health Contract N01-HV-28183.

8. Appendix

8.1. Proofs of Claims

8.1.1. Proof of Claim 1

First, suppose that p2 = 1. Consider the penalized log-likelihood

$\log\det\Theta - \mathrm{tr}(S\Theta) - \frac{\lambda_2}{2}\|\Theta\|_1$ (29)

with Θxx, the top left p × p submatrix of Θ, fixed to equal the matrix that maximizes the log likelihood in Step 1 of the Scout Procedure. It is clear that if Θ̂ maximizes the log likelihood, then $(\hat\Theta^{-1})_{yy} = S_{yy} + \frac{\lambda_2}{2}$. The subgradient equation for maximization of the remaining portion of the log-likelihood is

$0 = (\Theta^{-1})_{xy} - S_{xy} - \frac{\lambda_2}{2}\Gamma$ (30)

where Γi = 1 if the ith element of Θxy is positive, Γi = −1 if the ith element of Θxy is negative, and otherwise Γi is between −1 and 1.

Let β = Θxx(Θ−1)xy. Therefore, we equivalently wish to find β that solves

$0 = 2(\Theta_{xx})^{-1}\beta - 2S_{xy} - \lambda_2\Gamma$ (31)

From the partitioned inverse formula, it is clear that sgn(β) = −sgn(Θxy). Therefore, our task is equivalent to finding β which minimizes

$\beta^T(\Theta_{xx})^{-1}\beta - 2S_{xy}^T\beta + \lambda_2\|\beta\|_1$ (32)

Of course, this is Equation 11. It is an L1-penalized regression of y onto X, using only the inner products, with Sxx replaced with (Θxx) −1. In other words, β̂ that solves Equation 11 is given by Θxx(Θ−1)xy, where Θ solves Step 2 of the Scout Procedure.

Now, the solution to Step 3 of the Scout Procedure is $-\Theta_{xy}/\Theta_{yy}$. By the partitioned inverse formula, Θxx(Θ−1)xy + Θxy(Θ−1)yy = 0, so $-\Theta_{xy}/\Theta_{yy} = \frac{\Theta_{xx}(\Theta^{-1})_{xy}}{(\Theta^{-1})_{yy}\Theta_{yy}} = \frac{\beta}{(\Theta^{-1})_{yy}\Theta_{yy}}$. In other words, the solution to Step 3 of the Scout Procedure and the solution to Equation 11 differ by a factor of (Θ−1)yyΘyy. Since Step 4 of the Scout Procedure involves scaling the solution to Step 3 by a constant, it is clear that one can replace Step 3 of the Scout Procedure with the solution to Equation 11.

Now, suppose p2 = 2. To find Θxy that maximizes this penalized log-likelihood, we take the gradient and set it to zero:

$0 = (\Theta^{-1})_{xy} - S_{xy} - \frac{\lambda_2}{2}\Theta_{xy}$ (33)

Again, let β = Θxx(Θ−1)xy. Therefore, we equivalently wish to find β that solves

$0 = 2(\Theta_{xx})^{-1}\beta - 2S_{xy} + 2\lambda_3\beta$ (34)

for some new constant λ3, using the fact, from the partitioned inverse formula, that $-\beta/(\Theta^{-1})_{yy} = \Theta_{xy}$. The solution β minimizes

$\beta^T(\Theta_{xx})^{-1}\beta - 2S_{xy}^T\beta + \lambda_3\beta^T\beta$

Of course, this is again Equation 11. Therefore, β̂ that solves Equation 11 is given (up to scaling by a constant) by Θxx(Θ−1)xy, where Θ solves Step 2 of the Scout Procedure. As before, by the partitioned inverse formula, and since Step 4 of the Scout Procedure involves scaling the solution to Step 3 by a constant, it is clear that one can replace Step 3 of the Scout Procedure with the solution to Equation 11.

8.1.2. Proof of Claim 2

Recall that the solution to Scout(2, 1) minimizes the following:

$\beta^T\left(V(D^2 + \tilde D^2)V^T\right)\beta - 2\beta^TX^Ty + \lambda_2\|\beta\|_1$ (35)

where $\tilde D^2$ is a p × p diagonal matrix with ith diagonal entry equal to $\frac{1}{2}\left(-d_i^2 + \sqrt{d_i^4 + 8\lambda_1}\right)$ and $X = UDV^T$. Equivalently, the solution minimizes

$\beta^T\left(V\left(\tfrac{1}{2}D^2 + \tfrac{1}{2}\bar D^2 + 2\lambda_1 I\right)V^T\right)\beta - 2\beta^TX^Ty + \lambda_2\|\beta\|_1 = \beta^T\left(V\left(\tfrac{1}{2}D^2 + \tfrac{1}{2}\bar D^2\right)V^T\right)\beta - 2\beta^TX^Ty + \lambda_2\|\beta\|_1 + 2\lambda_1\|\beta\|^2$ (36)

where $\bar D^2$ is the diagonal matrix with elements $\sqrt{d_i^4 + 8\lambda_1} - 4\lambda_1$ (so that $\tfrac{1}{2}D^2 + \tfrac{1}{2}\bar D^2 + 2\lambda_1 I = D^2 + \tilde D^2$), because V is p × p orthogonal. It is easy to see that the solution also minimizes the following:

$\|\tilde y - \tilde X\beta\|^2 + \lambda_2\|\beta\|_1 + 2\lambda_1\|\beta\|^2$ (37)

where $\tilde X = \frac{1}{\sqrt{2}}\begin{pmatrix} X \\ \bar DV^T \end{pmatrix}$ and $\tilde y = \begin{pmatrix} \sqrt{2}\,y \\ 0 \end{pmatrix}$. If β̂ minimizes Equation 37, and if we assume that sgn(β̂i) = sgn(β̂j), then it follows that

$2\lambda_1(\hat\beta_i - \hat\beta_j) = (\tilde x_i - \tilde x_j)^T(\tilde y - \tilde X\hat\beta)$ (38)

Note that

$\|\tilde y - \tilde X\hat\beta\|^2 \le \|\tilde y - \tilde X\hat\beta\|^2 + \lambda_2\|\hat\beta\|_1 + 2\lambda_1\|\hat\beta\|^2 \le \|\tilde y\|^2 = 2\|y\|^2$ (39)

Therefore,

$|\hat\beta_i - \hat\beta_j| \le \frac{1}{2\lambda_1}\|\tilde x_i - \tilde x_j\|\cdot\sqrt{2}\,\|y\|$ (40)

Now, $\|\tilde x_i - \tilde x_j\|^2 = \frac{1}{2}\|x_i - x_j\|^2 + \frac{1}{2}\|(\bar DV^T)_i - (\bar DV^T)_j\|^2$. Since we assumed that the features are standardized, it follows that $\|\tilde x_i - \tilde x_j\|^2 = 1 - \rho + \frac{1}{2}\|(\bar DV^T)_i - (\bar DV^T)_j\|^2$, where ρ is the correlation between xi and xj. It also is easy to see that $\|(\bar DV^T)_i - (\bar DV^T)_j\|^2 \le 1 - \rho$. Therefore, it follows that

$|\hat\beta_i - \hat\beta_j| \le \frac{\sqrt{2(1-\rho)}}{\lambda_1}\|y\|$ (41)

8.1.3. Proof of Claim 3

Consider data generated under the latent variable model given in Section 2.6. Note that it follows that

$X^TX = d_1^2v_1v_1^T + d_2^2v_2v_2^T = \sum_{j=1}^p d_j^2v_jv_j^T$ (42)

where d3 = … = dp = 0. We consider two options for the regression of y onto X: ridge regression and Scout(2, ·). Let β̂rr and β̂sc denote the resulting estimates, and let λrr and λsc be the tuning parameters of the two methods, respectively.

$\hat\beta_{rr} = (X^TX + \lambda_{rr}I)^{-1}X^Ty = \left(\sum_{j=1}^p\frac{1}{d_j^2+\lambda_{rr}}v_jv_j^T\right)\left(d_1v_1u_1^T + d_2v_2u_2^T\right)(u_1 + \varepsilon)$
$\quad = \frac{d_1}{d_1^2+\lambda_{rr}}v_1 + \left(\frac{d_1}{d_1^2+\lambda_{rr}}v_1u_1^T + \frac{d_2}{d_2^2+\lambda_{rr}}v_2u_2^T\right)\varepsilon$ (43)

Similarly, using the fact that Scout(2, ·) results in replacing $d_j^2$ with $\frac{1}{2}\left(d_j^2 + \sqrt{d_j^4 + 8\lambda_{sc}}\right)$:

$\hat\beta_{sc} = \left(\sum_{j=1}^p\frac{2}{d_j^2+\sqrt{d_j^4+8\lambda_{sc}}}v_jv_j^T\right)\left(d_1v_1u_1^T + d_2v_2u_2^T\right)(u_1 + \varepsilon)$
$\quad = \frac{2d_1}{d_1^2+\sqrt{d_1^4+8\lambda_{sc}}}v_1 + \left(\frac{2d_1}{d_1^2+\sqrt{d_1^4+8\lambda_{sc}}}v_1u_1^T + \frac{2d_2}{d_2^2+\sqrt{d_2^4+8\lambda_{sc}}}v_2u_2^T\right)\varepsilon$ (44)

It is clear that the signal part of β̂rr is $\frac{d_1}{d_1^2+\lambda_{rr}}v_1$, and that the signal part of β̂sc is $\frac{2d_1}{d_1^2+\sqrt{d_1^4+8\lambda_{sc}}}v_1$. The following relationship between λrr and λsc results in signals that are equal:

$\lambda_{rr} = \frac{-d_1^2 + \sqrt{d_1^4 + 8\lambda_{sc}}}{2}$ (45)

From now on, we assume that Equation 45 holds. It is clear that the noise parts of β̂rr and β̂sc are $\left(\frac{d_1}{d_1^2+\lambda_{rr}}v_1u_1^T + \frac{d_2}{d_2^2+\lambda_{rr}}v_2u_2^T\right)\varepsilon$ and $\left(\frac{2d_1}{d_1^2+\sqrt{d_1^4+8\lambda_{sc}}}v_1u_1^T + \frac{2d_2}{d_2^2+\sqrt{d_2^4+8\lambda_{sc}}}v_2u_2^T\right)\varepsilon$, respectively. Using Equation 45, we know that $\frac{d_1}{d_1^2+\lambda_{rr}}v_1u_1^T = \frac{2d_1}{d_1^2+\sqrt{d_1^4+8\lambda_{sc}}}v_1u_1^T$. So it suffices to compare $\frac{d_2}{d_2^2+\lambda_{rr}}$ and $\frac{2d_2}{d_2^2+\sqrt{d_2^4+8\lambda_{sc}}}$. Recall that d2 > 0. So,

$\left(d_2^2 + \lambda_{rr}\right) - \frac{d_2^2 + \sqrt{d_2^4 + 8\lambda_{sc}}}{2} = \left(d_2^2 + \frac{-d_1^2 + \sqrt{d_1^4 + 8\lambda_{sc}}}{2}\right) - \frac{d_2^2 + \sqrt{d_2^4 + 8\lambda_{sc}}}{2} = \frac{1}{2}\left\{\left(\sqrt{d_1^4 + 8\lambda_{sc}} - d_1^2\right) - \left(\sqrt{d_2^4 + 8\lambda_{sc}} - d_2^2\right)\right\}$ (46)

If d1 > d2, then the above quantity is negative; if d1 < d2, then it is positive. Therefore, the scout solution has a smaller noise term than the ridge solution if and only if d1 > d2. In other words, if the portion of X that is correlated with y has a stronger signal than the portion that is orthogonal to y, then Scout(2, ·) will perform better than ridge, because it will shrink the parts of the inverse covariance matrix that correspond to variables that are uncorrelated with the response.

8.2. Feature Selection for Scout LDA

The method that we propose in Section 4.1 can be easily modified in order to perform built-in feature selection. Using the notation in Section 4.1, we observe that

$\hat\mu_k = \arg\min_{\mu_k}\left\{\sum_{i:C(i)=k}(X_i - \mu_k)^T\hat\Sigma_{wc,\lambda}^{-1}(X_i - \mu_k)\right\}$ (47)

and so we replace μ̂k in Equation 23 with

$\hat\mu_k^{\lambda,\rho} = \arg\min_{\mu_k}\left\{\sum_{i:C(i)=k}(X_i - \mu_k)^T\hat\Sigma_{wc,\lambda}^{-1}(X_i - \mu_k) + \rho\left\|\hat\Sigma_{wc,\lambda}^{-1}\mu_k\right\|_1\right\}$ (48)

The above can be solved via an L1 regression, and it gives the following classification rule for a test observation X:

$\delta_k^{\lambda,\rho}(X) = X^T\hat\Sigma_{wc,\lambda}^{-1}\hat\mu_k^{\lambda,\rho} - \frac{1}{2}\left(\hat\mu_k^{\lambda,\rho}\right)^T\hat\Sigma_{wc,\lambda}^{-1}\hat\mu_k^{\lambda,\rho} + \log\pi_k$ (49)

Contributor Information

Daniela M. Witten, Department of Statistics, Stanford University, 390 Serra Mall, Stanford CA 94305, USA. E-mail: dwitten@stanford.edu

Robert Tibshirani, Departments of Statistics and Health Research & Policy, Stanford University, 390 Serra Mall, Stanford CA 94305, USA. E-mail: tibs@stat.stanford.edu.

References

  1. Bair E, Tibshirani R. Semi-supervised methods to predict patient survival from gene expression data. PLOS Biology. 2004;2:511–522. doi: 10.1371/journal.pbio.0020108.
  2. Banerjee O, El Ghaoui LE, d'Aspremont A. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research. 2008.
  3. Frank I, Friedman J. A statistical view of some chemometrics regression tools (with discussion). Technometrics. 1993;35(2):109–148.
  4. Friedman J. Regularized discriminant analysis. Journal of the American Statistical Association. 1989;84:165–175.
  5. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. 2008a. In preparation.
  6. Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008b. doi: 10.1093/biostatistics/kxm045.
  7. Green PJ. Iteratively reweighted least squares for maximum likelihood estimation, and some robust and resistant alternatives. Journal of the Royal Statistical Society, Series B. 1984;46:149–192.
  8. Guo Y, Hastie T, Tibshirani R. Regularized linear discriminant analysis and its application in microarrays. Biostatistics. 2007;8:86–100. doi: 10.1093/biostatistics/kxj035.
  9. Hoerl AE, Kennard R. Ridge regression: biased estimation for nonorthogonal problems. Technometrics. 1970;12:55–67.
  10. Hummel M, Bentink S, Berger H, Klappwe W, Wessendorf S, Barth FTE, Bernd H-W, Cogliatti SB, Dierlamm J, Feller AC, Hansmann ML, Haralambieva E, Harder L, Hasenclever D, Kuhn M, Lenze D, Lichter P, Martin-Subero JI, Moller P, Muller-Hermelink H-K, Ott G, Parwaresch RM, Pott C, Rosenwald A, Rosolowski M, Schwaenen C, Sturzenhofecker B, Szczepanowski M, Trautmann H, Wacker H-H, Spang R, Loefler M, Trumper L, Stein H, Siebert R. A biological definition of Burkitt's lymphoma from transcriptional and genomic profiling. New England Journal of Medicine. 2006;354:2419–2430. doi: 10.1056/NEJMoa055351.
  11. Kalbfleisch J, Prentice R. The Statistical Analysis of Failure Time Data. Wiley; New York: 1980.
  12. Meinshausen N, Bühlmann P. High dimensional graphs and variable selection with the lasso. Annals of Statistics. 2006;34:1436–1462.
  13. Monti S, Savage KJ, Kutok JL, Feuerhake F, Kurtin P, Mihm M, Wu B, Pasqualucci L, Neuberg D, Aguiar RCT, Dal Cin P, Ladd C, Pinkus GS, Salles G, Harris NL, Dalla-Favera R, Habermann TM, Aster JC, Golub TR, Shipp MA. Molecular profiling of diffuse large B-cell lymphoma identifies robust subtypes including one characterized by host inflammatory response. Blood. 2005;105:1851–1861. doi: 10.1182/blood-2004-07-2947.
  14. Park MY, Hastie T. An L1 regularization path algorithm for generalized linear models. Journal of the Royal Statistical Society, Series B. 2007;69(4):659–677.
  15. Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang C, Angelo M, Ladd C, Reich M, Latulippe E, Mesirov J, Poggio T, Gerald W, Loda M, Lander E, Golub T. Multiclass cancer diagnosis using tumor gene expression signatures. Proceedings of the National Academy of Sciences. 2002;98:15149–15154. doi: 10.1073/pnas.211566398.
  16. Rosenwald A, Wright G, Chan WC, Connors JM, Campo E, Fisher RI, Gascoyne RD, Muller-Hermelink HK, Smeland EB, Staudt LM. The use of molecular profiling to predict survival after chemotherapy for diffuse large B-cell lymphoma. The New England Journal of Medicine. 2002;346:1937–1947. doi: 10.1056/NEJMoa012914.
  17. Shipp MA, Ross KN, Tamayo P, Weng AP, Kutok JL, Aguiar RC, Gaasenbeek M, Angelo M, Reich M, Pinkus GS, Ray TS, Koval MA, Last KW, Norton A, Lister TA, Mesirov J, Neuberg DS, Lander ES, Aster JC, Golub TR. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nature Medicine. 2002;8:68–74. doi: 10.1038/nm0102-68.
  18. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B. 1996;58:267–288.
  19. Tibshirani R, Hastie T, Narasimhan B, Chu G. Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proceedings of the National Academy of Sciences. 2002;99:6567–6572. doi: 10.1073/pnas.082099299.
  20. Tibshirani R, Hastie T, Narasimhan B, Chu G. Class prediction by nearest shrunken centroids, with applications to DNA microarrays. Statistical Science. 2003:104–117.
  21. Zhu J, Hastie T. Classification of gene microarrays by penalized logistic regression. Biostatistics. 2004;5(2):427–443. doi: 10.1093/biostatistics/5.3.427.
  22. Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B. 2005;67:301–320.
