Abstract
Linear discriminant analysis (LDA) is a well-known classification technique that has enjoyed great success in practical applications. Despite its effectiveness for traditional low-dimensional problems, extensions of LDA are necessary for classifying high-dimensional data. Many variants of LDA have been proposed in the literature. However, most of these methods do not fully incorporate structure information among the predictors when such information is available. In this paper, we introduce a new high-dimensional LDA technique, namely graph-based sparse LDA (GSLDA), that utilizes the graph structure among the features. In particular, we use the regularized regression formulation of penalized LDA and propose to impose a structure-based sparse penalty on the discriminant vector β. The graph structure can be either given or estimated from the training data. Moreover, we explore the relationship between the within-class feature structure and the overall feature structure. Based on this relationship, we further propose a variant of GSLDA that effectively utilizes unlabeled data, which can be abundant in the semi-supervised learning setting. With the new regularization, we obtain a sparse estimate of β and classifiers that are more accurate and more interpretable than many existing methods. Both the selection consistency of the β estimate and the convergence rate of the classifier are established, and the misclassification rate of the resulting classifier converges to the Bayes error rate asymptotically. Finally, we demonstrate the competitive performance of the proposed GSLDA in both simulated and real data studies.
Keywords: Feature structure, Gaussian graphical models, Regularization, Undirected graph
1. Introduction
Classification problems are commonly seen in practice. There are many existing classification techniques in the literature; see [2, 17] for a comprehensive review. Among various existing methods, linear discriminant analysis (LDA) has a long history and remains an important tool in the standard classification toolbox. LDA can be viewed as a classification rule for the problem of two Gaussian populations with a common covariance matrix. Despite its seemingly strong assumptions, LDA often works well in practice, especially for low-dimensional problems [15]. It mimics Bayes’ rule and has a simple closed form which only involves the within-class sample covariance matrix and the group averages. In its original formulation, the discriminant vector of LDA is the product of the inverse of the within-class sample covariance matrix and the vector of group-mean differences. Thus, standard LDA can be computed and implemented easily in the traditional low-dimensional setting. LDA also has interpretations beyond the Gaussian model. In particular, the same formulation can be obtained from Fisher’s discriminant analysis problem [13], the optimal scoring problem [16], and linear regression [17].
Despite the usefulness of LDA, it needs to be adapted when the dimension of features is high. For example, the form of standard LDA is only valid when the sample covariance matrix is invertible. Moreover, as the dimension grows, the errors in the sample covariance and group means accumulate and consequently LDA can become increasingly unstable [11, 35]. To address this problem, a number of LDA extensions have been proposed for high-dimensional scenarios.
The existing high-dimensional LDA methods in the literature can be roughly divided into two categories: plug-in approaches and direct approaches. A plug-in approach tackles high-dimensional problems by using regularized estimates of the within-class covariance matrix and group means. For example, the naive Bayes method, or the independence rule, treats the covariance matrix as diagonal. Bickel and Levina [1] showed that it outperforms LDA based on the Moore–Penrose pseudoinverse of the sample covariance matrix when the dimension grows faster than the sample size. To further reduce the instability of LDA, Tibshirani et al. [36] additionally used shrunken estimates of the group means. Fan and Fan [11] showed that, even under the independent-feature assumption, naive Bayes can be as bad as random guessing due to error accumulation in the group means. They resolved this issue by reducing the dimension via feature screening. In contrast to these independence rules, Shao et al. [35] assumed sparsity of the covariance matrix and the mean difference vector, and used thresholded estimates to construct a sparse LDA classifier, which was shown to be asymptotically optimal under certain conditions. All of these methods adopt the original formulation of LDA by plugging in improved estimates of the covariance matrix and group means. Thus, some strong assumptions on the covariance matrix and the group means need to be imposed for the resulting LDA rule.
In contrast to the plug-in methods, direct approaches aim at estimating the discriminant vector β directly. Since LDA can also be obtained from some risk minimization problems, it can be extended to high-dimensional scenarios via these formulations with regularization on β. For example, Wu et al. [41] considered Fisher’s discriminant analysis and proposed an ℓ1 -penalized version for dimension reduction. The corresponding problem has a piece-wise linear solution path which can be computed efficiently. Witten and Tibshirani [39] also used Fisher’s discriminant analysis formulation for a general K-class problem with a general regularization. Clemmensen et al. [10] proposed the optimal scoring formulation with the ℓ1 penalty. Following the idea of minimizing the misclassification rates, Fan et al. [12] proposed a method closely related to the method by Wu et al. [41] and directly computed the misclassification rate of the classifier. Mai et al. [26] took advantage of the regression formulation and estimated the discriminant vector of LDA by solving a Lasso-type problem, which was shown to have the same solution path as the method of Wu et al. [41] and the method of Clemmensen et al. [10] when K = 2; see [25]. Using a different idea of direct estimation, Cai and Liu [6] formed a linear programming problem to estimate β and showed that the error rate of the estimated classifier is close to the Bayes rule under certain conditions. Compared to plug-in approaches, these methods estimate LDA directly and the assumptions can be less stringent since only the sparsity of the discriminant vector of LDA is assumed [6].
Both plug-in and direct methods can work well for certain practical problems. However, these methods do not utilize feature structure information when it is available. In practice, features are often correlated with some structure. Such structure can usually be represented by an undirected graph. Connected features may work together and thus be effective or ineffective simultaneously for classification. For instance, in the diagnosis of a disease using genetic information, genes are naturally grouped by their functions or gene pathways. Relevant genes tend to contribute, or not contribute, to the disease together. Moreover, when the population in consideration is Gaussian, the conditional independence graph, or Gaussian graphical model, often represents a natural structure. By considering such structure information, we are likely to be able to construct a better classifier. For regression problems, there are some methods in the literature that utilize the graph structure; see, e.g., [3, 18, 33, 53]. For example, Li and Li [19] proposed a penalty on the coefficient difference of each pair of connected features. Yang et al. [42] used pairwise ℓ∞ penalties on connected features to encourage their simultaneous inclusion or exclusion. Based on a decomposition of the regression coefficient vector, Yu and Liu [44] proposed a node-wise penalty, in which the regularization term is a summation of penalties over all nodes rather than all edges. Compared to pairwise penalties, the node-wise penalty is better motivated and computationally more efficient. More recently, Zhao and Shojaie [51] proposed new inference methods for such graph-constrained estimation.
Despite great progress for regression problems, much less research has been done for classification problems. Structured penalties such as group Lasso and fused Lasso have been employed in classification methods [27, 39], but they are not applicable to a general sparse graph structure among predictors. Zhang et al. [49] considered logistic regression with a combination of ℓ1 penalty and pairwise ℓ2 difference penalty. Min et al. [29] generalized the regularization and provided a unified algorithm. However, both methods may also suffer from too much computational burden in high dimensions. Very recently, Wu et al. [40] proposed an unsupervised graph-based variable screening method for general problems.
In this paper, we propose a new method, called graph-based sparse LDA (GSLDA), that exploits the graphical structure of features. GSLDA estimates LDA in high dimensions directly by solving a convex optimization problem. Similar to the sparse regression method in [44], we incorporate the graph structure through a node-wise penalty. In the presence of an underlying feature structure, the new method outperforms existing high-dimensional LDA methods by utilizing the structure directly. As a key component, the graphical structure can be either given or estimated from the training data. In addition, we investigate the relationship between the within-class inverse covariance matrix and overall inverse covariance matrix. Based on these findings, we propose a variant of GSLDA that can utilize unlabeled data, which are often much more accessible than labeled data. We name this variant as the semi-supervised GSLDA. Selection consistency is shown for the estimated discriminant vector. Moreover, we show that the misclassification rate of our classifier converges to the Bayes error rate at a fast rate under certain conditions. Numerical studies are used to demonstrate the performance of this method. In particular, the semi-supervised GSLDA enjoys higher classification accuracy than the original GSLDA method in most cases. This reveals the potential advantages of using unlabeled data in classification problems.
The rest of the paper is organized as follows. In Section 2, we review some existing high-dimensional LDA methods, and introduce our motivations and formulations of our proposed methods. Section 3 focuses on graph estimation and the implementation of GSLDA. In particular, graph estimation methods are discussed for both GSLDA and its variant. In Section 4, theoretical justification is provided for our method. Sections 5 and 6 demonstrate the performance of GSLDA by simulated examples and real data studies respectively. We conclude this paper with some discussion in Section 7. Proofs of the theoretical results are provided in the Appendix.
2. Methodology
In this section, we first review LDA and construct a relationship between β and the graph structure of features in Section 2.1, based on which GSLDA is proposed. We also explain how to estimate the graph structure when it is not directly available and discuss the connections of our methods with several existing classification methods. In Section 2.2, we investigate the overall graph structure of the features and consider a variant of GSLDA which can efficiently utilize unlabeled data.
2.1. Motivation and formulation of GSLDA
We first discuss the problem setting and introduce some notation. Consider a training dataset {(x1, g1), … , (xn, gn)}, where for each i ∈ {1, … , n}, xi ∈ ℝp is the feature vector and gi ∈ {1, 2} is the class label. A linear classifier gβ0,β is defined as follows. For any x ∈ ℝp, gβ0,β(x) = 1 if β0 + x⊺β > 0 and 2 otherwise. In particular, we consider the standard setting of two-class LDA. That is, the binary label G takes value 1 with probability π1 and 2 with probability π2 = 1 − π1, and the feature vector X has a conditional Gaussian distribution, i.e., X | G = k ~ N(μ(k), Σ) for k ∈ {1, 2}. Under this setting, the Bayes classifier is specified by
gβ0*,β* with β* = Σ−1δ and β0* = −(μ(1) + μ(2))⊺Σ−1δ/2 + ln(π1/π2), | (1) |
where δ = μ(1) − μ(2). By replacing Σ and δ in (1) with their sample estimates, we obtain the LDA classifier with β̂ = Σ̂−1δ̂ and β̂0 = −(μ̂(1) + μ̂(2))⊺β̂/2 + ln(n1/n2). Typically, we take δ̂ = μ̂(1) − μ̂(2) and Σ̂ = {(n1 − 1)S(1) + (n2 − 1)S(2)}/(n − 2), where nk, μ̂(k) and S(k) denote respectively the sample size, sample mean and sample covariance matrix of group k. Note that this formulation is valid only when n > p. In high-dimensional problems or when n ≤ p, there are various extensions of LDA that either use the same formulation with shrunken estimates of Σ and δ or estimate β directly; see [7, 12, 26, 35, 36]. Here we focus on the direct estimation approach.
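For reference, the classical plug-in rule just described can be written in a few lines of code. The following sketch (plain NumPy; function and variable names are ours for illustration, not from the paper) implements the two-class LDA classifier in the low-dimensional case n > p.

```python
import numpy as np

def lda_fit(X, g):
    """Classical two-class LDA: plug-in estimates of (beta0, beta).

    X is an (n, p) feature matrix and g a length-n label vector in {1, 2}.
    Valid only when the pooled within-class covariance estimate is invertible.
    """
    X1, X2 = X[g == 1], X[g == 2]
    n1, n2 = len(X1), len(X2)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # pooled within-class sample covariance
    S = ((n1 - 1) * np.cov(X1, rowvar=False) +
         (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    delta = mu1 - mu2
    beta = np.linalg.solve(S, delta)                     # beta-hat = Sigma^{-1} delta
    beta0 = -0.5 * (mu1 + mu2) @ beta + np.log(n1 / n2)  # intercept with log prior odds
    return beta0, beta

def lda_predict(beta0, beta, X):
    # assign class 1 when beta0 + x' beta > 0, class 2 otherwise
    return np.where(beta0 + X @ beta > 0, 1, 2)
```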
Inspired by the regression formulation of LDA [17], Mai et al. [26] proposed the direct sparse discriminant analysis (DSDA) method, which estimates β by solving the Lasso problem

(β̂0, β̂) = argminβ0,β (2n)−1 Σi (yi − β0 − xi⊺β)2 + λ Σj |βj|,
where yi = n/n1 if gi = 1 and yi = −n/n2 if gi = 2. It was shown that DSDA has the same solution path as the methods in [25, 41]. Compared to plug-in approaches, DSDA estimates β directly in high dimensions and its assumptions are less stringent. However, it is unclear how structure information among features, when available, can be utilized within DSDA or other existing high-dimensional LDA methods.
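To make the regression formulation concrete, the sketch below reproduces a DSDA-type direction using scikit-learn's ordinary Lasso on the coded response; it is only an illustration with our own function names, not the dsda package used later in the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso

def dsda_direction(X, g, lam):
    """Sketch of a DSDA-type estimate of beta via a single Lasso fit.

    Labels are recoded as y_i = n/n1 (class 1) or -n/n2 (class 2); the Lasso
    coefficient vector then plays the role of the sparse discriminant direction.
    """
    n = len(g)
    n1, n2 = np.sum(g == 1), np.sum(g == 2)
    y = np.where(g == 1, n / n1, -n / n2)
    # sklearn's Lasso minimizes (2n)^{-1} ||y - b0 - X beta||_2^2 + lam * ||beta||_1
    fit = Lasso(alpha=lam, max_iter=10000).fit(X, y)
    return fit.coef_
```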
Assume that there is some structure among the features. In particular, we consider the case where the structure can be represented by a graph 𝒢, whose nodes correspond to the p features and whose edge set is denoted by E. There are methods that effectively use the graph structure in regression problems. For example, Li and Li [19] used the penalty

Σ(j,ℓ)∈E (βj/√dj − βℓ/√dℓ)2,

where dj denotes the neighborhood size of feature j, to encourage close coefficients for connected features. Yang et al. [42] employed a pairwise ℓ∞ penalty on connected features, i.e., Σ(j,ℓ)∈E max(|βj|, |βℓ|), so that their coefficients can be estimated as zero or nonzero simultaneously. Recently, Yu and Liu [44] proposed a node-wise penalty of the form Σj τj||v(j)||2, in which each v(j) is supported on feature j and its neighbors, based on the decomposition of the regression coefficient vector β = var(X)−1cov(X, Y) into the sum v(1) + ⋯ + v(p). In contrast to these developments for regression problems, little work has been done for classification problems.
We propose our method formulation based on a decomposition of β*, the discriminant vector of Bayes’ rule. Denote Ω = Σ−1 the within-class precision matrix and δ = μ(1) − μ(2) the group mean difference. We can decompose the discriminant vector β* in (1) as
β* = Ωδ = δ1ω1 + δ2ω2 + ⋯ + δpωp, | (2) |
where ωj is the jth column of Ω. Recall that the support of Ω in fact forms a conditional correlation graph of features X. In this way, the optimal discriminant vector is linked to the Gaussian graph structure of the features. We use a toy example for demonstration. In a 3-dimensional LDA setting, assume ω23 = ω32 = 0, then β* = Ωδ = (δ1ω11 + δ2ω21 + δ3ω31, δ1ω12 + δ2ω22, δ1ω13 + δ3ω33)⊺. See Figure A.1 in the Appendix for a graphical demonstration of the decomposition.
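This decomposition is easy to verify numerically; the snippet below checks it for an arbitrary 3 × 3 precision matrix with ω23 = ω32 = 0 (all numbers are illustrative and not taken from the paper).

```python
import numpy as np

# An arbitrary symmetric positive definite precision matrix with omega_23 = omega_32 = 0
Omega = np.array([[2.0, 0.5, 0.4],
                  [0.5, 1.5, 0.0],
                  [0.4, 0.0, 1.8]])
delta = np.array([1.0, -0.5, 0.3])    # illustrative group-mean difference

beta_star = Omega @ delta
# node-wise pieces v^(j) = delta_j * omega_j, where omega_j is the j-th column of Omega
pieces = [delta[j] * Omega[:, j] for j in range(3)]
assert np.allclose(beta_star, sum(pieces))   # beta* = v^(1) + v^(2) + v^(3)
```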
Denote the graph corresponding to Ω as 𝒢, and the neighborhood of feature j ∈ {1, … , p} in 𝒢 as 𝒩j. Replacing δjωj by v(j), we have β* = v(1) + ⋯ + v(p), where v(j) is either 0 (when δj = 0) or a vector with support contained in {j} ∪ 𝒩j (when δj ≠ 0). Instead of estimating β* itself, we can estimate the v(j)’s. Moreover, the decomposition (2) motivates a natural regularization on {v(1), … , v(p)}, viz.

Σj τj||v(j)||2,  subject to supp(v(j)) ⊆ {j} ∪ 𝒩j for all j ∈ {1, … , p},

in which ||·||2 denotes the Euclidean norm and the τjs are positive weights. Note that the group ℓ2 penalty on v(j) encourages a group sparsity effect, i.e., v(j) is estimated as 0 or as a sparse vector with support contained in {j} ∪ 𝒩j, which matches the decomposition (2). In the formulation, the τjs are weights for the group regularization. In particular, the larger τj is, the more likely v(j) is to be estimated as 0. Similar to the group Lasso [45], we can take

τj = (dj + 1)1/2,

where dj = |𝒩j| is the number of neighbors of feature j in 𝒢.
We need to apply this regularization within a risk minimization framework of LDA to formulate our method. The regression formulation is an appropriate one due to its simplicity and convenience for theoretical analysis. By combining this formulation with the group regularization, we can estimate {v(1), … , v(p)} by

{v̂(1), … , v̂(p)} = argmin (2n)−1 Σi {yi − β0 − xi⊺(v(1) + ⋯ + v(p))}2 + λ Σj τj||v(j)||2, | (3) |

where supp(v(j)) ⊆ {j} ∪ 𝒩j for all j ∈ {1, … , p} and the minimization is also over the intercept β0. Then β is estimated as β̂ = v̂(1) + ⋯ + v̂(p). Furthermore, from the perspective of β estimation, the formulation is equivalent to
β̂ = argminβ0,β (2n)−1 Σi (yi − β0 − xi⊺β)2 + λP𝒢(β), | (4) |

where

P𝒢(β) = min{Σj τj||v(j)||2 : v(1) + ⋯ + v(p) = β, supp(v(j)) ⊆ {j} ∪ 𝒩j for all j} | (5) |

can be viewed as a structured regularization on β; see [31]. Since the regularization is specified by the graph 𝒢, we call the method graph-based sparse LDA (GSLDA). Although we use the same squared loss function as in [26], our method focuses on utilizing the graph structure of the features in estimating β*. We use the estimator from (4) for the discriminant vector β. With respect to β0, however, the estimator from (4) may not be a good choice for the classification problem due to the regression formulation. To solve this problem, we adopt an approach similar to that of [26] and estimate it by

β̂0 = −(μ̂(1) + μ̂(2))⊺β̂/2 + {β̂⊺Σ̂β̂/(δ̂⊺β̂)} ln(n1/n2).
While the GSLDA method is motivated from the discriminant vector decomposition (2), the decomposition of β* is not restricted to this form only. Therefore, the graph structure used in our method is not restricted to the conditional independence graph. We will present another decomposition of β* in Section 2.2. In fact, any graph structure of features satisfying our assumptions in Section 4.1 can be possibly used. When the structure information is available, e.g., the gene pathways in genetic studies, we can construct a graph using the gene pathway information. If the graph is not available, we can estimate it based on the training data. There are many methods for estimation of Gaussian graphical models, including the neighborhood selection [28], the graphical Lasso [14, 46], and the CLIME [7]. We will discuss them further in Section 3. In summary, GSLDA can be implemented in two steps: (i) graph construction and (ii) direct estimation of β via solving formulation (4).
The formulation (4) is closely related to the regression method proposed in [44]. However, both the problem setting and the motivation of our paper are different. In our problem, the response y is a binary variable and the features come from a mixed population. Although our formulation also uses the squared loss as in regression, the “error” has a very different interpretation and distribution. In particular, the conditional distribution of the residual yi − β0 − xi⊺β depends on xi. These issues bring unique challenges for the theoretical analysis of GSLDA. Although there are some classification methods that also utilize predictor structure, such as logistic regression with the group Lasso penalty [27] and LDA with the fused Lasso penalty [39], these methods do not accommodate a general graph structure.
Depending on the feature structure, there are special cases in which GSLDA is closely connected with existing sparse LDA methods. For example, if we use an empty graph with no edges at all, the regularization (5) simplifies to τ1|β1| + ⋯ + τp|βp|. Then, formulation (4) becomes an adaptive Lasso type problem, viz.

minβ0,β (2n)−1 Σi (yi − β0 − xi⊺β)2 + λ(τ1|β1| + ⋯ + τp|βp|).

When all penalty weights τj take value 1, GSLDA is equivalent to the DSDA method in [26]. When the graph consists of K disjoint complete subgraphs, denoted as 𝒢(1), … , 𝒢(K), the regularization (5) simplifies to τ(1)||βG(1)||2 + ⋯ + τ(K)||βG(K)||2, where τ(k) = minj∈G(k) τj and G(k) is the index set of predictors involved in the subgraph 𝒢(k). In this case, GSLDA becomes a variant of DSDA with the group Lasso penalty, i.e.,

minβ0,β (2n)−1 Σi (yi − β0 − xi⊺β)2 + λ{τ(1)||βG(1)||2 + ⋯ + τ(K)||βG(K)||2}.

For a general graph 𝒢, our method is different from these existing ones.
Remark 1. While we are mainly concerned with binary classification in this paper, there are many scenarios with more than two classes [21, 47, 48]. Our GSLDA method can also be extended to the multi-class case. For example, consider the formulation of K-class sparse LDA proposed in [24], viz.

(θ̂2, … , θ̂K) = argminθ2,…,θK Σk∈{2,…,K} {θk⊺Σ̂θk/2 − δ̂k⊺θk} + λ Σj ||θ·j||2,

where θ2, … , θK are discriminant vectors, δ̂k = μ̂(k) − μ̂(1), and θ·j = (θ2j, … , θKj)⊺ for j ∈ {1, … , p}. The resulting discriminant rule assigns x to the class argmaxk∈{1,…,K} {(x − (μ̂(1) + μ̂(k))/2)⊺θ̂k + ln π̂k}, where θ̂1 = 0 and π̂k is the proportion of class k in the sample. We can take advantage of a similar formulation with a graph-based regularization in which the node-wise decomposition (5) is applied jointly to θ2, … , θK. This formulation can be solved in a way similar to the binary GSLDA. Nevertheless, we do not pursue this direction in the paper so that we can focus on the core ideas of GSLDA.
2.2. Semi-supervised GSLDA
With recent advances in graphical model estimation [7, 28, 46], we can estimate the graph 𝒢 for GSLDA based on the training data when the graph structure is unknown. However, as the dimension p increases, we expect the selection error to accumulate. When the dimension is much larger than the sample size, the graph estimate used by GSLDA can be almost random. We use a toy example in Figure 1 to illustrate this phenomenon. In the setting of standard LDA, we set the prior probabilities π1 = π2 = 0.5, and the group means μ(1) = (0.5, … , 0.5, 0, … , 0)⊺ and μ(2) = (−0.5, … , −0.5, 0, … , 0)⊺, which only differ in the first 10 features. To specify the graph structure, Ω is generated from an AR(5) model, i.e., Ωjj = c and Ωjℓ = −0.5 if 1 ≤ |j − ℓ| ≤ 5 and 0 otherwise, where c > 0 is a scalar such that the eigenvalues of Ω are between 0 and 1. We standardize Ω so that diag(Ω) = 1 and define the within-class covariance matrix Σ = Ω−1. Let the sample size n be 50 and let p vary from 10 to 200. We estimate the graph by SR-SLasso [22] with the extended BIC for tuning. For each setting, we repeat the procedure 100 times and evaluate the accuracy of graph estimation by the false positive rate (FPR) and false negative rate (FNR). Figure 1 summarizes the performance of graph estimation for varying dimensions.
Figure 1:

Performance evaluation of graph estimation for varying dimensions. The black solid lines are for graph estimation based on a labeled dataset of size 50; the red dashed lines are for graph estimation based on an unlabeled dataset of size 1000; vertical segments indicate the standard deviations of FPR or FNR of 100 repetitions.
As shown in Figure 1, the graph estimation using only labeled data deteriorates quickly as the dimension increases. Note that the structured penalty in (5) encourages the coefficients of all features in a neighborhood to be nonzero together as long as some of them are useful for classification. Inaccurate graph estimation can therefore reduce both the accuracy and the interpretability of GSLDA.
Compared to labeled data, unlabeled data can be more accessible in many applications. For example, in the handwritten digit recognition problem discussed in Section 6.2, we can easily obtain a large number of images of different digits. However, it can be expensive to label these images by corresponding digits. As a result, many semi-supervised methods try to utilize the unlabeled data to improve the classification accuracy [5, 32]. In this paper, we focus on using unlabeled data for the graph construction when available. The following proposition studies the relationship between the within-class inverse covariance matrix and the overall one.
Proposition 1. Assume X comes from a mixture of two populations with a common covariance matrix Σ, where the weight and the mean of population k ∈ {1, 2} are πk and μ(k), respectively. Denote the mean difference of the two populations μ(1) − μ(2) as δ. Let Σ̃ = var(X) denote the overall covariance matrix of the population mixture and Ω̃ = Σ̃−1 the overall precision matrix. Then Σ̃ = Σ + π1π2δδ⊺ and Ω̃ = Ω − cβ*β*⊺, where β* = Ωδ and c = π1π2/(1 + π1π2δ⊺Ωδ).
As a remark, we do not require any specific distribution for the two populations in Proposition 1, while β* is the optimal discriminant vector if both classes are Gaussian. The overall precision matrix Ω̃ is sparse if both Ω and β* are sparse, and its support forms the conditional correlation graph of the mixed population. Moreover, we have Ω̃δ = β*/(1 + π1π2δ⊺Ωδ) ∝ β*. In our problem, a decomposition of the optimal discriminant vector analogous to (2) using Ω̃ can be written as

β* = ξΩ̃δ = ξ(δ1ω̃1 + δ2ω̃2 + ⋯ + δpω̃p),

where ξ = 1 + π1π2δ⊺Ωδ is a positive scalar and ω̃j is the jth column of Ω̃. Therefore, the Bayes classifier can be connected to the graph structure of the mixed population through this new decomposition. Define the graph corresponding to the support of Ω̃ as 𝒢̃. Following the same rationale as GSLDA, we can formulate another estimator of β based on the overall graph structure, viz.
β̂ = argminβ0,β (2n)−1 Σi (yi − β0 − xi⊺β)2 + λP𝒢̃(β), | (6) |

where P𝒢̃(β) is defined as in (5), with the neighborhoods and weights now determined by 𝒢̃ as in (4). The only difference between (6) and (4) is which graph structure we use. When unlabeled data are abundant, the estimated overall graph can be more accurate and thus the new formulation may provide better classification. We call formulation (6) the semi-supervised GSLDA. Similar to the original GSLDA, the semi-supervised variant also has two steps: (i) graph estimation based on all available data and (ii) direct estimation of β by solving formulation (6).
Both versions of GSLDA need to estimate a graph when no prior graph structure is given. But there is a major difference: unlike in (4), the graph in (6) is not for a Gaussian population but a Gaussian mixture. As we will see in Section 3, likelihood-based estimation such as graphical Lasso would be too complicated to implement. Instead, we can still use neighborhood selection. In fact, in regressing the feature Xj on the other features X−j, the coefficient vector corresponds to the conditional correlations between Xj and other features regardless of the distribution of the features, as stated by the following lemma.
Lemma 1. For any random vector X = (X1, … , Xp)⊺ ~ F with finite second-order moments, denote ΣF = var(X), assumed to be invertible, and ΩF = ΣF−1. Then for any j, ℓ ∈ {1, … , p},
-
(i)
ωF,jℓ, the (j, ℓ)th element of ΩF, is 0 if and only if Xj and Xℓ are conditionally uncorrelated, i.e., cov(Xj, Xℓ|X−{j,ℓ}) = 0, where X−{j,ℓ} denotes all features other than Xj and Xℓ;
-
(ii)
ωF,jℓ is 0 if and only if θj,ℓ = 0, where θj,ℓ is the coefficient of Xℓ in the least-squares regression of Xj on X−j.
This lemma is closely related to the results in [28]. According to Lemma 1, the graph based on the inverse covariance matrix always corresponds to the conditional correlation structure. As long as variable selection consistency of the regression is guaranteed, neighborhood selection methods are valid for graph estimation. Figure 1 also shows the performance of graph estimation based on a large unlabeled dataset under the same settings. We can observe that the estimation still performs well when the dimension increases.
Remark 2. In practice, we generally use all available data, including both unlabeled and labeled data, in the first step of the semi-supervised GSLDA. Note that even without unlabeled data, the method is still applicable. If we use neighborhood selection for graph estimation, then the error variance of the jth node-wise regression is 1/ω̃jj ≥ 1/ωjj by Proposition 1. In contrast, when using the labels as in the original GSLDA, the error variance is 1/ωjj. Therefore, the semi-supervised GSLDA yields better graph estimation only when unlabeled data are abundant. When there are relatively few unlabeled observations, the original GSLDA is more advantageous.
3. Graph estimation and method implementation
If the feature structure is given from prior knowledge, the graph can be directly constructed by assigning edges between related features. Otherwise, we need to estimate the graph based on training data. In particular, when unlabeled data are available, we can also use that to estimate the graph and implement semi-supervised GSLDA. In this section, we first discuss specific graph estimation methods for GSLDA. Then we introduce algorithms to solve formulation (4) as well as some strategies for efficient implementation.
3.1. Graph estimation
There have been extensive studies on graphical model estimation [7, 9, 14, 28, 38, 46]. As we discussed in Section 2.2, graph estimation based on labeled data and based on unlabeled data differ to some extent. Next we discuss them separately. Given labeled data, the log-likelihood conditional on the labels is, up to an additive constant,

ℓ(μ(1), μ(2), Ω) = (n/2) ln det Ω − (1/2) Σk∈{1,2} Σi:gi=k (xi − μ(k))⊺Ω(xi − μ(k)).

Similar to the graphical Lasso, we can estimate Ω by maximizing the ℓ1-penalized log-likelihood, i.e.,

(μ̂(1), μ̂(2), Ω̂) = argmax {ℓ(μ(1), μ(2), Ω) − (nλ/2)||Ω||1},

where the maximum is taken over μ(1), μ(2) ∈ ℝp and positive definite Ω, and ||Ω||1 = Σj≠ℓ|ωjℓ|. It results in μ̂(k) = x̄(k), the sample mean of group k, and

Ω̂ = argmaxΩ {ln det Ω − tr(Σ̂Ω) − λ||Ω||1}, | (7) |

where the maximum is over positive definite matrices and Σ̂ = n−1 Σk∈{1,2} Σi:gi=k (xi − x̄(k))(xi − x̄(k))⊺. This is equivalent to the graphical Lasso applied to the centered data x̃i = xi − x̄(gi), i ∈ {1, … , n}.
Instead of solving (7), we can also estimate the graph by neighborhood selection, as proposed in [28]. This method solves p node-wise regularized regressions, viz.

θ̂(j) = argmin (2n)−1 Σk∈{1,2} Σi:gi=k {xij(k) − θ0k − (xi,−j(k))⊺θ}2 + λ||θ||1,  j ∈ {1, … , p},

where the minimization is over θ and the group-specific intercepts θ01, θ02, xij(k) denotes the jth feature of sample i from group k, and xi,−j(k) represents the other features. One can verify that

θ̂(j) = argminθ (2n)−1 Σi (x̃ij − x̃i,−j⊺θ)2 + λ||θ||1, | (8) |

where x̃i denotes the data centered by subtracting the corresponding group means. We can also use the sequential Lasso [23] for computational efficiency. The graph is constructed by connecting nodes j and ℓ if θ̂ℓ(j) ≠ 0 and/or θ̂j(ℓ) ≠ 0.
Both approaches for estimating 𝒢 have been justified theoretically [28, 46]. In this paper, we recommend using neighborhood selection approaches for GSLDA. The main reason is that likelihood-based approaches such as the graphical Lasso usually run through many iterations and can be slow for high-dimensional data (p > 1000). In contrast, neighborhood selection only requires p penalized regressions. Moreover, our direct interest is not Ω itself but the graph, on which neighborhood selection focuses. We use the extended BIC (EBIC) [8] to select λ in (8). As suggested in [8], we choose 1 − ln n/(2 ln p) as the EBIC tuning parameter.
When we have an extra unlabeled dataset, denoted as xn+1, … , xn+m, the likelihood becomes complicated because of the Gaussian mixture distribution of the unlabeled data. Thus it is difficult to estimate the parameters via the likelihood. Moreover, the graph we need is directly related to Ω̃ rather than Ω. Thus, a penalized likelihood approach is not suitable. Nevertheless, the neighborhood selection approaches are still valid by Lemma 1, because we are only concerned with conditional correlations. In particular, we estimate the neighborhoods by

θ̂(j) = argminθ0,θ {2(n + m)}−1 Σi (xij − θ0 − xi,−j⊺θ)2 + λ||θ||1,  j ∈ {1, … , p},

where the sum runs over all n + m observations x1, … , xn+m in the combined labeled and unlabeled feature data. Similarly, we use EBIC to select the tuning parameter λ.
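A minimal sketch of this neighborhood-selection step is given below, using scikit-learn's Lasso with a fixed penalty instead of the EBIC tuning described above; the function names and the "or" symmetrization rule shown are our illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

def neighborhood_graph(Z, lam):
    """Estimate an undirected graph by node-wise Lasso regressions.

    Z is an (n, p) matrix that has already been centered appropriately:
    within each class for the supervised GSLDA, or jointly (labeled plus
    unlabeled rows, centered by the overall mean) for the semi-supervised
    variant. Nodes j and l are connected if either node-wise regression
    selects the other one (the "or" rule).
    """
    n, p = Z.shape
    adj = np.zeros((p, p), dtype=bool)
    for j in range(p):
        others = np.delete(np.arange(p), j)
        fit = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
        fit.fit(Z[:, others], Z[:, j])
        adj[j, others] = fit.coef_ != 0
    return adj | adj.T

def center_within_groups(X, g):
    # Supervised case: subtract the group mean from each observation.
    Xc = X.astype(float)
    for k in (1, 2):
        Xc[g == k] -= X[g == k].mean(axis=0)
    return Xc
```

For the semi-supervised variant, one would instead stack the labeled and unlabeled rows, subtract the overall column means, and call neighborhood_graph on the combined matrix.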
3.2. Parameter estimation and tuning parameter selection
Given the graph 𝒢, formulation (4) is a latent group Lasso problem [31]. It can be transformed to an ordinary group Lasso problem of the form (3). There are many efficient algorithms for solving group Lasso problems, for example, groupwise majorization descent [43]. For very high-dimensional data, we use an iterative proximal algorithm as in [44]. In our implementation, the tuning parameter is selected by cross-validation.
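The reduction from the latent group Lasso (4) to an ordinary group Lasso can be made explicit by duplicating, for each node j, the columns indexed by {j} ∪ 𝒩j. The sketch below solves the resulting problem with a plain proximal-gradient loop; it is a simplified stand-in for the algorithms of [43, 44], with our own function and variable names, and it assumes X has been column-centered and y coded as in Section 2.1 so that the intercept can be omitted.

```python
import numpy as np

def gslda_beta(X, y, adj, tau, lam, n_iter=2000):
    """Proximal-gradient sketch of the GSLDA estimate of beta in (4).

    adj : (p, p) boolean adjacency matrix of the graph; tau : node-wise weights.
    The columns {j} union N_j are duplicated so that the j-th block of the
    expanded coefficient vector plays the role of v^(j); beta is recovered
    as the sum v^(1) + ... + v^(p).
    """
    n, p = X.shape
    groups = [np.flatnonzero(adj[j]).tolist() + [j] for j in range(p)]
    cols = np.concatenate(groups)                  # expanded column indices
    Z = X[:, cols]                                 # duplicated design matrix
    starts = np.cumsum([0] + [len(g) for g in groups])
    w = np.zeros(Z.shape[1])
    step = n / np.linalg.norm(Z, 2) ** 2           # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        w -= step * (Z.T @ (Z @ w - y) / n)        # gradient step on the squared loss
        for j in range(p):                         # groupwise soft-thresholding
            block = slice(starts[j], starts[j + 1])
            norm = np.linalg.norm(w[block])
            if norm > 0:
                w[block] *= max(0.0, 1.0 - step * lam * tau[j] / norm)
    beta = np.zeros(p)
    np.add.at(beta, cols, w)                       # beta = v^(1) + ... + v^(p)
    return beta
```

Wrapping this routine in a cross-validation loop over lam mirrors the tuning strategy described above.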
3.3. Pre-screening
Suppose that some entries of δ are zero. Then β* is a linear combination of only a few columns of Ω, viz.

β* = Σj∈J δjωj,

where J = {j : δj ≠ 0}. Using two-sample t-tests for screening, we can specify a set J′ ⊂ {1, … , p} which is a superset of J with large probability. In particular, we have the following lemma.
Lemma 2. Define the t-statistic tj = (μ̂j(1) − μ̂j(2))/(sj,12/n1 + sj,22/n2)1/2, where sj,k2 is the sample variance of feature j ∈ {1, … , p} in group k ∈ {1,2}. Assume ln p = o(nγ), ln |J| = o(n1/2−γ Bn), and for some γ ∈ (0,1/3) and Bn → ∞. Then there exists C > 0 such that
The result in Lemma 2 was previously obtained by Fan and Fan [11] and the corresponding proof is omitted. Lemma 2 guarantees the accuracy of our pre-screening procedure.
After feature screening, the proposed regularization can be simplified as follows:

P𝒢,J′(β) = min{Σj∈J′ τj||v(j)||2 : Σj∈J′ v(j) = β, supp(v(j)) ⊆ {j} ∪ 𝒩j for all j ∈ J′}. | (9) |
Compared with the original regularization (5), the new one in (9) is often simpler and enjoys computational advantages. Moreover, the new regularization (9) only requires part of the graph, i.e., the part corresponding to the support of {ωj : j ∈ J′}. Graph estimation methods based on neighborhood selection fit into this idea naturally. When δ is approximately sparse and |J′| ⪡ p, the computational cost can be reduced substantially. Unlike the feature screening in [11], features outside J′ are not necessarily excluded. Instead, they can be introduced into the model via connection with other features in J′.
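The screening step itself is a simple marginal t-test per feature; a sketch (with an unspecified, user-chosen threshold rather than the calibrated constant of Lemma 2) is given below.

```python
import numpy as np
from scipy import stats

def prescreen(X, g, threshold):
    """Two-sample t-test screening: return J' = {j : |t_j| > threshold}.

    Welch's t-statistic matches the form in Lemma 2; the threshold is a
    user-chosen cut-off.
    """
    t, _ = stats.ttest_ind(X[g == 1], X[g == 2], axis=0, equal_var=False)
    return np.flatnonzero(np.abs(t) > threshold)
```

Only the neighborhoods of the retained features then enter the regularization (9).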
4. Theoretical properties
In this section, we study the theoretical properties of GSLDA. In particular, the original GSLDA in (4) with a known graph is considered. Since the semi-supervised GSLDA only differs from GSLDA in the graph used, we do not consider it separately. In Section 4.1, we show the selection consistency of GSLDA. In Section 4.2, we study the misclassification rate of the GSLDA and compare it with the Bayes error.
Before diving into the theoretical analysis, we first introduce some notation for our setting. We define, for an n-dimensional vector a, ||a||∞ = max(|a1|, … , |an|); for an n × m matrix A, ||A||∞ = maxi{|Ai1| + ⋯ + |Aim|} and |||A|||∞ = maxi,j |Aij|. We consider the problem setting of standard LDA, in which both within-class populations are Gaussian, i.e., X | G = k ~ N(μ(k), Σ) for k ∈ {1, 2}. The discriminant vector of the Bayes rule, denoted as β*, is given in (1). Denote A = {j : βj* ≠ 0} the active set, and s = |A|. Define β† = Ω̃δ; then β† is proportional to β* (Proposition 1) and thus defines an equivalent classifier.
4.1. Selection consistency
Assume that the feature vectors are centralized, thus . Denote and . Define
We present several assumptions to be used as follows.
-
(A1)
p = O{exp(nγ)}, s = o(na), for some γ ∈ (0,1), a ∈ (0, (1 − γ)/2).
-
(A2)
For every j ∈ {1, … , p}, either or .
-
(A3)
is bounded by φ < ∞.
-
(A4)
.
-
(A5)
.
Here (A1) specifies the order of feature dimension as well as the number of discriminating features. By Assumption (A2), a discriminative feature can only be connected with other discriminative features. This is a reasonable condition in reality since a feature is often relevant for classification if it is related to another useful feature. Condition (A3) ensures that there is no extreme collinearity among discriminative features. Assumption (A4) is an irrepresentability condition that is often employed in showing the selection consistency of regularized estimators [28, 50].
It may not be immediately clear why we impose the irrepresentability condition (A4) on Σ̃ rather than on Ω. Note that the more similar the predictive and non-predictive features are, the more difficult it is to achieve selection consistency. While Ω encodes the within-class feature dependence, the relationship among features in the whole dataset is determined by the overall covariance. Thus we impose the condition on Σ̃. The main theoretical result on the selection consistency of GSLDA is given in the following theorem.
Theorem 1 (Selection consistency). Under conditions (A1)–(A5), for an appropriate choice of λ (specified in the proof) and sufficiently large n, the GSLDA recovers the active set A with probability at least 1 − O(p−C1) for some C1 > 0.
When we use an empty graph and set τj = 1 for all j, our GSLDA is equivalent to the DSDA method. In this special case, τ* = τ* = 1, and the selection consistency conditions are similar to those for DSDA [26].
4.2. Convergence rate
With respect to a classifier, the error rate is one of the most important performance measures. In this section, we investigate the misclassification rate of GSLDA. We first present some basic results on the classification problem. For a linear classifier ĝ = gβ̂0,β̂, denote its classification error under our setting as Q(β̂0, β̂). Then we have the following result from [6].
Lemma 3 (Classification error rate in the LDA setting). Under our setting,

Q(β0, β) = π1Φ{−(β0 + μ(1)⊺β)/(β⊺Σβ)1/2} + π2Φ{(β0 + μ(2)⊺β)/(β⊺Σβ)1/2},

where Φ denotes the cumulative distribution function of the standard normal distribution. The misclassification rate of the Bayes classifier is

π1Φ{−√Δ/2 − ln(π1/π2)/√Δ} + π2Φ{−√Δ/2 + ln(π1/π2)/√Δ},

where Δ = δ⊺Ωδ.
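The error-rate expression for a linear rule can be checked quickly by Monte Carlo; the snippet below compares the closed form against simulation for an arbitrary toy configuration (all numbers are illustrative).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
p, n_mc = 3, 200_000
Sigma = np.array([[1.0, 0.3, 0.0], [0.3, 1.0, 0.2], [0.0, 0.2, 1.0]])
mu1, mu2 = np.array([0.5, 0.2, 0.0]), np.array([-0.5, -0.2, 0.0])
pi1, pi2 = 0.4, 0.6

beta = np.linalg.solve(Sigma, mu1 - mu2)              # Bayes direction
beta0 = -0.5 * (mu1 + mu2) @ beta + np.log(pi1 / pi2)
sd = np.sqrt(beta @ Sigma @ beta)
closed_form = pi1 * norm.cdf(-(beta0 + mu1 @ beta) / sd) \
            + pi2 * norm.cdf((beta0 + mu2 @ beta) / sd)

labels = rng.random(n_mc) < pi1                       # True means class 1
means = np.where(labels[:, None], mu1, mu2)
X = means + rng.multivariate_normal(np.zeros(p), Sigma, size=n_mc)
mc_error = np.mean((beta0 + X @ beta > 0) != labels)
print(closed_form, mc_error)                          # the two numbers should agree closely
```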
Since Q is a continuous function of β0 and β, the misclassification rate of the GSLDA classifier is asymptotically the same as the Bayes error rate, i.e., Q(β̂0, β̂) − Q(β0*, β*) → 0 in probability, as long as (β̂0, β̂) converges in probability to the Bayes coefficients (up to a common positive scaling). A more interesting problem is the order of the misclassification rate of GSLDA when the Bayes error rate itself vanishes, i.e., when Δ → ∞. To investigate this, we first introduce a new condition, under which we can establish an ℓ2 error bound for the GSLDA estimator.
-
(A6)
Denote , where and . For all , .
This is actually a restricted eigenvalue condition, which is often used in showing the error bound for regularized estimators [30]. Compared to the irrepresentability condition (A4), this is much less stringent. With the new condition, we have the following ℓ2 error bound for the GSLDA estimator.
Theorem 2 (ℓ2-error bound). Under conditions (A1)–(A2) and (A6), let for some C2 > 0 and n be sufficiently large, then with probability at least 1 − sp−C3 for some C3 > 0.
Based on Theorem 2 above, we can establish the asymptotic error rate of the GSLDA classifier as follows.
Theorem 3 (Convergence rate). Under conditions (A1)–(A2) and (A6), as n, p → ∞, if Δ → ∞, we have
given and , where Δ is defined as in Lemma 3 and λmax(Σ) denotes the largest eigenvalue of Σ.
That is, under mild conditions, the misclassification rate of the GSLDA classifier is of the same order as the Bayes error rate in this case.
5. Simulation study
To demonstrate the performance of the GSLDA methods, we compare them with several existing high-dimensional LDA extensions and other classification methods. The methods in comparison include the naive Bayes rule (NB), nearest shrunken centroids (NSC), sparse LDA (SLDA) [35], ℓ1 penalized Logistic regression (PLR), penalized Fisher’s discriminant analysis (PLDA) [39], direct sparse discriminant analysis (DSDA) [26], linear programming discriminant (LPD) [6], and the ROAD [12]. In particular, the methods NSC, PLR, PLDA and DSDA are implemented with R packages pamr, glmnet, penalizedLDA and dsda, respectively. We implement the LPD method via the parametric simplex algorithm [37] as suggested in [34].
Besides the above supervised methods, there are many semi-supervised clustering (or classification) methods; see, e.g., [20, 32, 52]. We have implemented the semi-supervised spectral clustering (SSSC) method proposed in [20]. Both the original and the semi-supervised GSLDA are implemented, and the latter is denoted as GSLDA-S. We also include the GSLDA methods with the true graphs, denoted as GSLDA-O (with the true 𝒢) and GSLDA-SO (with the true 𝒢̃), in the comparison. To make a fair comparison, pre-screening is not employed in the numerical studies. The Bayes rule, denoted as Oracle, is used as a benchmark.
In the simulation, we fix the dimension p = 200 and the sample size n = 200. The labels g1, … , gn are generated with π1 = π2 = 1/2 and the features are sampled from N(μ(gi), Σ) given the labels. Moreover, for the semi-supervised methods, we generate an independent dataset of sample size 2000 and remove its labels. All tuning parameters are selected by 10-fold cross validation. We consider four different feature structures as follows.
Example 1. Blockwise sparse model. In this example, ΣB is a 5 × 5 matrix with 1 for the diagonal and 0.7 for off-diagonal elements. We use 20 such blocks for the diagonal of the covariance matrix Σ and 0 for the rest, and let Ω = Σ. The group means are generated such that for j ∈ {5, 10, … , 25} and otherwise; and μ(2) = −μ(1).
Example 2. AR(3) model. The precision matrix Ω is generated such that ωjj = 1, and ωjℓ = −2/3 if 1 ≤ |j − ℓ| ≤ 3 and 0 otherwise. The group means are generated such that for j ∈ {5, 10, … , 25} and otherwise; and μ(2) = −μ(1).
Example 3. Random sparse model. The graph is generated in such a way that any two nodes are connected with probability 0.05. Based on , we generate the precision matrix Ω by setting ωjℓ = −0.5 for all connected j and ℓ in the graph and 0 otherwise. We add c Ip, where c > 0 and Ip is an identity matrix, to Ω such that the eigenvalues are between 0 and 1. We standardize Ω so that its diagonal elements are all 1. The group means are generated in such a way that for all j ∈ S and 0 otherwise; and μ(2) = −μ(1).
Example 4. Scale-free random graph. The graph is generated in a way similar to the Barabasi–Albert (BA) model; a sketch of this construction is given after this example. Starting from an identity matrix L, at step i we randomly assign −0.5 to min{⌊0.05 p⌋, i − 1} entries in row i, choosing column j < i with probability Pr(i, j) ∝ #{ℓ : Lℓj ≠ 0, 1 ≤ ℓ ≤ p}. We repeat the procedure until i = p and obtain a lower triangular matrix L. We construct Ω = L⊺L and standardize it such that the eigenvalues are between 0 and 1. Denote the 6th to 10th most connected nodes as J. The group means are generated such that for all j ∈ J and 0 otherwise; and μ(2) = −μ(1).
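The following sketch shows our reading of the Example 4 construction; the exact rescaling and random-number conventions are assumptions, not the authors' code.

```python
import numpy as np

def scale_free_precision(p, seed=None):
    """Barabasi-Albert-like precision matrix as described in Example 4."""
    rng = np.random.default_rng(seed)
    L = np.eye(p)
    m = max(1, int(np.floor(0.05 * p)))
    for i in range(1, p):                      # row i may connect to columns j < i
        k = min(m, i)
        degrees = (L[:, :i] != 0).sum(axis=0).astype(float)   # current column degrees
        cols = rng.choice(i, size=k, replace=False, p=degrees / degrees.sum())
        L[i, cols] = -0.5                      # preferential attachment
    Omega = L.T @ L                            # positive definite by construction
    Omega /= np.linalg.eigvalsh(Omega).max()   # rescale so eigenvalues lie in (0, 1]
    return Omega
```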
All four graph structures are displayed in Figure 2. The first two examples are fixed while the last two produce random graphs. Compared with the random sparse model, the scale-free random graphs feature hub nodes. For each graph structure, we repeat the simulation 100 times and evaluate the performance, in terms of both prediction and selection accuracy, of all classification methods. Table B.1 in the Appendix displays the graph estimation accuracy for all examples.
Figure 2:

The graph structures used in the simulation study. From left to right: the blockwise sparse model, the AR(3) model, the random sparse model, the scale-free model. The last two plots use one realization for demonstration, and the graphs may vary among different realizations.
Tables 1–4 summarize the performance comparison of all methods in Examples 1–4. In particular, misclassification rates in percentage (Error), false positives (FP) and false negatives (FN) of the β estimates are reported. The misclassification rate is evaluated on an independent test dataset of size 20,000. All metrics are averaged over 100 simulations and the numbers within parentheses are the standard errors. Neither NB nor SSSC is included in the comparison of variable selection, since these methods do not perform variable selection.
Table 1:
Performance comparisons of different classification methods for Example 1.
| | Error | FP | FN | Size |
|---|---|---|---|---|
| NB | 27.01 (0.18) | — | — | — |
| NSC | 14.17 (0.11) | 0.71 (0.54) | 20.27 (0.13) | 5.44 (0.62) |
| SLDA | 10.28 (0.16) | 5.71 (1.29) | 12.53 (0.31) | 18.18 (1.61) |
| PLR | 7.17 (0.13) | 14.73 (0.56) | 8.1 (0.24) | 31.63 (0.69) |
| DSDA | 6.76 (0.13) | 23.26 (1.53) | 6.79 (0.27) | 41.47 (1.71) |
| LPD | 7.80 (0.38) | 37.20 (1.97) | 5.73 (0.29) | 56.47 (2.17) |
| ROAD | 6.54 (0.12) | 23.45 (1.24) | 6.01 (0.24) | 42.44 (1.37) |
| PLDA | 14.16 (0.10) | 3.62 (1.16) | 19.53 (0.16) | 9.09 (1.29) |
| SSSC | 8.11 (0.10) | — | — | — |
| GSLDA | 5.57 (0.07) | 20.48 (2.17) | 7.31 (0.25) | 38.17 (2.33) |
| GSLDA-S | 4.53 (0.06) | 18.79 (2.26) | 0.74 (0.11) | 43.05 (2.29) |
| GSLDA-O | 4.86 (0.08) | 18.55 (2.63) | 0 (0) | 43.55 (2.63) |
| GSLDA-SO | 4.52 (0.07) | 16.43 (1.93) | 0 (0) | 41.43 (1.93) |
| Oracle | 3.27 (0.01) | 0 (0) | 0 (0) | 25 (0) |
Table 4:
Performance comparisons of different classification methods for Example 4.
| | Error | FP | FN | Size |
|---|---|---|---|---|
| NB | 32.84 (0.28) | — | — | — |
| NSC | 22.78 (0.12) | 8.62 (1.76) | 48.87 (0.76) | 18.75 (2.49) |
| SLDA | 17.53 (0.27) | 19.83 (1.29) | 38.23 (0.67) | 40.60 (1.98) |
| PLR | 14.60 (0.21) | 16.51 (0.66) | 35.5 (0.5) | 40.01 (1.06) |
| DSDA | 13.48 (0.17) | 33.71 (1.9) | 28.18 (0.65) | 64.53 (2.47) |
| LPD | 16.87 (0.36) | 46.86 (1.66) | 29.17 (0.71) | 76.69 (2.31) |
| ROAD | 13.95 (0.19) | 36.9 (2.01) | 27.71 (0.77) | 68.19 (2.69) |
| PLDA | 22.64 (0.12) | 6.6 (1.06) | 51.58 (0.45) | 14.02 (1.48) |
| SSSC | 12.08 (0.21) | — | — | — |
| GSLDA | 10.46 (0.11) | 21.53 (1.7) | 15.69 (0.55) | 64.84 (2.14) |
| GSLDA-S | 9.15 (0.12) | 12.87 (1.29) | 5.03 (0.54) | 66.84 (1.57) |
| GSLDA-O | 10.39 (0.18) | 28.29 (1.73) | 19.44 (0.8) | 67.85 (2.43) |
| GSLDA-SO | 9.36 (0.17) | 19.87 (1.69) | 5.47 (0.71) | 73.05 (2.31) |
| Oracle | 4.62 (0.02) | 0 (0) | 0 (0) | 59 (0) |
From Tables 1–4, we observe that the two plug-in extensions of LDA, namely the naive Bayes and the NSC, perform worse than ℓ1 penalized logistic regression and other direct LDA methods under these settings. This is expected because there is substantial correlation among the features while both the plug-in extensions of LDA use diagonal estimates of Σ. In contrast, the performance of the direct LDA methods varies across the settings. For example, the DSDA has lower misclassification rates than the ROAD in most cases, while ROAD has better classification accuracy in Example 1. Utilizing the graph structures, high-dimensional LDA is further improved in GSLDA. As we can see from the results, GSLDA methods have the best performance among all methods in these four settings. In particular, the GSLDA method has lower misclassification rates than all other methods except its semi-supervised variant. Since the DSDA is the special case of the GSLDA with an empty graph, it is a good benchmark to quantify the benefit of using graph structures. In most cases, the GSLDA provides better model selection than the DSDA. Therefore, utilizing the graph structure does help us to improve the LDA classifier in high dimensions.
With respect to the semi-supervised GSLDA, due to the large amount of unlabeled data, it often has better graph estimation and yields more accurate classifiers. In fact, the semi-supervised GSLDA has the lowest misclassification rates among all methods in all cases. Furthermore, the semi-supervised GSLDA has superior model selection over the original GSLDA in most cases. This demonstrates the advantages of using unlabeled data.
We notice that models estimated by the semi-supervised GSLDA often have larger sizes, and sometimes more false positives in the coefficient vectors, than the original GSLDA classifiers. This is probably because the graph used in the semi-supervised GSLDA often has more edges. There are two possible reasons: (i) the true graph 𝒢̃ corresponding to Ω̃ has more edges than 𝒢, and (ii) graph estimation based on unlabeled data uses a much larger training dataset, which often leads to denser graph estimates. While a denser graph estimate may recover more connections among the features, it can also introduce more false edges. This effect is amplified by the difficulty of graph estimation with unlabeled data. As a consequence, the semi-supervised GSLDA may suffer from more false positives, as shown in Examples 2 and 3. To resolve this issue, we may consider using a more conservative graph estimate for the semi-supervised GSLDA.
6. Real data analysis
In this section, we implement our methods and several other existing classifiers on two real datasets. The first dataset is a genetic dataset with very high dimensions, and the second one consists of images of handwritten digits. We estimate the graphs from labeled training data and unlabeled data. We find that GSLDA methods have a good performance in both datasets and utilizing the feature structure is beneficial.
6.1. Arcene cancer data
Nowadays, genetic diagnosis is an important tool in clinical studies and medical practice. Using genetic information, we can estimate the potential risk of cancer for healthy people or determine cancer subtypes for patients. The Arcene dataset is a gene dataset of 88 cancer patients and 112 healthy individuals. The dataset contains 10,000 features and was originally used in the NIPS 2003 feature selection challenge (https://archive.ics.uci.edu/ml/datasets/Arcene). Out of the 10,000 features, 7000 are real genes while the other 3000 are noise features that have no predictive power and make the prediction harder. Besides the labeled data, there is an unlabeled dataset of 700 individuals, which is used to construct the graph for GSLDA-S. As in the previous simulation studies, we apply GSLDA and the other methods to this dataset.
The labeled data are randomly split into a training set and a test set, of sizes 150 and 50, respectively. All methods except the naive Bayes are tuned by 10-fold cross validation. The experiment is repeated 100 times and the results are summarized in Table 5.
Table 5:
Comparison of GSLDA and other methods on the Arcene dataset.
| | Error | Size |
|---|---|---|
| NB | 35.50 (0.62) | — |
| NSC | 36.05 (0.61) | 9934.46 (9.06) |
| SLDA | 34.64 (0.73) | 297 (4.17) |
| PLR | 28.36 (0.65) | 16.57 (0.90) |
| DSDA | 28.29 (0.72) | 30.96 (2.69) |
| LPD | 31.59 (1.33) | 10.95 (3.58) |
| ROAD | 29.29 (0.64) | 31.86 (3.43) |
| PLDA | 34.36 (0.61) | 9.39 (1.63) |
| SSSC | 27.93 (0.83) | — |
| GSLDA | 22.57 (0.70) | 229.36 (6.39) |
| GSLDA-S | 24.50 (0.68) | 319.57 (8.37) |
From Table 5, we can see that both GSLDA and semi-supervised GSLDA outperform other methods in prediction. Although semi-supervised GSLDA uses more data for graph estimation, its performance is inferior to GSLDA for this application, possibly due to the difficulty of graph estimation based on unlabeled data. In addition, the size of the unlabeled dataset is not substantially larger than that of the labeled dataset. Compared with PLR, DSDA and ROAD, our methods have significantly larger model sizes. This may indicate that many genes are related to each other. It is likely that those genes contribute to cancer together, and including all of them in modeling can potentially make the classifier more robust. This characteristic may also contribute to the good performance of the proposed two GSLDA methods.
6.2. Semeion handwritten digits dataset
The Semeion dataset (https://archive.ics.uci.edu/ml/datasets/Semeion+Handwritten+Digit) consists of 1593 images of handwritten digits. Each digit is in the form of a 16 × 16 grayscale image and saved as a vector of 256 features. We take a subset of the dataset that only contains digits 1 and 7, which are generally difficult to distinguish. We randomly choose 40 images for training, and 80 for graph estimation of the semi-supervised GSLDA after removing labels. The remaining 200 images are used for testing. Other settings are the same as the cancer example. Table 6 gives a summary of the results.
Table 6:
Comparison of GSLDA and other methods on the Semeion dataset.
| | Error | Size |
|---|---|---|
| NB | 13.81 (0.34) | — |
| NSC | 15.21 (0.44) | 84.74 (11.31) |
| SLDA | 14.43 (0.67) | 20.23 (2.80) |
| PLR | 18.69 (0.88) | 9.46 (0.40) |
| DSDA | 13.76 (0.66) | 16.76 (1.01) |
| LPD | 17.15 (0.86) | 15.32 (0.91) |
| ROAD | 19.73 (0.98) | 15.38 (1.25) |
| SSSC | 13.97 (0.75) | — |
| GSLDA | 12.65 (0.61) | 28.46 (1.45) |
| GSLDA-S | 11.23 (0.56) | 33.28 (1.32) |
As shown in Table 6, the semi-supervised GSLDA has excellent performance for this problem. It has the lowest misclassification rate among all methods in comparison. The original GSLDA method also has good classification accuracy for this problem. Moreover, we can see that both GSLDA methods have larger model sizes than other direct LDA methods, as in the previous analysis in Section 6.1.
7. Discussion
With many extensions in the literature, LDA can be readily applied to high-dimensional classification problems. In particular, the direct approaches of high-dimensional LDA are attractive due to their simplicity and good performance. Under the standard setting of LDA problems, we explore the relationship between the graph structure of features and the optimal discriminant vector β*. Our study shows that, by taking advantage of such structure, we can get better LDA classifiers in high dimensions. Based on this idea, we propose the GSLDA method. After investigating the overall graph structure of the Gaussian mixture population for unlabeled data, we further propose the semi-supervised GSLDA that can utilize unlabeled data. Both GSLDA methods have been evaluated on simulated and real data, which demonstrate the advantages of utilizing the graph structures. Moreover, we conclude that the performance of semi-supervised GSLDA depends on both the size of the unlabeled dataset and the graph complexity. When the graph structure is very complex, it is better to consider a conservative graph estimate for GSLDA. Finally, our focus in this paper is on binary problems. It will be useful to extend the methods for multicategory problems.
Table 2:
Performance comparisons of different classification methods for Example 2.
| | Error | FP | FN | Size |
|---|---|---|---|---|
| NB | 36.59 (0.43) | — | — | — |
| NSC | 17.46 (0.14) | 42.75 (2.16) | 25.96 (0.45) | 55.79 (2.48) |
| SLDA | 14.39 (0.12) | 19.28 (1.72) | 17.59 (0.43) | 40.69 (2.17) |
| PLR | 7.86 (0.11) | 15.83 (0.42) | 20.58 (0.29) | 34.25 (0.54) |
| DSDA | 6.96 (0.09) | 25.13 (1.22) | 17.21 (0.38) | 46.92 (1.46) |
| LPD | 8.84 (0.69) | 34.48 (1.56) | 17.98 (0.48) | 55.50 (1.97) |
| ROAD | 7.42 (0.12) | 25.16 (0.98) | 17.36 (0.35) | 46.80 (1.17) |
| PLDA | 16.48 (0.12) | 2.26 (0.48) | 32.69 (0.14) | 8.57 (0.57) |
| SSSC | 9.27 (0.17) | — | — | — |
| GSLDA | 6.60 (0.10) | 25.48 (1.83) | 15.41 (0.43) | 49.07 (2.19) |
| GSLDA-S | 5.56 (0.07) | 34.43 (2.52) | 3.33 (0.41) | 70.1 (2.77) |
| GSLDA-O | 6.19 (0.09) | 27.26 (1.72) | 7.37 (0.47) | 58.89 (2.08) |
| GSLDA-SO | 5.79 (0.07) | 30.78 (1.94) | 2.16 (0.39) | 67.62 (2.31) |
| Oracle | 3.32 (0.01) | 0 (0) | 0 (0) | 39 (0) |
Table 3:
Performance comparisons of different classification methods for Example 3.
| | Error | FP | FN | Size |
|---|---|---|---|---|
| NB | 36.86 (0.80) | — | — | — |
| NSC | 24.16 (0.84) | 29.15 (3.35) | 44.78 (1.62) | 50.37 (4.93) |
| SLDA | 13.28 (0.72) | 21.07 (2.29) | 40.59 (1.57) | 46.48 (3.87) |
| PLR | 11.09 (0.12) | 21.44 (0.56) | 42.08 (0.39) | 45.36 (0.75) |
| DSDA | 10.94 (0.15) | 30.32 (1.49) | 38.12 (0.63) | 58.20 (2.01) |
| LPD | 13.19 (0.73) | 41.67 (1.52) | 39.84 (0.82) | 67.83 (2.25) |
| ROAD | 11.25 (0.15) | 33.14 (1.46) | 37.53 (0.55) | 61.61 (1.92) |
| PLDA | 26.31 (0.68) | 22.34 (2.33) | 50.89 (1.12) | 37.45 (3.41) |
| SSSC | 13.57 (0.91) | — | — | — |
| GSLDA | 10.53 (0.10) | 27.34 (1.91) | 36.67 (0.85) | 56.67 (2.67) |
| GSLDA-S | 8.77 (0.08) | 34.08 (2.77) | 18.2 (0.72) | 81.88 (3.37) |
| GSLDA-O | 9.77 (0.08) | 36.87 (2.54) | 26.22 (0.78) | 76.65 (3.27) |
| GSLDA-SO | 8.91 (0.08) | 35.17 (2.37) | 16.31 (0.63) | 84.86 (3.01) |
| Oracle | 5.36 (0.02) | 0 (0) | 0 (0) | 66 (0) |
Acknowledgments
The authors would like to thank the Editor-in-Chief, Christian Genest, the Associate Editor, and reviewers for their valuable comments and suggestions which led to a much improved presentation. This research was supported in part by National Science Foundation Grants IIS1632951, DMS1821231, and National Institute of Health Grant R01GM126550.
Appendix A. Some comments on the GSLDA method
Appendix A.1. A graphical display of the discriminant vector decomposition
Appendix A.2. Connection between GSLDA and existing methods
We first consider the case when 𝒢 is a complete graph. Without loss of generality, we assume that there is a unique minimum weight, i.e., there exists an ℓ such that τℓ < τj for all j ≠ ℓ. In this case, for any {v(1), … , v(p)} with supp(v(j)) ⊆ {j} ∪ 𝒩j and v(1) + ⋯ + v(p) = β, we have

Σj τj||v(j)||2 ≥ τℓ Σj ||v(j)||2 ≥ τℓ||v(1) + ⋯ + v(p)||2 = τℓ||β||2.

By taking v(ℓ) = β and v(j) = 0 for all j ≠ ℓ, the regularization (5) becomes exactly τℓ||β||2. Similarly, we can show the equivalence in the case where 𝒢 consists of K disjoint complete subgraphs.
Figure A.1:

A 3-dimensional LDA example demonstrating how marginal differences of the three features (δ1, δ2, δ3) contribute to the predictive power of all features. Here ω23 = ω32 = 0. The terms around each node represent a decomposition of the corresponding coefficient. The gray scale of each term and the edge direction together indicate the source of the marginal differences.
Appendix B. Numerical results
Appendix B.1. Graph estimation results
To better understand the performance of our proposed GSLDA methods, we also present their graph estimation results. In particular, we compare the graphs estimated from labeled data (for the supervised GSLDA) and from unlabeled data (for the semi-supervised GSLDA) with the true graphs, namely the within-class graph 𝒢 and the overall graph 𝒢̃. The accuracy metrics are the numbers of true positives (TP) and false positives (FP).
Table B.1:
Graph estimation accuracy for all examples in the simulations. The graphs are estimated with labeled data (L) after within-group centering, or with unlabeled data (U). The former estimate is compared with 𝒢, and the latter is compared with both 𝒢 and 𝒢̃. The results are averaged over 100 repetitions and the standard errors are provided in parentheses.
| Graph Type | Data | TP | FP | Size | True Size |
|---|---|---|---|---|---|
| Block Sparse | L | 51.54 (0.25) | 6.86 (0.48) | 58.4 (0.55) | |
| | U | 100 (0) | 82.96 (0.66) | 182.96 (0.66) | |
| | U | 176.48 (0.46) | 6.48 (0.38) | 182.96 (0.66) | |
| AR(3) | L | 468.02 (1.31) | 69.64 (1.24) | 537.66 (1.34) | |
| | U | 1178.04 (0.38) | 76.72 (0.87) | 1254.76 (1.01) | |
| | U | 1235.76 (0.75) | 19 (0.61) | 1254.76 (1.01) | |
| Random Sparse | L | 353.14 (2.02) | 69.34 (1.25) | 422.48 (1.73) | |
| | U | 814.52 (0.18) | 72.74 (1.04) | 887.26 (1.12) | |
| | U | 866.14 (0.87) | 21.12 (0.70) | 887.26 (1.12) | |
| Scale-Free | L | 374.92 (1.37) | 32.44 (0.92) | 407.36 (1.48) | |
| | U | 709.88 (0.73) | 103.5 (1.00) | 813.38 (1.18) | |
| | U | 799.08 (1.05) | 14.3 (0.53) | 813.38 (1.18) | |
Appendix B.2. Additional simulation results
The misclassification rate alone may not reflect the full performance of a classification model, especially when the classes are unbalanced. Thus we present receiver operating characteristic (ROC) curves for the classification models. Besides the balanced class setting as in the main text, we also consider an unbalanced class setting in which Class-0 accounts for 80% of the whole dataset. As we can see from Figures B.1 and B.2, our methods still outperform the other methods, with higher sensitivity at each specificity level.
Figure B.1:

ROC Curve under the balanced setting for the four examples. The proportion of Class-0 sample is 50%. The ROC curve is computed based on 100 repetitions.
Figure B.2:

ROC Curve under the unbalanced setting for the four examples. In particular, the proportion of Class-0 sample is 80%. The ROC curve is computed based on 100 repetitions.
Appendix C. Proofs to the theoretical results
Appendix C.1. Proof of Proposition 1
The random variable X can be represented as X = ξZ1 + (1 − ξ)Z2, where (i) ξ ~ Bin(1, π1) is a Bernoulli random variable and (ii) Z1 and Z2 are drawn from the two population components, respectively. Moreover, ξ, Z1, and Z2 are mutually independent. We have var(Z1) = var(Z2) = Σ, EZ1 = μ(1), and EZ2 = μ(2). Then E(X) = π1μ(1) + π2μ(2) and, by the law of total variance,

var(X) = E{var(X | ξ)} + var{E(X | ξ)} = Σ + var(ξ)(μ(1) − μ(2))(μ(1) − μ(2))⊺.
Thus the overall covariance matrix is var(X) = Σ + π1π2δδ⊺, where δ = μ(1) − μ(2).
Now we verify the form of var(X)−1, i.e., the overall precision matrix of the mixture distribution. Setting c = π1π2/(1 + π1π2δ⊺Σ−1δ), we have
(Σ + π1π2δδ⊺)(Σ−1 − cΣ−1δδ⊺Σ−1) = I + {π1π2 − c(1 + π1π2δ⊺Σ−1δ)}δδ⊺Σ−1 = I.
Denote β* = Σ−1δ. Then we have var(X)−1 = Σ−1 − cβ*β*⊺. □
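Proposition 1 can be checked numerically. The short Python sketch below uses arbitrary illustrative values of Σ, δ, and π1 and verifies that Σ−1 − cβ*β*⊺ inverts Σ + π1π2δδ⊺.

```python
import numpy as np

rng = np.random.default_rng(0)
p, pi1 = 5, 0.4
pi2 = 1.0 - pi1
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)          # a positive definite within-class covariance
delta = rng.standard_normal(p)           # mean difference vector mu1 - mu2

Sigma_overall = Sigma + pi1 * pi2 * np.outer(delta, delta)   # var(X) from Proposition 1

beta_star = np.linalg.solve(Sigma, delta)                    # beta* = Sigma^{-1} delta
c = pi1 * pi2 / (1.0 + pi1 * pi2 * (delta @ beta_star))
Omega_overall = np.linalg.inv(Sigma) - c * np.outer(beta_star, beta_star)

# The two expressions for the overall precision matrix should agree.
assert np.allclose(Omega_overall, np.linalg.inv(Sigma_overall))
```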
Appendix C.2. Proof of Theorem 1
Before the proof, we introduce a lemma from [7]; its proof is omitted.
Lemma 4. Let ξ1, … , ξn be independent random variables with mean zero. Suppose that there exist some t > 0 and B̄n > 0 such that E{ξ12 exp(t|ξ1|)} + ⋯ + E{ξn2 exp(t|ξn|)} ≤ B̄n2. Set Ct = t + t−1. Then, uniformly for 0 < x ≤ B̄n, P(ξ1 + ⋯ + ξn ≥ CtB̄nx) ≤ exp(−x2/2).
Denote , , and . By simple calculations, one can show that for all ,
where  is the centered feature vector. Thus the loss function of GSLDA in (4) is equivalent to . In the rest of the proof, we assume that the sample X has been centered. Then the GSLDA formulation (4) becomes
(C.1)
Under the assumption (A1), we can define
(C.2)
where  denotes the subgraph of  corresponding to A. If we can show that (i) all elements of  are non-zero, and (ii)  with  and  solves (C.1), then GSLDA estimation recovers all significant features accurately.
We first show statement (i). By Section 4.6 of [31], the formulation (C.2) is equivalent to where
Since this is a convex optimization problem, any solution {u(j) : j ∈ A} satisfies the KKT conditions [4]; namely, for all j ∈ A, either
or
where γ = Σj∈A u(j). Thus we have , and we can write as , where satisfies ||tA||∞ ≤ 1. We have
in which the second inequality holds for sufficiently large n because φξ1 ≤ 1 and ξ2 ≤ (1 − φξ1)−1φ2ξ1. If ξ1 ≤ ϵ and , then L1 = O(ϵ) + λτ*φ/2 > 0, which proves statement (i). By Lemma 4, statement (i) holds with probability at least 1 − 2s2 exp(−a1nϵ2/s2) − 2s exp(−a2nϵ2), for some positive constants a1 and a2.
Now we prove statement (ii). The formulation (C.1) is equivalent to , where
(C.3)
This is also a convex optimization problem, and the KKT conditions of formulation (C.3) state that, for all j ∈ {1, … , p}, either
(C.4)
or
(C.5)
where β = v(1) + ⋯ + v(p). Let v(j) = 0 for all j ∈ AC, and for all j ∈ A. Then and βAC = 0.
For j ∈ A, (C.4) holds owing to the definition of . For j ∈ AC, . Denote if and , then
By Lemma 4, statement (ii) holds with probability at least 1 − 2ps exp(−a1nϵ2/s2) − 2p exp(−a2nϵ2). By taking , the active set is recovered and  with probability at least 1 − 2s2 exp(−a1nϵ2/s2) − 2ps exp(−a1nϵ2/s2) − 2p exp(−a2nϵ2) = 1 − O(p−C1) for some C1 > 0. □
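For readers less familiar with conditions of the form (C.4)–(C.5), the LaTeX sketch below records the generic subgradient (KKT) conditions for a weighted group penalty. It assumes the penalty takes the form Σj τj‖v(j)‖2 with β = Σj v(j) (the exact groups and loss follow the definitions earlier in this appendix); it is a reference point, not a restatement of (C.4)–(C.5).

```latex
% Generic KKT conditions for \min_{v} L\big(\sum_j v^{(j)}\big) + \lambda \sum_j \tau_j \|v^{(j)}\|_2,
% where \nabla_j denotes the gradient restricted to the coordinates of group j
% (a sketch under the assumptions stated in the lead-in).
\[
  \text{for each group } j:\quad
  \begin{cases}
    \nabla_j L(\beta) + \lambda \tau_j \frac{v^{(j)}}{\|v^{(j)}\|_2} = 0, & v^{(j)} \neq 0,\\[1ex]
    \big\| \nabla_j L(\beta) \big\|_2 \le \lambda \tau_j, & v^{(j)} = 0,
  \end{cases}
  \qquad \beta = \textstyle\sum_{j} v^{(j)}.
\]
```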
Appendix C.3. Proof of Theorem 2
The proof uses the following lemma from [30].
Lemma 5. Denote a subspace of and its orthogonal complement. For a regularized estimation problem
where
- (i) R is a norm and is decomposable with respect to (), i.e., R(θ + η) = R(θ) + R(η) for all ;
- (ii) L is convex and differentiable, and satisfies the restricted strong convexity condition with curvature κL, i.e., for some θ*,  for all Δ such that .

Let λ ≥ 2R*{∇L(θ*)}, where R* denotes the dual norm of R. Then any solution to the problem satisfies
where .
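For orientation only, and only as we recall it from [30]: the conclusion of the corresponding result there bounds the estimation error, up to constants, in terms of λ, the curvature κL, and a subspace compatibility constant Ψ(𝕄̄). A schematic LaTeX rendering (not the exact display omitted above) is:

```latex
% Schematic form of the bound in [30], up to constants; offered only as a reminder.
\[
  \|\hat{\theta}_{\lambda} - \theta^{*}\|^{2} \;\lesssim\; \frac{\lambda^{2}}{\kappa_{L}^{2}}\,\Psi^{2}(\bar{\mathbb{M}}),
  \qquad
  \Psi(\bar{\mathbb{M}}) \;=\; \sup_{\theta \in \bar{\mathbb{M}} \setminus \{0\}} \frac{R(\theta)}{\|\theta\|}.
\]
```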
Proof. In the GSLDA formulation, the loss function is , and the regularization is . It has been shown in [31] that R is a norm and its dual norm is . When we take .
For some ϵ > 0 denote the event . Then by Lemma 4, . Under the event ,
We take where C2 > (a1 ∧ a2)−1. Then . Under the event with ϵ ≤ cσ for some c > 0, for sufficiently large n, we have for . Thus for all .
We take . Then and . Moreover,
Therefore, by Lemma 5, we have
with probability at least 1 − 2ps exp(−a1nϵ2) − 2p exp(−a2nϵ2) ≥ 1 − sp−C3, where C3 = C2(a1 ∧ a2) − 1 > 0. □
Appendix C.4. Proof of Theorem 3
We use the same notation as in the proofs above. Without loss of generality, we assume that μ(1) + μ(2) = 0; then μ(1) = δ/2, μ(2) = −δ/2, and . According to Proposition 2, we have
We will use the following property of the standard Gaussian distribution function [6]:
(C.6)
For k ∈ {1, 2}, let
Since , it suffices to verify the orders of r(k) and Δ.
According to Theorem 2,  with probability going to 1. Moreover, by the definition of β†, we have β† = {4/(4 + Δ)}β* and β†⊺Σβ† = 16Δ/(4 + Δ)2. Since
for sufficiently large n, we have
in which the first inequality holds because for sufficiently large n. Moreover, under we have,
Therefore,
Using the property (C.6), then we have
Since , , and , we have r(1)Δ1/2 → 0 and thus . Similarly, we can show , which completes the proof of the theorem. □
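To connect the error-rate calculations above with something computable: under the two-Gaussian model with μ(1) = δ/2, μ(2) = −δ/2, common covariance Σ, and equal class priors, the Class-1 misclassification probability of the linear rule that assigns x to Class 1 when w⊺x > 0 is Φ(−w⊺δ/{2(w⊺Σw)1/2}). The Python sketch below evaluates this quantity for illustrative inputs; all names and values are placeholders, and the final comparison only reflects the optimality of the Bayes direction β* = Σ−1δ.

```python
import numpy as np
from scipy.stats import norm

def conditional_error(w, delta, Sigma):
    """Class-1 error of the rule 'assign to Class 1 if w^T x > 0' under the
    two-Gaussian model with mu1 = delta/2, mu2 = -delta/2, common covariance
    Sigma, and equal class priors (illustrative setting only)."""
    margin = w @ delta / (2.0 * np.sqrt(w @ Sigma @ w))
    return norm.cdf(-margin)

# Illustrative check: the Bayes direction beta* versus a perturbed direction.
rng = np.random.default_rng(1)
p = 5
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)
delta = rng.standard_normal(p)
beta_star = np.linalg.solve(Sigma, delta)

err_bayes = conditional_error(beta_star, delta, Sigma)
err_noisy = conditional_error(beta_star + 0.5 * rng.standard_normal(p), delta, Sigma)
print(err_bayes, err_noisy)   # the Bayes direction should not do worse
```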
References
- [1] Bickel PJ, Levina E, Some theory for Fisher’s linear discriminant function, ‘naive Bayes’, and some alternatives when there are many more variables than observations, Bernoulli 10 (2004) 989–1010.
- [2] Bishop CM, Pattern Recognition and Machine Learning (Information Science and Statistics), Springer-Verlag, Secaucus, NJ, 2006.
- [3] Bondell HD, Reich BJ, Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR, Biometrics 64 (2008) 115–123.
- [4] Boyd S, Vandenberghe L, Convex Optimization, Cambridge University Press, Cambridge, 2004.
- [5] Cai D, He X, Han J, Semi-supervised discriminant analysis, in: 2007 IEEE 11th International Conference on Computer Vision, IEEE, pp. 1–7.
- [6] Cai T, Liu W, A direct estimation approach to sparse linear discriminant analysis, J. Amer. Statist. Assoc. 106 (2011) 1566–1577.
- [7] Cai T, Liu W, Luo X, A constrained ℓ1 minimization approach to sparse precision matrix estimation, J. Amer. Statist. Assoc. 106 (2011) 594–607.
- [8] Chen J, Chen Z, Extended Bayesian information criteria for model selection with large model spaces, Biometrika 95 (2008) 759–771.
- [9] Chen S, Witten DM, Shojaie A, Selection and estimation for mixed graphical models, Biometrika 102 (2014) 47–64.
- [10] Clemmensen L, Hastie T, Witten D, Ersbøll B, Sparse discriminant analysis, Technometrics 53 (2011) 406–413.
- [11] Fan J, Fan Y, High dimensional classification using features annealed independence rules, Ann. Statist. 36 (2008) 2605–2637.
- [12] Fan J, Feng Y, Tong X, A road to classification in high dimensional space: The regularized optimal affine discriminant, J. R. Stat. Soc. Ser. B Stat. Methodol. 74 (2012) 745–771.
- [13] Fisher RA, The use of multiple measurements in taxonomic problems, Ann. Eugenics 7 (1936) 179–188.
- [14] Friedman J, Hastie T, Tibshirani RJ, Sparse inverse covariance estimation with the graphical lasso, Biostatistics 9 (2008) 432–441.
- [15] Hand DJ, Classifier technology and the illusion of progress, Statist. Sci. 21 (2006) 1–14.
- [16] Hastie T, Tibshirani RJ, Buja A, Flexible discriminant analysis by optimal scoring, J. Amer. Statist. Assoc. 89 (1994) 1255–1270.
- [17] Hastie T, Tibshirani RJ, Friedman J, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer, New York, 2009.
- [18] Kim S, Pan W, Shen X, Network-based penalized regression with application to genomic data, Biometrics 69 (2013) 582–593.
- [19] Li C, Li H, Network-constrained regularization and variable selection for analysis of genomic data, Bioinformatics 24 (2008) 1175–1182.
- [20] Liu B, Shen X, Pan W, Semi-supervised spectral clustering with application to detect population stratification, Frontiers in Genetics 4 (2013) 215.
- [21] Liu Y, Yuan M, Reinforced multicategory support vector machines, J. Comput. Graph. Statist. 20 (2011) 901–919.
- [22] Luo S, Chen Z, Edge detection in sparse Gaussian graphical models, Comput. Statist. Data Anal. 70 (2014) 138–152.
- [23] Luo S, Chen Z, Sequential lasso cum EBIC for feature selection with ultra-high dimensional feature space, J. Amer. Statist. Assoc. 109 (2014) 1229–1240.
- [24] Mai Q, Yang Y, Zou H, Multiclass sparse discriminant analysis, arXiv preprint arXiv:1504.05845 (2015).
- [25] Mai Q, Zou H, A note on the connection and equivalence of three sparse linear discriminant analysis methods, Technometrics 55 (2013) 243–246.
- [26] Mai Q, Zou H, Yuan M, A direct approach to sparse discriminant analysis in ultra-high dimensions, Biometrika 99 (2012) 29–42.
- [27] Meier L, van de Geer S, Bühlmann P, The group lasso for logistic regression, J. R. Stat. Soc. Ser. B Stat. Methodol. 70 (2008) 53–71.
- [28] Meinshausen N, Bühlmann P, High-dimensional graphs and variable selection with the lasso, Ann. Statist. 34 (2006) 1436–1462.
- [29] Min W, Liu J, Zhang S, Network-regularized sparse logistic regression models for clinical risk prediction and biomarker discovery, IEEE/ACM Trans. Comput. Biol. Bioinform. 15 (2018) 944–953.
- [30] Negahban SN, Ravikumar P, Wainwright MJ, Yu B, A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers, Statist. Sci. 27 (2012) 538–557.
- [31] Obozinski G, Jacob L, Vert J-P, Group lasso with overlaps: The latent group lasso approach, arXiv preprint arXiv:1110.0413 (2011).
- [32] Pan W, Shen X, Penalized model-based clustering with application to variable selection, J. Machine Learn. Res. 8 (2007) 1145–1164.
- [33] Pan W, Xie B, Shen X, Incorporating predictor network in penalized regression with application to microarray data, Biometrics 66 (2010) 474–484.
- [34] Pang H, Liu H, Vanderbei R, The fastclime package for linear programming and large-scale precision matrix estimation in R, J. Machine Learn. Res. 15 (2014) 489–493.
- [35] Shao J, Wang Y, Deng X, Wang S, Sparse linear discriminant analysis by thresholding for high dimensional data, Ann. Statist. 39 (2011) 1241–1265.
- [36] Tibshirani RJ, Hastie T, Narasimhan B, Chu G, Diagnosis of multiple cancer types by shrunken centroids of gene expression, Proc. Natl. Acad. Sci. 99 (2002) 6567–6572.
- [37] Vanderbei RJ, Linear Programming: Foundations and Extensions, 4th ed., Springer, 2014.
- [38] Voorman A, Shojaie A, Witten D, Graph estimation with joint additive models, Biometrika 101 (2013) 85–101.
- [39] Witten DM, Tibshirani RJ, Penalized classification using Fisher’s linear discriminant, J. R. Stat. Soc. Ser. B Stat. Methodol. 73 (2011) 753–772.
- [40] Wu M, Zhu L, Feng X, Network-based feature screening with applications to genome data, Ann. Appl. Statist. 12 (2018) 1250–1270.
- [41] Wu MC, Zhang L, Wang Z, Christiani DC, Lin X, Sparse linear discriminant analysis for simultaneous testing for the significance of a gene set/pathway and gene selection, Bioinformatics 25 (2009) 1145–1151.
- [42] Yang S, Yuan L, Lai Y-C, Shen X, Wonka P, Ye J, Feature grouping and selection over an undirected graph, ACM, 2012, pp. 922–930.
- [43] Yang Y, Zou H, A fast unified algorithm for solving group-lasso penalize learning problems, Stat. Comput. 25 (2015) 1129–1141.
- [44] Yu G, Liu Y, Sparse regression incorporating graphical structure among predictors, J. Amer. Statist. Assoc. 111 (2016) 707–720.
- [45] Yuan M, Lin Y, Model selection and estimation in regression with grouped variables, J. R. Stat. Soc. Ser. B Stat. Methodol. 68 (2006) 49–67.
- [46] Yuan M, Lin Y, Model selection and estimation in the Gaussian graphical model, Biometrika 94 (2007) 19–35.
- [47] Zhang C, Liu Y, Multicategory large-margin unified machines, J. Machine Learn. Res. 14 (2013) 1349–1386.
- [48] Zhang C, Liu Y, Wang J, Zhu H, Reinforced angle-based multicategory support vector machines, J. Comput. Graph. Statist. 25 (2016) 806–825.
- [49] Zhang W, Wan Y-W, Allen GI, Pang K, Anderson ML, Liu Z, Molecular pathway identification using biological network-regularized logistic models, BMC Genomics 14 (2013) S7.
- [50] Zhao P, Yu B, On model selection consistency of lasso, J. Machine Learn. Res. 7 (2006) 2541–2563.
- [51] Zhao S, Shojaie A, A significance test for graph-constrained estimation, Biometrics 72 (2016) 484–493.
- [52] Zhou H, Pan W, Shen X, Penalized model-based clustering with unconstrained covariance matrices, Electron. J. Statist. 3 (2009) 1473–1496.
- [53] Zhu Y, Shen X, Pan W, Simultaneous grouping pursuit and feature selection over an undirected graph, J. Amer. Statist. Assoc. 108 (2013) 713–725.
- [53].Zhu Y, Shen X, Pan W, Simultaneous grouping pursuit and feature selection over an undirected graph, J. Amer. Statist. Assoc 108 (2013) 713–725. [DOI] [PMC free article] [PubMed] [Google Scholar]
