Abstract
Linear discriminant analysis (LDA) is a well-known classification technique that has enjoyed great success in practical applications. Despite its effectiveness for traditional low-dimensional problems, extensions of LDA are necessary for classifying high-dimensional data. Many variants of LDA have been proposed in the literature. However, most of these methods do not fully incorporate structure information among the predictors when such information is available. In this paper, we introduce a new high-dimensional LDA technique, namely graph-based sparse LDA (GSLDA), that utilizes the graph structure among the features. In particular, we use the regularized regression formulation of penalized LDA and propose to impose a structure-based sparse penalty on the discriminant vector β. The graph structure can be either given or estimated from the training data. Moreover, we explore the relationship between the within-class feature structure and the overall feature structure. Based on this relationship, we further propose a variant of GSLDA that effectively utilizes unlabeled data, which can be abundant in the semi-supervised learning setting. With the new regularization, we obtain a sparse estimate of β and classifiers that are more accurate and more interpretable than many existing methods. Both the selection consistency of the β estimate and the convergence rate of the classifier are established, and the misclassification rate of the resulting classifier converges to the Bayes error rate asymptotically. Finally, we demonstrate the competitive performance of the proposed GSLDA in both simulated and real data studies.
Keywords: Feature structure, Gaussian graphical models, Regularization, Undirected graph
1. Introduction
Classification problems are commonly seen in practice. There are many existing classification techniques in the literature; see [2, 17] for a comprehensive review. Among various existing methods, linear discriminant analysis (LDA) has a long history and remains an important tool in the standard classification toolbox. LDA can be viewed as a classification rule for the problem of two Gaussian populations with a common covariance matrix. Despite its seemingly strong assumptions, LDA often works well in practice, especially for low-dimensional problems [15]. It mimics Bayes’ rule and has a simple closed form which only involves the within-class sample covariance matrix and the group averages. In its original formulation, the discriminant vector of LDA is the product of the inverse of the within-class sample covariance matrix and the vector of group-mean differences. Thus, standard LDA can be computed and implemented easily in the traditional low-dimensional setting. LDA also has interpretations beyond the Gaussian model. In particular, the same formulation can be obtained from Fisher’s discriminant analysis problem [13], the optimal scoring problem [16], and linear regression [17].
Despite the usefulness of LDA, it needs to be adapted when the dimension of features is high. For example, the form of standard LDA is only valid when the sample covariance matrix is invertible. Moreover, as the dimension grows, the errors in the sample covariance and group means accumulate and consequently LDA can become increasingly unstable [11, 35]. To address this problem, a number of LDA extensions have been proposed for high-dimensional scenarios.
The existing high-dimensional LDA methods in the literature can be roughly divided into two categories: plug-in approaches and direct approaches. A plug-in approach tackles high-dimensional problems by using regularized estimates of the within-class covariance matrix and group means. For example, the naive Bayes method, or the independence rule, treats the covariance matrix as diagonal. Bickel and Levina [1] showed that it outperforms LDA based on the Moore–Penrose pseudoinverse of the sample covariance matrix when the dimension grows faster than the sample size. To further reduce the instability of LDA, Tibshirani et al. [36] additionally used shrunken estimates of the group means. Fan and Fan [11] showed that, even under the independent-feature assumption, naive Bayes can be as bad as random guessing due to error accumulation in the group means. They resolved this issue by reducing the dimension via feature screening. In contrast to these independence rules, Shao et al. [35] assumed sparsity of the covariance matrix and the mean difference vector, and used thresholded estimates to construct a sparse LDA classifier, which was shown to be asymptotically optimal under certain conditions. All of these methods adopt the original formulation of LDA by plugging in improved estimates of the covariance matrix and group means. Thus, some strong assumptions on the covariance matrix and the group means need to be imposed for the resulting LDA rule.
In contrast to the plug-in methods, direct approaches aim at estimating the discriminant vector β directly. Since LDA can also be obtained from some risk minimization problems, it can be extended to high-dimensional scenarios via these formulations with regularization on β. For example, Wu et al. [41] considered Fisher’s discriminant analysis and proposed an ℓ1 -penalized version for dimension reduction. The corresponding problem has a piece-wise linear solution path which can be computed efficiently. Witten and Tibshirani [39] also used Fisher’s discriminant analysis formulation for a general K-class problem with a general regularization. Clemmensen et al. [10] proposed the optimal scoring formulation with the ℓ1 penalty. Following the idea of minimizing the misclassification rates, Fan et al. [12] proposed a method closely related to the method by Wu et al. [41] and directly computed the misclassification rate of the classifier. Mai et al. [26] took advantage of the regression formulation and estimated the discriminant vector of LDA by solving a Lasso-type problem, which was shown to have the same solution path as the method of Wu et al. [41] and the method of Clemmensen et al. [10] when K = 2; see [25]. Using a different idea of direct estimation, Cai and Liu [6] formed a linear programming problem to estimate β and showed that the error rate of the estimated classifier is close to the Bayes rule under certain conditions. Compared to plug-in approaches, these methods estimate LDA directly and the assumptions can be less stringent since only the sparsity of the discriminant vector of LDA is assumed [6].
Both plug-in and direct methods can work well for certain practical problems. However, these methods do not utilize feature structure information when it is available. In practice, features are often correlated with some structure. Such structure can usually be represented by an undirected graph. Connected features may work together and thus be effective or ineffective simultaneously for classification. For instance, in the diagnosis of a disease using genetic information, genes are naturally grouped by their functions or gene pathways. Relevant genes tend to contribute, or not contribute, to the disease together. Moreover, when the population in consideration is Gaussian, the conditional independence graph, or Gaussian graphical model, often represents a natural structure. By considering such structure information, we are likely to be able to construct a better classifier. For regression problems, there are some methods in the literature that utilize the graph structure; see, e.g., [3, 18, 33, 53]. For example, Li and Li [19] proposed a penalty on the coefficient difference of each pair of connected features. Yang et al. [42] used pairwise ℓ∞ penalties on connected features to encourage their simultaneous inclusion or exclusion. Based on a decomposition of the regression coefficient vector, Yu and Liu [44] proposed a node-wise penalty, in which the regularization term is a summation of penalties over all nodes rather than all edges. Compared to pairwise penalties, the node-wise penalty is better motivated and computationally more efficient. More recently, Zhao and Shojaie [51] proposed new inference methods for such graph-constrained estimation.
Despite great progress for regression problems, much less research has been done for classification problems. Structured penalties such as group Lasso and fused Lasso have been employed in classification methods [27, 39], but they are not applicable to a general sparse graph structure among predictors. Zhang et al. [49] considered logistic regression with a combination of ℓ1 penalty and pairwise ℓ2 difference penalty. Min et al. [29] generalized the regularization and provided a unified algorithm. However, both methods may also suffer from too much computational burden in high dimensions. Very recently, Wu et al. [40] proposed an unsupervised graph-based variable screening method for general problems.
In this paper, we propose a new method, called graph-based sparse LDA (GSLDA), that exploits the graphical structure of features. GSLDA estimates LDA in high dimensions directly by solving a convex optimization problem. Similar to the sparse regression method in [44], we incorporate the graph structure through a node-wise penalty. In the presence of an underlying feature structure, the new method outperforms existing high-dimensional LDA methods by utilizing the structure directly. As a key component, the graphical structure can be either given or estimated from the training data. In addition, we investigate the relationship between the within-class inverse covariance matrix and overall inverse covariance matrix. Based on these findings, we propose a variant of GSLDA that can utilize unlabeled data, which are often much more accessible than labeled data. We name this variant as the semi-supervised GSLDA. Selection consistency is shown for the estimated discriminant vector. Moreover, we show that the misclassification rate of our classifier converges to the Bayes error rate at a fast rate under certain conditions. Numerical studies are used to demonstrate the performance of this method. In particular, the semi-supervised GSLDA enjoys higher classification accuracy than the original GSLDA method in most cases. This reveals the potential advantages of using unlabeled data in classification problems.
The rest of the paper is organized as follows. In Section 2, we review some existing high-dimensional LDA methods, and introduce our motivations and formulations of our proposed methods. Section 3 focuses on graph estimation and the implementation of GSLDA. In particular, graph estimation methods are discussed for both GSLDA and its variant. In Section 4, theoretical justification is provided for our method. Sections 5 and 6 demonstrate the performance of GSLDA by simulated examples and real data studies respectively. We conclude this paper with some discussion in Section 7. Proofs of the theoretical results are provided in the Appendix.
2. Methodology
In this section, we first review LDA and construct a relationship between β and the graph structure of features in Section 2.1, based on which GSLDA is proposed. We also explain how to estimate the graph structure when it is not directly available and discuss the connections of our methods with several existing classification methods. In Section 2.2, we investigate the overall graph structure of the features and consider a variant of GSLDA which can efficiently utilize unlabeled data.
2.1. Motivation and formulation of GSLDA
We first discuss the problem setting and introduce some notation. Consider a training dataset {(x1, g1), … , (xn, gn)}, where for each i ∈ {1, … , n}, xi ∈ ℝp is the feature vector and gi ∈ {1, 2} is the class label. A linear classifier gβ0,β is defined as follows. For any x ∈ ℝp, gβ0,β(x) = 1 if β0 + x⊺β > 0 and 2 otherwise. In particular, we consider the standard setting of two-class LDA. That is, the binary label G takes value 1 with probability π1 and 2 with probability π2 = 1 − π1, and the feature vector X has a conditional Gaussian distribution, i.e., X | G = k ~ N(μ(k), Σ) for k ∈ {1, 2}. Under this setting, the Bayes classifier is specified by
gβ0*,β* with β* = Σ−1δ and β0* = −(μ(1) + μ(2))⊺Σ−1δ/2 + ln(π1/π2), | (1) |
where δ = μ(1) − μ(2). By replacing Σ and δ in (1) with their sample estimates, we obtain the LDA classifier with β̂ = Σ̂−1δ̂ and β̂0 = −(μ̂(1) + μ̂(2))⊺β̂/2 + ln(n1/n2). Typically, we take δ̂ = μ̂(1) − μ̂(2) and Σ̂ = {(n1 − 1)S(1) + (n2 − 1)S(2)}/(n − 2), where nk, μ̂(k) and S(k) denote respectively the sample size, sample mean and sample covariance matrix of group k. Note that this formulation is valid only when n > p. In high-dimensional problems or when n ≤ p, there are various extensions of LDA that either use the same formulation with shrunken estimates of Σ and δ or estimate β directly; see [7, 12, 26, 35, 36]. Here we focus on the direct estimation approach.
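For reference, the classical plug-in rule just described can be written in a few lines of code. The following sketch (plain NumPy; function and variable names are ours for illustration, not from the paper) implements the two-class LDA classifier in the low-dimensional case n > p.

```python
import numpy as np

def lda_fit(X, g):
    """Classical two-class LDA: plug-in estimates of (beta0, beta).

    X is an (n, p) feature matrix and g a length-n label vector in {1, 2}.
    Valid only when the pooled within-class covariance estimate is invertible.
    """
    X1, X2 = X[g == 1], X[g == 2]
    n1, n2 = len(X1), len(X2)
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # pooled within-class sample covariance
    S = ((n1 - 1) * np.cov(X1, rowvar=False) +
         (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
    delta = mu1 - mu2
    beta = np.linalg.solve(S, delta)                     # beta-hat = Sigma^{-1} delta
    beta0 = -0.5 * (mu1 + mu2) @ beta + np.log(n1 / n2)  # intercept with log prior odds
    return beta0, beta

def lda_predict(beta0, beta, X):
    # assign class 1 when beta0 + x' beta > 0, class 2 otherwise
    return np.where(beta0 + X @ beta > 0, 1, 2)
```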
Inspired by the regression formulation of LDA [17], Mai et al. [26] proposed the direct sparse discriminant analysis (DSDA) method, which estimates β by solving the Lasso problem

(β̂0, β̂) = argminβ0,β (2n)−1 Σi (yi − β0 − xi⊺β)2 + λ Σj |βj|,
where yi = n/n1 if gi = 1 and yi = −n/n2 if gi = 2. It was shown that DSDA has the same solution path as the methods in [25, 41]. Compared to plug-in approaches, DSDA estimates β directly in high dimensions and its assumptions are less stringent. However, it is unclear how structure information among features, when available, can be utilized within DSDA or other existing high-dimensional LDA methods.
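To make the regression formulation concrete, the sketch below reproduces a DSDA-type direction using scikit-learn's ordinary Lasso on the coded response; it is only an illustration with our own function names, not the dsda package used later in the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso

def dsda_direction(X, g, lam):
    """Sketch of a DSDA-type estimate of beta via a single Lasso fit.

    Labels are recoded as y_i = n/n1 (class 1) or -n/n2 (class 2); the Lasso
    coefficient vector then plays the role of the sparse discriminant direction.
    """
    n = len(g)
    n1, n2 = np.sum(g == 1), np.sum(g == 2)
    y = np.where(g == 1, n / n1, -n / n2)
    # sklearn's Lasso minimizes (2n)^{-1} ||y - b0 - X beta||_2^2 + lam * ||beta||_1
    fit = Lasso(alpha=lam, max_iter=10000).fit(X, y)
    return fit.coef_
```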
Assume that there is some structure among the features. In particular, we consider the case where the structure can be represented by a graph 𝒢, whose nodes correspond to the p features and whose edge set is denoted by E. There are methods that effectively use the graph structure in regression problems. For example, Li and Li [19] used the penalty

Σ(j,ℓ)∈E (βj/√dj − βℓ/√dℓ)2,

where dj denotes the neighborhood size of feature j, to encourage close coefficients for connected features. Yang et al. [42] employed a pairwise ℓ∞ penalty on connected features, i.e., Σ(j,ℓ)∈E max(|βj|, |βℓ|), so that their coefficients can be estimated as zero or nonzero simultaneously. Recently, Yu and Liu [44] proposed a node-wise penalty of the form Σj τj||v(j)||2, in which each v(j) is supported on feature j and its neighbors, based on the decomposition of the regression coefficient vector β = var(X)−1cov(X, Y) into the sum v(1) + ⋯ + v(p). In contrast to these developments for regression problems, little work has been done for classification problems.
We propose our method formulation based on a decomposition of β*, the discriminant vector of Bayes’ rule. Denote Ω = Σ−1 the within-class precision matrix and δ = μ(1) − μ(2) the group mean difference. We can decompose the discriminant vector β* in (1) as
β* = Ωδ = δ1ω1 + δ2ω2 + ⋯ + δpωp, | (2) |
where ωj is the jth column of Ω. Recall that the support of Ω in fact forms a conditional correlation graph of features X. In this way, the optimal discriminant vector is linked to the Gaussian graph structure of the features. We use a toy example for demonstration. In a 3-dimensional LDA setting, assume ω23 = ω32 = 0, then β* = Ωδ = (δ1ω11 + δ2ω21 + δ3ω31, δ1ω12 + δ2ω22, δ1ω13 + δ3ω33)⊺. See Figure A.1 in the Appendix for a graphical demonstration of the decomposition.
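This decomposition is easy to verify numerically; the snippet below checks it for an arbitrary 3 × 3 precision matrix with ω23 = ω32 = 0 (all numbers are illustrative and not taken from the paper).

```python
import numpy as np

# An arbitrary symmetric positive definite precision matrix with omega_23 = omega_32 = 0
Omega = np.array([[2.0, 0.5, 0.4],
                  [0.5, 1.5, 0.0],
                  [0.4, 0.0, 1.8]])
delta = np.array([1.0, -0.5, 0.3])    # illustrative group-mean difference

beta_star = Omega @ delta
# node-wise pieces v^(j) = delta_j * omega_j, where omega_j is the j-th column of Omega
pieces = [delta[j] * Omega[:, j] for j in range(3)]
assert np.allclose(beta_star, sum(pieces))   # beta* = v^(1) + v^(2) + v^(3)
```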
Denote the graph corresponding to Ω as 𝒢, and the neighborhood of feature j ∈ {1, … , p} in 𝒢 as 𝒩j. Replacing δjωj by v(j), we have β* = v(1) + ⋯ + v(p), where v(j) is either 0 (when δj = 0) or a vector with support contained in {j} ∪ 𝒩j (when δj ≠ 0). Instead of estimating β* itself, we can estimate the v(j)’s. Moreover, the decomposition (2) motivates a natural regularization on {v(1), … , v(p)}, viz.

Σj τj||v(j)||2,  subject to supp(v(j)) ⊆ {j} ∪ 𝒩j for all j ∈ {1, … , p},

in which ||·||2 denotes the Euclidean norm and the τjs are positive weights. Note that the group ℓ2 penalty on v(j) encourages a group sparsity effect, i.e., v(j) is estimated as 0 or as a sparse vector with support contained in {j} ∪ 𝒩j, which matches the decomposition (2). In the formulation, the τjs are weights for the group regularization. In particular, the larger τj is, the more likely v(j) is to be estimated as 0. Similar to the group Lasso [45], we can take

τj = (dj + 1)1/2,

where dj = |𝒩j| is the number of neighbors of feature j in 𝒢.
We need to apply this regularization within a risk minimization framework of LDA to formulate our method. The regression formulation is an appropriate one due to its simplicity and convenience for theoretical analysis. By combining this formulation with the group regularization, we can estimate {v(1), … , v(p)} by

{v̂(1), … , v̂(p)} = argmin (2n)−1 Σi {yi − β0 − xi⊺(v(1) + ⋯ + v(p))}2 + λ Σj τj||v(j)||2, | (3) |

where supp(v(j)) ⊆ {j} ∪ 𝒩j for all j ∈ {1, … , p} and the minimization is also over the intercept β0. Then β is estimated as β̂ = v̂(1) + ⋯ + v̂(p). Furthermore, from the perspective of β estimation, the formulation is equivalent to
β̂ = argminβ0,β (2n)−1 Σi (yi − β0 − xi⊺β)2 + λP𝒢(β), | (4) |

where

P𝒢(β) = min{Σj τj||v(j)||2 : v(1) + ⋯ + v(p) = β, supp(v(j)) ⊆ {j} ∪ 𝒩j for all j} | (5) |

can be viewed as a structured regularization on β; see [31]. Since the regularization is specified by the graph 𝒢, we call the method graph-based sparse LDA (GSLDA). Although we use the same squared loss function as in [26], our method focuses on utilizing the graph structure of the features in estimating β*. We use the estimator from (4) for the discriminant vector β. With respect to β0, however, the estimator from (4) may not be a good choice for the classification problem due to the regression formulation. To solve this problem, we adopt an approach similar to that of [26] and estimate it by

β̂0 = −(μ̂(1) + μ̂(2))⊺β̂/2 + {β̂⊺Σ̂β̂/(δ̂⊺β̂)} ln(n1/n2).
While the GSLDA method is motivated from the discriminant vector decomposition (2), the decomposition of β* is not restricted to this form only. Therefore, the graph structure used in our method is not restricted to the conditional independence graph. We will present another decomposition of β* in Section 2.2. In fact, any graph structure of features satisfying our assumptions in Section 4.1 can be possibly used. When the structure information is available, e.g., the gene pathways in genetic studies, we can construct a graph using the gene pathway information. If the graph is not available, we can estimate it based on the training data. There are many methods for estimation of Gaussian graphical models, including the neighborhood selection [28], the graphical Lasso [14, 46], and the CLIME [7]. We will discuss them further in Section 3. In summary, GSLDA can be implemented in two steps: (i) graph construction and (ii) direct estimation of β via solving formulation (4).
The formulation (4) is closely related to the regression method proposed in [44]. However, both the problem setting and the motivation of our paper are different. In our problem, the response y is a binary variable and the features come from a mixed population. Although our formulation also uses the squared loss as in regression, the “error” has a very different interpretation and distribution. In particular, the conditional distribution of the residual yi − β0 − xi⊺β depends on xi. These issues bring unique challenges for the theoretical analysis of GSLDA. Although there are some classification methods that also utilize predictor structure, such as logistic regression with the group Lasso penalty [27] and LDA with the fused Lasso penalty [39], these methods do not accommodate a general graph structure.
Depending on the feature structure, there are special cases in which GSLDA is closely connected with existing sparse LDA methods. For example, if we use an empty graph with no edges at all, the regularization (5) simplifies to τ1|β1| + ⋯ + τp|βp|. Then, formulation (4) becomes an adaptive Lasso type problem, viz.

minβ0,β (2n)−1 Σi (yi − β0 − xi⊺β)2 + λ(τ1|β1| + ⋯ + τp|βp|).

When all penalty weights τj take value 1, GSLDA is equivalent to the DSDA method in [26]. When the graph consists of K disjoint complete subgraphs, denoted as 𝒢(1), … , 𝒢(K), the regularization (5) simplifies to τ(1)||βG(1)||2 + ⋯ + τ(K)||βG(K)||2, where τ(k) = minj∈G(k) τj and G(k) is the index set of predictors involved in the subgraph 𝒢(k). In this case, GSLDA becomes a variant of DSDA with the group Lasso penalty, i.e.,

minβ0,β (2n)−1 Σi (yi − β0 − xi⊺β)2 + λ{τ(1)||βG(1)||2 + ⋯ + τ(K)||βG(K)||2}.

For a general graph 𝒢, our method is different from these existing ones.
Remark 1. While we are mainly concerned with binary classification in this paper, there are many scenarios with more than two classes [21, 47, 48]. Our GSLDA method can also be extended to the multi-class case. For example, consider the formulation of K-class sparse LDA proposed in [24], viz.

(θ̂2, … , θ̂K) = argminθ2,…,θK Σk∈{2,…,K} {θk⊺Σ̂θk/2 − δ̂k⊺θk} + λ Σj ||θ·j||2,

where θ2, … , θK are discriminant vectors, δ̂k = μ̂(k) − μ̂(1), and θ·j = (θ2j, … , θKj)⊺ for j ∈ {1, … , p}. The resulting discriminant rule assigns x to the class argmaxk∈{1,…,K} {(x − (μ̂(1) + μ̂(k))/2)⊺θ̂k + ln π̂k}, where θ̂1 = 0 and π̂k is the proportion of class k in the sample. We can take advantage of a similar formulation with a graph-based regularization in which the node-wise decomposition (5) is applied jointly to θ2, … , θK. This formulation can be solved in a way similar to the binary GSLDA. Nevertheless, we do not pursue this direction in the paper so that we can focus on the core ideas of GSLDA.
2.2. Semi-supervised GSLDA
With recent advances in graphical model estimation [7, 28, 46], we can estimate the graph 𝒢 for GSLDA based on the training data when the graph structure is unknown. However, as the dimension p increases, we expect the selection error to accumulate. When the dimension is much larger than the sample size, the graph estimate used by GSLDA can be almost random. We use a toy example in Figure 1 to illustrate this phenomenon. In the setting of standard LDA, we set the prior probabilities π1 = π2 = 0.5, and the group means μ(1) = (0.5, … , 0.5, 0, … , 0)⊺ and μ(2) = (−0.5, … , −0.5, 0, … , 0)⊺, which only differ in the first 10 features. To specify the graph structure, Ω is generated from an AR(5) model, i.e., Ωjj = c and Ωjℓ = −0.5 if 1 ≤ |j − ℓ| ≤ 5 and 0 otherwise, where c > 0 is a scalar such that the eigenvalues of Ω are between 0 and 1. We standardize Ω so that diag(Ω) = 1 and define the within-class covariance matrix Σ = Ω−1. Let the sample size n be 50 and let p vary from 10 to 200. We estimate the graph by SR-SLasso [22] with the extended BIC for tuning. For each setting, we repeat the procedure 100 times and evaluate the accuracy of graph estimation by the false positive rate (FPR) and false negative rate (FNR). Figure 1 summarizes the performance of graph estimation for varying dimensions.
Figure 1:

Performance evaluation of graph estimation for varying dimensions. The black solid lines are for graph estimation based on a labeled dataset of size 50; the red dashed lines are for graph estimation based on an unlabeled dataset of size 1000; vertical segments indicate the standard deviations of FPR or FNR of 100 repetitions.
As shown in Figure 1, the graph estimation using only labeled data deteriorates quickly as the dimension increases. Note that the structured penalty in (5) encourages the coefficients of all features in a neighborhood to be nonzero together as long as some of them are useful for classification. Inaccurate graph estimation can therefore reduce both the accuracy and the interpretability of GSLDA.
Compared to labeled data, unlabeled data can be more accessible in many applications. For example, in the handwritten digit recognition problem discussed in Section 6.2, we can easily obtain a large number of images of different digits. However, it can be expensive to label these images by corresponding digits. As a result, many semi-supervised methods try to utilize the unlabeled data to improve the classification accuracy [5, 32]. In this paper, we focus on using unlabeled data for the graph construction when available. The following proposition studies the relationship between the within-class inverse covariance matrix and the overall one.
Proposition 1. Assume X comes from a mixture of two populations with a common covariance matrix Σ, where the weight and the mean of population k ∈ {1, 2} are πk and μ(k), respectively. Denote the mean difference of the two populations μ(1) − μ(2) as δ. Let Σ̃ = var(X) denote the overall covariance matrix of the population mixture and Ω̃ = Σ̃−1 the overall precision matrix. Then Σ̃ = Σ + π1π2δδ⊺ and Ω̃ = Ω − cβ*β*⊺, where β* = Ωδ and c = π1π2/(1 + π1π2δ⊺Ωδ).
As a remark, we do not require any specific distribution for the two populations in Proposition 1, while β* is the optimal discriminant vector if both classes are Gaussian. The overall precision matrix Ω̃ is sparse if both Ω and β* are sparse, and its support forms the conditional correlation graph of the mixed population. Moreover, we have Ω̃δ = β*/(1 + π1π2δ⊺Ωδ) ∝ β*. In our problem, a decomposition of the optimal discriminant vector analogous to (2) using Ω̃ can be written as

β* = ξΩ̃δ = ξ(δ1ω̃1 + δ2ω̃2 + ⋯ + δpω̃p),

where ξ = 1 + π1π2δ⊺Ωδ is a positive scalar and ω̃j is the jth column of Ω̃. Therefore, the Bayes classifier can be connected to the graph structure of the mixed population through this new decomposition. Define the graph corresponding to the support of Ω̃ as 𝒢̃. Following the same rationale as GSLDA, we can formulate another estimator of β based on the overall graph structure, viz.
β̂ = argminβ0,β (2n)−1 Σi (yi − β0 − xi⊺β)2 + λP𝒢̃(β), | (6) |

where P𝒢̃(β) is defined as in (5), with the neighborhoods and weights now determined by 𝒢̃ as in (4). The only difference between (6) and (4) is which graph structure we use. When unlabeled data are abundant, the estimated overall graph can be more accurate and thus the new formulation may provide better classification. We call formulation (6) the semi-supervised GSLDA. Similar to the original GSLDA, the semi-supervised variant also has two steps: (i) graph estimation based on all available data and (ii) direct estimation of β by solving formulation (6).
Both versions of GSLDA need to estimate a graph when no prior graph structure is given. But there is a major difference: unlike in (4), the graph in (6) is not for a Gaussian population but a Gaussian mixture. As we will see in Section 3, likelihood-based estimation such as graphical Lasso would be too complicated to implement. Instead, we can still use neighborhood selection. In fact, in regressing the feature Xj on the other features X−j, the coefficient vector corresponds to the conditional correlations between Xj and other features regardless of the distribution of the features, as stated by the following lemma.
Lemma 1. For any random vector X = (X1, … , Xp)⊺ ~ F with finite second-order moments, denote ΣF = var(X), assumed to be invertible, and ΩF = ΣF−1. Then for any j, ℓ ∈ {1, … , p},
-
(i)
ωF,jℓ, the (j, ℓ)th element of ΩF, is 0 if and only if Xj and Xℓ are conditionally uncorrelated, i.e., cov(Xj, Xℓ|X−{j,ℓ}) = 0, where X−{j,ℓ} denotes all features other than Xj and Xℓ;
-
(ii)
ωF,jℓ is 0 if and only if θj,ℓ = 0, where θj,ℓ is the coefficient of Xℓ in the least-squares regression of Xj on X−j.
This lemma is closely related to the results in [28]. According to Lemma 1, the graph based on the inverse covariance matrix always corresponds to the conditional correlation structure. As long as variable selection consistency of the regression is guaranteed, neighborhood selection methods are valid for graph estimation. Figure 1 also shows the performance of graph estimation based on a large unlabeled dataset under the same settings. We can observe that the estimation still performs well when the dimension increases.
Remark 2. In practice, we generally use all available data, including both unlabeled and labeled data, in the first step of the semi-supervised GSLDA. Note that even without unlabeled data, the method is still applicable. If we use neighborhood selection for graph estimation, then the error variance of the jth node-wise regression is 1/ω̃jj ≥ 1/ωjj by Proposition 1. In contrast, when using the labels as in the original GSLDA, the error variance is 1/ωjj. Therefore, the semi-supervised GSLDA yields better graph estimation only when unlabeled data are abundant. When there are relatively few unlabeled observations, the original GSLDA is more advantageous.
3. Graph estimation and method implementation
If the feature structure is given from prior knowledge, the graph can be directly constructed by assigning edges between related features. Otherwise, we need to estimate the graph based on training data. In particular, when unlabeled data are available, we can also use that to estimate the graph and implement semi-supervised GSLDA. In this section, we first discuss specific graph estimation methods for GSLDA. Then we introduce algorithms to solve formulation (4) as well as some strategies for efficient implementation.
3.1. Graph estimation
There have been extensive studies on graphical model estimation [7, 9, 14, 28, 38, 46]. As we discussed in Section 2.2, graph estimation based on labeled data and based on unlabeled data differ to some extent. Next we discuss them separately. Given labeled data, the log-likelihood conditional on the labels is, up to an additive constant,

ℓ(μ(1), μ(2), Ω) = (n/2) ln det Ω − (1/2) Σk∈{1,2} Σi:gi=k (xi − μ(k))⊺Ω(xi − μ(k)).

Similar to the graphical Lasso, we can estimate Ω by maximizing the ℓ1-penalized log-likelihood, i.e.,

(μ̂(1), μ̂(2), Ω̂) = argmax {ℓ(μ(1), μ(2), Ω) − (nλ/2)||Ω||1},

where the maximum is taken over μ(1), μ(2) ∈ ℝp and positive definite Ω, and ||Ω||1 = Σj≠ℓ|ωjℓ|. It results in μ̂(k) = x̄(k), the sample mean of group k, and

Ω̂ = argmaxΩ {ln det Ω − tr(Σ̂Ω) − λ||Ω||1}, | (7) |

where the maximum is over positive definite matrices and Σ̂ = n−1 Σk∈{1,2} Σi:gi=k (xi − x̄(k))(xi − x̄(k))⊺. This is equivalent to the graphical Lasso applied to the centered data x̃i = xi − x̄(gi), i ∈ {1, … , n}.
Instead of solving (7), we can also estimate the graph by neighborhood selection, as proposed in [28]. This method solves p node-wise regularized regressions, viz.

θ̂(j) = argmin (2n)−1 Σk∈{1,2} Σi:gi=k {xij(k) − θ0k − (xi,−j(k))⊺θ}2 + λ||θ||1,  j ∈ {1, … , p},

where the minimization is over θ and the group-specific intercepts θ01, θ02, xij(k) denotes the jth feature of sample i from group k, and xi,−j(k) represents the other features. One can verify that

θ̂(j) = argminθ (2n)−1 Σi (x̃ij − x̃i,−j⊺θ)2 + λ||θ||1, | (8) |

where x̃i denotes the data centered by subtracting the corresponding group means. We can also use the sequential Lasso [23] for computational efficiency. The graph is constructed by connecting nodes j and ℓ if θ̂ℓ(j) ≠ 0 and/or θ̂j(ℓ) ≠ 0.
Both approaches for estimating 𝒢 have been justified theoretically [28, 46]. In this paper, we recommend using neighborhood selection approaches for GSLDA. The main reason is that likelihood-based approaches such as the graphical Lasso usually run through many iterations and can be slow for high-dimensional data (p > 1000). In contrast, neighborhood selection only requires p penalized regressions. Moreover, our direct interest is not Ω itself but the graph, on which neighborhood selection focuses. We use the extended BIC (EBIC) [8] to select λ in (8). As suggested in [8], we choose 1 − ln n/(2 ln p) as the EBIC tuning parameter.
When we have an extra unlabeled dataset, denoted as xn+1, … , xn+m, the likelihood becomes complicated because of the Gaussian mixture distribution of the unlabeled data. Thus it is difficult to estimate the parameters via the likelihood. Moreover, the graph we need is directly related to Ω̃ rather than Ω. Thus, a penalized likelihood approach is not suitable. Nevertheless, the neighborhood selection approaches are still valid by Lemma 1, because we are only concerned with conditional correlations. In particular, we estimate the neighborhoods by

θ̂(j) = argminθ0,θ {2(n + m)}−1 Σi (xij − θ0 − xi,−j⊺θ)2 + λ||θ||1,  j ∈ {1, … , p},

where the sum runs over all n + m observations x1, … , xn+m in the combined labeled and unlabeled feature data. Similarly, we use EBIC to select the tuning parameter λ.
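A minimal sketch of this neighborhood-selection step is given below, using scikit-learn's Lasso with a fixed penalty instead of the EBIC tuning described above; the function names and the "or" symmetrization rule shown are our illustrative choices.

```python
import numpy as np
from sklearn.linear_model import Lasso

def neighborhood_graph(Z, lam):
    """Estimate an undirected graph by node-wise Lasso regressions.

    Z is an (n, p) matrix that has already been centered appropriately:
    within each class for the supervised GSLDA, or jointly (labeled plus
    unlabeled rows, centered by the overall mean) for the semi-supervised
    variant. Nodes j and l are connected if either node-wise regression
    selects the other one (the "or" rule).
    """
    n, p = Z.shape
    adj = np.zeros((p, p), dtype=bool)
    for j in range(p):
        others = np.delete(np.arange(p), j)
        fit = Lasso(alpha=lam, fit_intercept=False, max_iter=10000)
        fit.fit(Z[:, others], Z[:, j])
        adj[j, others] = fit.coef_ != 0
    return adj | adj.T

def center_within_groups(X, g):
    # Supervised case: subtract the group mean from each observation.
    Xc = X.astype(float)
    for k in (1, 2):
        Xc[g == k] -= X[g == k].mean(axis=0)
    return Xc
```

For the semi-supervised variant, one would instead stack the labeled and unlabeled rows, subtract the overall column means, and call neighborhood_graph on the combined matrix.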
3.2. Parameter estimation and tuning parameter selection
Given the graph 𝒢, formulation (4) is a latent group Lasso problem [31]. It can be transformed to an ordinary group Lasso problem of the form (3). There are many efficient algorithms for solving group Lasso problems, for example, groupwise majorization descent [43]. For very high-dimensional data, we use an iterative proximal algorithm as in [44]. In our implementation, the tuning parameter is selected by cross-validation.
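The reduction from the latent group Lasso (4) to an ordinary group Lasso can be made explicit by duplicating, for each node j, the columns indexed by {j} ∪ 𝒩j. The sketch below solves the resulting problem with a plain proximal-gradient loop; it is a simplified stand-in for the algorithms of [43, 44], with our own function and variable names, and it assumes X has been column-centered and y coded as in Section 2.1 so that the intercept can be omitted.

```python
import numpy as np

def gslda_beta(X, y, adj, tau, lam, n_iter=2000):
    """Proximal-gradient sketch of the GSLDA estimate of beta in (4).

    adj : (p, p) boolean adjacency matrix of the graph; tau : node-wise weights.
    The columns {j} union N_j are duplicated so that the j-th block of the
    expanded coefficient vector plays the role of v^(j); beta is recovered
    as the sum v^(1) + ... + v^(p).
    """
    n, p = X.shape
    groups = [np.flatnonzero(adj[j]).tolist() + [j] for j in range(p)]
    cols = np.concatenate(groups)                  # expanded column indices
    Z = X[:, cols]                                 # duplicated design matrix
    starts = np.cumsum([0] + [len(g) for g in groups])
    w = np.zeros(Z.shape[1])
    step = n / np.linalg.norm(Z, 2) ** 2           # 1 / Lipschitz constant of the gradient
    for _ in range(n_iter):
        w -= step * (Z.T @ (Z @ w - y) / n)        # gradient step on the squared loss
        for j in range(p):                         # groupwise soft-thresholding
            block = slice(starts[j], starts[j + 1])
            norm = np.linalg.norm(w[block])
            if norm > 0:
                w[block] *= max(0.0, 1.0 - step * lam * tau[j] / norm)
    beta = np.zeros(p)
    np.add.at(beta, cols, w)                       # beta = v^(1) + ... + v^(p)
    return beta
```

Wrapping this routine in a cross-validation loop over lam mirrors the tuning strategy described above.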
3.3. Pre-screening
Suppose that some entries of δ are zero. Then β* is a linear combination of only a few columns of Ω, viz.

β* = Σj∈J δjωj,

where J = {j : δj ≠ 0}. Using two-sample t-tests for screening, we can specify a set J′ ⊂ {1, … , p} which is a superset of J with large probability. In particular, we have the following lemma.
Lemma 2. Define the t-statistic tj = (μ̂j(1) − μ̂j(2))/(sj,12/n1 + sj,22/n2)1/2, where sj,k2 is the sample variance of feature j ∈ {1, … , p} in group k ∈ {1,2}. Assume ln p = o(nγ), ln |J| = o(n1/2−γ Bn), and for some γ ∈ (0,1/3) and Bn → ∞. Then there exists C > 0 such that
The result in Lemma 2 was previously obtained by Fan and Fan [11] and the corresponding proof is omitted. Lemma 2 guarantees the accuracy of our pre-screening procedure.
After feature screening, the proposed regularization can be simplified as follows:

P𝒢,J′(β) = min{Σj∈J′ τj||v(j)||2 : Σj∈J′ v(j) = β, supp(v(j)) ⊆ {j} ∪ 𝒩j for all j ∈ J′}. | (9) |
Compared with the original regularization (5), the new one in (9) is often simpler and enjoys computational advantages. Moreover, the new regularization (9) only requires part of the graph, i.e., the part corresponding to the support of {ωj : j ∈ J′}. Graph estimation methods based on neighborhood selection fit into this idea naturally. When δ is approximately sparse and |J′| ⪡ p, the computational cost can be reduced substantially. Unlike the feature screening in [11], features outside J′ are not necessarily excluded. Instead, they can be introduced into the model via connection with other features in J′.
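The screening step itself is a simple marginal t-test per feature; a sketch (with an unspecified, user-chosen threshold rather than the calibrated constant of Lemma 2) is given below.

```python
import numpy as np
from scipy import stats

def prescreen(X, g, threshold):
    """Two-sample t-test screening: return J' = {j : |t_j| > threshold}.

    Welch's t-statistic matches the form in Lemma 2; the threshold is a
    user-chosen cut-off.
    """
    t, _ = stats.ttest_ind(X[g == 1], X[g == 2], axis=0, equal_var=False)
    return np.flatnonzero(np.abs(t) > threshold)
```

Only the neighborhoods of the retained features then enter the regularization (9).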
4. Theoretical properties
In this section, we study the theoretical properties of GSLDA. In particular, the original GSLDA in (4) with a known graph is considered. Since the semi-supervised GSLDA only differs from GSLDA in the graph used, we do not consider it separately. In Section 4.1, we show the selection consistency of GSLDA. In Section 4.2, we study the misclassification rate of the GSLDA and compare it with the Bayes error.
Before diving into the theoretical analysis, we first introduce some notation for our setting. We define, for an n-dimensional vector a, ||a||∞ = max(|a1|, … , |an|); for an n × m matrix A, ||A||∞ = maxi{|Ai1| + ⋯ + |Aim|} and |||A|||∞ = maxi,j |Aij|. We consider the problem setting of standard LDA, in which both within-class populations are Gaussian, i.e., X | G = k ~ N(μ(k), Σ) for k ∈ {1, 2}. The discriminant vector of the Bayes rule, denoted as β*, is given in (1). Denote A = {j : βj* ≠ 0} the active set, and s = |A|. Define β† = Ω̃δ; then β† is proportional to β* (Proposition 1) and thus defines an equivalent classifier.
4.1. Selection consistency
Assume that the feature vectors are centralized, thus . Denote and . Define
We present several assumptions to be used as follows.
-
(A1)
p = O{exp(nγ)}, s = o(na), for some γ ∈ (0,1), a ∈ (0, (1 − γ)/2).
-
(A2)
For every j ∈ {1, … , p}, either or .
-
(A3)
is bounded by φ < ∞.
-
(A4)
.
-
(A5)
.
Here (A1) specifies the order of feature dimension as well as the number of discriminating features. By Assumption (A2), a discriminative feature can only be connected with other discriminative features. This is a reasonable condition in reality since a feature is often relevant for classification if it is related to another useful feature. Condition (A3) ensures that there is no extreme collinearity among discriminative features. Assumption (A4) is an irrepresentability condition that is often employed in showing the selection consistency of regularized estimators [28, 50].
It may not be immediately clear why we impose the irrepresentability condition (A4) on Σ̃ rather than on Ω. Note that the more similar the predictive and non-predictive features are, the more difficult it is to achieve selection consistency. While Ω encodes the within-class feature dependence, the relationship among features in the whole dataset is determined by the overall covariance. Thus we impose the condition on Σ̃. The main theoretical result on the selection consistency of GSLDA is given in the following theorem.
Theorem 1 (Selection consistency). Under conditions (A1)–(A5), for an appropriate choice of λ (specified in the proof) and sufficiently large n, the GSLDA recovers the active set A with probability at least 1 − O(p−C1) for some C1 > 0.
When we use an empty graph and set τj = 1 for all j, our GSLDA is equivalent to the DSDA method. In this special case, τ* = τ* = 1, and the selection consistency conditions are similar to those for DSDA [26].
4.2. Convergence rate
With respect to a classifier, the error rate is one of the most important performance measures. In this section, we investigate the misclassification rate of GSLDA. We first present some basic results on the classification problem. For a linear classifier ĝ = gβ̂0,β̂, denote its classification error under our setting as Q(β̂0, β̂). Then we have the following result from [6].
Lemma 3 (Classification error rate in the LDA setting). Under our setting,

Q(β0, β) = π1Φ{−(β0 + μ(1)⊺β)/(β⊺Σβ)1/2} + π2Φ{(β0 + μ(2)⊺β)/(β⊺Σβ)1/2},

where Φ denotes the cumulative distribution function of the standard normal distribution. The misclassification rate of the Bayes classifier is

π1Φ{−√Δ/2 − ln(π1/π2)/√Δ} + π2Φ{−√Δ/2 + ln(π1/π2)/√Δ},

where Δ = δ⊺Ωδ.
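The error-rate expression for a linear rule can be checked quickly by Monte Carlo; the snippet below compares the closed form against simulation for an arbitrary toy configuration (all numbers are illustrative).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
p, n_mc = 3, 200_000
Sigma = np.array([[1.0, 0.3, 0.0], [0.3, 1.0, 0.2], [0.0, 0.2, 1.0]])
mu1, mu2 = np.array([0.5, 0.2, 0.0]), np.array([-0.5, -0.2, 0.0])
pi1, pi2 = 0.4, 0.6

beta = np.linalg.solve(Sigma, mu1 - mu2)              # Bayes direction
beta0 = -0.5 * (mu1 + mu2) @ beta + np.log(pi1 / pi2)
sd = np.sqrt(beta @ Sigma @ beta)
closed_form = pi1 * norm.cdf(-(beta0 + mu1 @ beta) / sd) \
            + pi2 * norm.cdf((beta0 + mu2 @ beta) / sd)

labels = rng.random(n_mc) < pi1                       # True means class 1
means = np.where(labels[:, None], mu1, mu2)
X = means + rng.multivariate_normal(np.zeros(p), Sigma, size=n_mc)
mc_error = np.mean((beta0 + X @ beta > 0) != labels)
print(closed_form, mc_error)                          # the two numbers should agree closely
```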
Since Q is a continuous function of β0 and β, the misclassification rate of the GSLDA classifier is asymptotically the same as the Bayes error rate, i.e., Q(β̂0, β̂) − Q(β0*, β*) → 0 in probability, as long as (β̂0, β̂) converges in probability to the Bayes coefficients (up to a common positive scaling). A more interesting problem is the order of the misclassification rate of GSLDA when the Bayes error rate itself vanishes, i.e., when Δ → ∞. To investigate this, we first introduce a new condition, under which we can establish an ℓ2 error bound for the GSLDA estimator.
-
(A6)
Denote , where and . For all , .
This is actually a restricted eigenvalue condition, which is often used in showing the error bound for regularized estimators [30]. Compared to the irrepresentability condition (A4), this is much less stringent. With the new condition, we have the following ℓ2 error bound for the GSLDA estimator.
Theorem 2 (ℓ2-error bound). Under conditions (A1)–(A2) and (A6), let for some C2 > 0 and n be sufficiently large, then with probability at least 1 − sp−C3 for some C3 > 0.
Based on Theorem 2 above, we can establish the asymptotic error rate of the GSLDA classifier as follows.
Theorem 3 (Convergence rate). Under conditions (A1)–(A2) and (A6), as n, p → ∞, if Δ → ∞, we have
given and , where Δ is defined as in Lemma 3 and λmax(Σ) denotes the largest eigenvalue of Σ.
That is, under mild conditions, the misclassification rate of the GSLDA classifier is of the same order as the Bayes error rate in this case.
5. Simulation study
To demonstrate the performance of the GSLDA methods, we compare them with several existing high-dimensional LDA extensions and other classification methods. The methods in comparison include the naive Bayes rule (NB), nearest shrunken centroids (NSC), sparse LDA (SLDA) [35], ℓ1 penalized Logistic regression (PLR), penalized Fisher’s discriminant analysis (PLDA) [39], direct sparse discriminant analysis (DSDA) [26], linear programming discriminant (LPD) [6], and the ROAD [12]. In particular, the methods NSC, PLR, PLDA and DSDA are implemented with R packages pamr, glmnet, penalizedLDA and dsda, respectively. We implement the LPD method via the parametric simplex algorithm [37] as suggested in [34].
Besides the above supervised methods, there are many semi-supervised clustering (or classification) methods; see, e.g., [20, 32, 52]. We have implemented the semi-supervised spectral clustering (SSSC) method proposed in [20]. Both the original and the semi-supervised GSLDA are implemented, and the latter is denoted as GSLDA-S. We also include the GSLDA methods with the true graphs, denoted as GSLDA-O (with the true 𝒢) and GSLDA-SO (with the true 𝒢̃), in the comparison. To make a fair comparison, pre-screening is not employed in the numerical studies. The Bayes rule, denoted as Oracle, is used as a benchmark.
In the simulation, we fix the dimension p = 200 and the sample size n = 200. The labels g1, … , gn are generated with π1 = π2 = 1/2 and the features are sampled from N(μ(gi), Σ) given the labels. Moreover, for the semi-supervised methods, we generate an independent dataset of sample size 2000 and remove its labels. All tuning parameters are selected by 10-fold cross validation. We consider four different feature structures as follows.
Example 1. Blockwise sparse model. In this example, ΣB is a 5 × 5 matrix with 1 for the diagonal and 0.7 for off-diagonal elements. We use 20 such blocks for the diagonal of the covariance matrix Σ and 0 for the rest, and let Ω = Σ. The group means are generated such that for j ∈ {5, 10, … , 25} and otherwise; and μ(2) = −μ(1).
Example 2. AR(3) model. The precision matrix Ω is generated such that ωjj = 1, and ωjℓ = −2/3 if 1 ≤ |j − ℓ| ≤ 3 and 0 otherwise. The group means are generated such that for j ∈ {5, 10, … , 25} and otherwise; and μ(2) = −μ(1).
Example 3. Random sparse model. The graph is generated in such a way that any two nodes are connected with probability 0.05. Based on , we generate the precision matrix Ω by setting ωjℓ = −0.5 for all connected j and ℓ in the graph and 0 otherwise. We add c Ip, where c > 0 and Ip is an identity matrix, to Ω such that the eigenvalues are between 0 and 1. We standardize Ω so that its diagonal elements are all 1. The group means are generated in such a way that for all j ∈ S and 0 otherwise; and μ(2) = −μ(1).
Example 4. Scale-free random graph. The graph is generated in a way similar to the Barabasi–Albert (BA) model; a sketch of this construction is given after this example. Starting from an identity matrix L, at step i we randomly assign −0.5 to min{⌊0.05 p⌋, i − 1} entries in row i, choosing column j < i with probability Pr(i, j) ∝ #{ℓ : Lℓj ≠ 0, 1 ≤ ℓ ≤ p}. We repeat the procedure until i = p and obtain a lower triangular matrix L. We construct Ω = L⊺L and standardize it such that the eigenvalues are between 0 and 1. Denote the 6th to 10th most connected nodes as J. The group means are generated such that for all j ∈ J and 0 otherwise; and μ(2) = −μ(1).
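The following sketch shows our reading of the Example 4 construction; the exact rescaling and random-number conventions are assumptions, not the authors' code.

```python
import numpy as np

def scale_free_precision(p, seed=None):
    """Barabasi-Albert-like precision matrix as described in Example 4."""
    rng = np.random.default_rng(seed)
    L = np.eye(p)
    m = max(1, int(np.floor(0.05 * p)))
    for i in range(1, p):                      # row i may connect to columns j < i
        k = min(m, i)
        degrees = (L[:, :i] != 0).sum(axis=0).astype(float)   # current column degrees
        cols = rng.choice(i, size=k, replace=False, p=degrees / degrees.sum())
        L[i, cols] = -0.5                      # preferential attachment
    Omega = L.T @ L                            # positive definite by construction
    Omega /= np.linalg.eigvalsh(Omega).max()   # rescale so eigenvalues lie in (0, 1]
    return Omega
```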
All four graph structures are displayed in Figure 2. The first two examples are fixed while the last two produce random graphs. Compared with the random sparse model, the scale-free random graphs feature hub nodes. For each graph structure, we repeat the simulation 100 times and evaluate the performance, in terms of both prediction and selection accuracy, of all classification methods. Table B.1 in the Appendix displays the graph estimation accuracy for all examples.
Figure 2:

The graph structures used in the simulation study. From left to right: the blockwise sparse model, the AR(3) model, the random sparse model, the scale-free model. The last two plots use one realization for demonstration, and the graphs may vary among different realizations.
Tables 1–4 summarize the performance comparison of all methods in Examples 1–4. In particular, misclassification rates in percentage (Error), false positives (FP) and false negatives (FN) of the β estimates are reported. The misclassification rate is evaluated on an independent test dataset of size 20,000. All metrics are averaged over 100 simulations and the numbers within parentheses are the standard errors. Neither NB nor SSSC is included in the comparison of variable selection, since these methods do not perform variable selection.
Table 1:
Performance comparisons of different classification methods for Example 1.
| | Error | FP | FN | Size |
|---|---|---|---|---|
| NB | 27.01 (0.18) | — | — | — |
| NSC | 14.17 (0.11) | 0.71 (0.54) | 20.27 (0.13) | 5.44 (0.62) |
| SLDA | 10.28 (0.16) | 5.71 (1.29) | 12.53 (0.31) | 18.18 (1.61) |
| PLR | 7.17 (0.13) | 14.73 (0.56) | 8.1 (0.24) | 31.63 (0.69) |
| DSDA | 6.76 (0.13) | 23.26 (1.53) | 6.79 (0.27) | 41.47 (1.71) |
| LPD | 7.80 (0.38) | 37.20 (1.97) | 5.73 (0.29) | 56.47 (2.17) |
| ROAD | 6.54 (0.12) | 23.45 (1.24) | 6.01 (0.24) | 42.44 (1.37) |
| PLDA | 14.16 (0.10) | 3.62 (1.16) | 19.53 (0.16) | 9.09 (1.29) |
| SSSC | 8.11 (0.10) | — | — | — |
| GSLDA | 5.57 (0.07) | 20.48 (2.17) | 7.31 (0.25) | 38.17 (2.33) |
| GSLDA-S | 4.53 (0.06) | 18.79 (2.26) | 0.74 (0.11) | 43.05 (2.29) |
| GSLDA-O | 4.86 (0.08) | 18.55 (2.63) | 0 (0) | 43.55 (2.63) |
| GSLDA-SO | 4.52 (0.07) | 16.43 (1.93) | 0 (0) | 41.43 (1.93) |
| Oracle | 3.27 (0.01) | 0 (0) | 0 (0) | 25 (0) |
Table 4:
Performance comparisons of different classification methods for Example 4.
| | Error | FP | FN | Size |
|---|---|---|---|---|
| NB | 32.84 (0.28) | — | — | — |
| NSC | 22.78 (0.12) | 8.62 (1.76) | 48.87 (0.76) | 18.75 (2.49) |
| SLDA | 17.53 (0.27) | 19.83 (1.29) | 38.23 (0.67) | 40.60 (1.98) |
| PLR | 14.60 (0.21) | 16.51 (0.66) | 35.5 (0.5) | 40.01 (1.06) |
| DSDA | 13.48 (0.17) | 33.71 (1.9) | 28.18 (0.65) | 64.53 (2.47) |
| LPD | 16.87 (0.36) | 46.86 (1.66) | 29.17 (0.71) | 76.69 (2.31) |
| ROAD | 13.95 (0.19) | 36.9 (2.01) | 27.71 (0.77) | 68.19 (2.69) |
| PLDA | 22.64 (0.12) | 6.6 (1.06) | 51.58 (0.45) | 14.02 (1.48) |
| SSSC | 12.08 (0.21) | — | — | — |
| GSLDA | 10.46 (0.11) | 21.53 (1.7) | 15.69 (0.55) | 64.84 (2.14) |
| GSLDA-S | 9.15 (0.12) | 12.87 (1.29) | 5.03 (0.54) | 66.84 (1.57) |
| GSLDA-O | 10.39 (0.18) | 28.29 (1.73) | 19.44 (0.8) | 67.85 (2.43) |
| GSLDA-SO | 9.36 (0.17) | 19.87 (1.69) | 5.47 (0.71) | 73.05 (2.31) |
| Oracle | 4.62 (0.02) | 0 (0) | 0 (0) | 59 (0) |
From Tables 1–4, we observe that the two plug-in extensions of LDA, namely the naive Bayes and the NSC, perform worse than ℓ1 penalized logistic regression and other direct LDA methods under these settings. This is expected because there is substantial correlation among the features while both the plug-in extensions of LDA use diagonal estimates of Σ. In contrast, the performance of the direct LDA methods varies across the settings. For example, the DSDA has lower misclassification rates than the ROAD in most cases, while ROAD has better classification accuracy in Example 1. Utilizing the graph structures, high-dimensional LDA is further improved in GSLDA. As we can see from the results, GSLDA methods have the best performance among all methods in these four settings. In particular, the GSLDA method has lower misclassification rates than all other methods except its semi-supervised variant. Since the DSDA is the special case of the GSLDA with an empty graph, it is a good benchmark to quantify the benefit of using graph structures. In most cases, the GSLDA provides better model selection than the DSDA. Therefore, utilizing the graph structure does help us to improve the LDA classifier in high dimensions.
With respect to the semi-supervised GSLDA, due to the large amount of unlabeled data, it often has better graph estimation and yields more accurate classifiers. In fact, the semi-supervised GSLDA has the lowest misclassification rates among all methods in all cases. Furthermore, the semi-supervised GSLDA has superior model selection over the original GSLDA in most cases. This demonstrates the advantages of using unlabeled data.
We notice that models estimated by the semi-supervised GSLDA often have larger sizes, and sometimes more false positives in the coefficient vectors, than the original GSLDA classifiers. This is probably because the graph used in the semi-supervised GSLDA often has more edges. There are two possible reasons: (i) the true graph 𝒢̃ corresponding to Ω̃ has more edges than 𝒢, and (ii) graph estimation based on unlabeled data uses a much larger training dataset, which often leads to denser graph estimates. While a denser graph estimate may recover more connections among the features, it can also introduce more false edges. This effect is amplified by the difficulty of graph estimation with unlabeled data. As a consequence, the semi-supervised GSLDA may suffer from more false positives, as shown in Examples 2 and 3. To resolve this issue, we may consider using a more conservative graph estimate for the semi-supervised GSLDA.
6. Real data analysis
In this section, we implement our methods and several other existing classifiers on two real datasets. The first dataset is a genetic dataset with very high dimensions, and the second one consists of images of handwritten digits. We estimate the graphs from labeled training data and unlabeled data. We find that GSLDA methods have a good performance in both datasets and utilizing the feature structure is beneficial.
6.1. Arcene cancer data
Nowadays, genetic diagnosis is an important tool in clinical studies and medical practice. Using genetic information, we can estimate the potential risk of cancer for healthy people or determine cancer subtypes for patients. The Arcene dataset is a gene dataset of 88 cancer patients and 112 healthy individuals. The dataset contains 10,000 features and was originally used in the NIPS 2003 feature selection challenge (https://archive.ics.uci.edu/ml/datasets/Arcene). Out of the 10,000 features, 7000 are real genes while the other 3000 are noise features that have no predictive power and make the prediction harder. Besides the labeled data, there is an unlabeled dataset of 700 individuals, which is used to construct the graph for GSLDA-S. As in the previous simulation studies, we apply GSLDA and the other methods to this dataset.
The labeled data are randomly split into a training set and a test set, of sizes 150 and 50, respectively. All methods except the naive Bayes are tuned by 10-fold cross validation. The experiment is repeated 100 times and the results are summarized in Table 5.
Table 5:
Comparison of GSLDA and other methods on the Arcene dataset.
| | Error | Size |
|---|---|---|
| NB | 35.50 (0.62) | — |
| NSC | 36.05 (0.61) | 9934.46 (9.06) |
| SLDA | 34.64 (0.73) | 297 (4.17) |
| PLR | 28.36 (0.65) | 16.57 (0.90) |
| DSDA | 28.29 (0.72) | 30.96 (2.69) |
| LPD | 31.59 (1.33) | 10.95 (3.58) |
| ROAD | 29.29 (0.64) | 31.86 (3.43) |
| PLDA | 34.36 (0.61) | 9.39 (1.63) |
| SSSC | 27.93 (0.83) | — |
| GSLDA | 22.57 (0.70) | 229.36 (6.39) |
| GSLDA-S | 24.50 (0.68) | 319.57 (8.37) |
From Table 5, we can see that both GSLDA and semi-supervised GSLDA outperform other methods in prediction. Although semi-supervised GSLDA uses more data for graph estimation, its performance is inferior to GSLDA for this application, possibly due to the difficulty of graph estimation based on unlabeled data. In addition, the size of the unlabeled dataset is not substantially larger than that of the labeled dataset. Compared with PLR, DSDA and ROAD, our methods have significantly larger model sizes. This may indicate that many genes are related to each other. It is likely that those genes contribute to cancer together, and including all of them in modeling can potentially make the classifier more robust. This characteristic may also contribute to the good performance of the proposed two GSLDA methods.
6.2. Semeion handwritten digits dataset
The Semeion dataset (https://archive.ics.uci.edu/ml/datasets/Semeion+Handwritten+Digit) consists of 1593 images of handwritten digits. Each digit is in the form of a 16 × 16 grayscale image and saved as a vector of 256 features. We take a subset of the dataset that only contains digits 1 and 7, which are generally difficult to distinguish. We randomly choose 40 images for training, and 80 for graph estimation of the semi-supervised GSLDA after removing labels. The remaining 200 images are used for testing. Other settings are the same as the cancer example. Table 6 gives a summary of the results.
Table 6:
Comparison of GSLDA and other methods on the Semeion dataset.
| | Error | Size |
|---|---|---|
| NB | 13.81 (0.34) | — |
| NSC | 15.21 (0.44) | 84.74 (11.31) |
| SLDA | 14.43 (0.67) | 20.23 (2.80) |
| PLR | 18.69 (0.88) | 9.46 (0.40) |
| DSDA | 13.76 (0.66) | 16.76 (1.01) |
| LPD | 17.15 (0.86) | 15.32 (0.91) |
| ROAD | 19.73 (0.98) | 15.38 (1.25) |
| SSSC | 13.97 (0.75) | — |
| GSLDA | 12.65 (0.61) | 28.46 (1.45) |
| GSLDA-S | 11.23 (0.56) | 33.28 (1.32) |
As shown in Table 6, the semi-supervised GSLDA has excellent performance for this problem. It has the lowest misclassification rate among all methods in comparison. The original GSLDA method also has good classification accuracy for this problem. Moreover, we can see that both GSLDA methods have larger model sizes than other direct LDA methods, as in the previous analysis in Section 6.1.
7. Discussion
With many extensions in the literature, LDA can be readily applied to high-dimensional classification problems. In particular, the direct approaches of high-dimensional LDA are attractive due to their simplicity and good performance. Under the standard setting of LDA problems, we explore the relationship between the graph structure of features and the optimal discriminant vector β*. Our study shows that, by taking advantage of such structure, we can get better LDA classifiers in high dimensions. Based on this idea, we propose the GSLDA method. After investigating the overall graph structure of the Gaussian mixture population for unlabeled data, we further propose the semi-supervised GSLDA that can utilize unlabeled data. Both GSLDA methods have been evaluated on simulated and real data, which demonstrate the advantages of utilizing the graph structures. Moreover, we conclude that the performance of semi-supervised GSLDA depends on both the size of the unlabeled dataset and the graph complexity. When the graph structure is very complex, it is better to consider a conservative graph estimate for GSLDA. Finally, our focus in this paper is on binary problems. It will be useful to extend the methods for multicategory problems.
Table 2:
Performance comparisons of different classification methods for Example 2.
| | Error | FP | FN | Size |
|---|---|---|---|---|
| NB | 36.59 (0.43) | — | — | — |
| NSC | 17.46 (0.14) | 42.75 (2.16) | 25.96 (0.45) | 55.79 (2.48) |
| SLDA | 14.39 (0.12) | 19.28 (1.72) | 17.59 (0.43) | 40.69 (2.17) |
| PLR | 7.86 (0.11) | 15.83 (0.42) | 20.58 (0.29) | 34.25 (0.54) |
| DSDA | 6.96 (0.09) | 25.13 (1.22) | 17.21 (0.38) | 46.92 (1.46) |
| LPD | 8.84 (0.69) | 34.48 (1.56) | 17.98 (0.48) | 55.50 (1.97) |
| ROAD | 7.42 (0.12) | 25.16 (0.98) | 17.36 (0.35) | 46.80 (1.17) |
| PLDA | 16.48 (0.12) | 2.26 (0.48) | 32.69 (0.14) | 8.57 (0.57) |
| SSSC | 9.27 (0.17) | — | — | — |
| GSLDA | 6.60 (0.10) | 25.48 (1.83) | 15.41 (0.43) | 49.07 (2.19) |
| GSLDA-S | 5.56 (0.07) | 34.43 (2.52) | 3.33 (0.41) | 70.1 (2.77) |
| GSLDA-O | 6.19 (0.09) | 27.26 (1.72) | 7.37 (0.47) | 58.89 (2.08) |
| GSLDA-SO | 5.79 (0.07) | 30.78 (1.94) | 2.16 (0.39) | 67.62 (2.31) |
| Oracle | 3.32 (0.01) | 0 (0) | 0 (0) | 39 (0) |
Table 3:
Performance comparisons of different classification methods for Example 3.
| | Error | FP | FN | Size |
|---|---|---|---|---|
| NB | 36.86 (0.80) | — | — | — |
| NSC | 24.16 (0.84) | 29.15 (3.35) | 44.78 (1.62) | 50.37 (4.93) |
| SLDA | 13.28 (0.72) | 21.07 (2.29) | 40.59 (1.57) | 46.48 (3.87) |
| PLR | 11.09 (0.12) | 21.44 (0.56) | 42.08 (0.39) | 45.36 (0.75) |
| DSDA | 10.94 (0.15) | 30.32 (1.49) | 38.12 (0.63) | 58.20 (2.01) |
| LPD | 13.19 (0.73) | 41.67 (1.52) | 39.84 (0.82) | 67.83 (2.25) |
| ROAD | 11.25 (0.15) | 33.14 (1.46) | 37.53 (0.55) | 61.61 (1.92) |
| PLDA | 26.31 (0.68) | 22.34 (2.33) | 50.89 (1.12) | 37.45 (3.41) |
| SSSC | 13.57 (0.91) | — | — | — |
| GSLDA | 10.53 (0.10) | 27.34 (1.91) | 36.67 (0.85) | 56.67 (2.67) |
| GSLDA-S | 8.77 (0.08) | 34.08 (2.77) | 18.2 (0.72) | 81.88 (3.37) |
| GSLDA-O | 9.77 (0.08) | 36.87 (2.54) | 26.22 (0.78) | 76.65 (3.27) |
| GSLDA-SO | 8.91 (0.08) | 35.17 (2.37) | 16.31 (0.63) | 84.86 (3.01) |
| Oracle | 5.36 (0.02) | 0 (0) | 0 (0) | 66 (0) |
Acknowledgments
The authors would like to thank the Editor-in-Chief, Christian Genest, the Associate Editor, and reviewers for their valuable comments and suggestions which led to a much improved presentation. This research was supported in part by National Science Foundation Grants IIS1632951, DMS1821231, and National Institute of Health Grant R01GM126550.
Appendix A. Some comments on the GSLDA method
Appendix A.1. A graphical display of the discriminant vector decomposition
Appendix A.2. Connection between GSLDA and existing methods
We first consider the case when 𝒢 is a complete graph. Without loss of generality, we assume that there is a unique minimum weight, i.e., there exists an ℓ such that τℓ < τj for all j ≠ ℓ. In this case, for any {v(1), … , v(p)} with supp(v(j)) ⊆ {j} ∪ 𝒩j and v(1) + ⋯ + v(p) = β, we have

Σj τj||v(j)||2 ≥ τℓ Σj ||v(j)||2 ≥ τℓ||v(1) + ⋯ + v(p)||2 = τℓ||β||2.

By taking v(ℓ) = β and v(j) = 0 for all j ≠ ℓ, the regularization (5) becomes exactly τℓ||β||2. Similarly, we can show the equivalence in the case where 𝒢 consists of K disjoint complete subgraphs.
Figure A.1:

A 3-dimensional LDA example demonstrating how marginal differences of the three features (δ1, δ2, δ3) contribute to the predictive power of all features. Here ω23 = ω32 = 0. The terms around each node represent a decomposition of the corresponding coefficient. The gray scale of each term and the edge direction together indicate the source of the marginal differences.
Appendix B. Numerical results
Appendix B.1. Graph estimation results
To better understand the performance of our proposed GSLDA methods, we also present their graph estimation results. In particular, we compare the graphs estimated from labeled data (for the supervised GSLDA) and from unlabeled data (for the semi-supervised GSLDA) with the true graphs, namely the within-class graph 𝒢 and the overall graph 𝒢̃. The accuracy metrics are the numbers of true positives (TP) and false positives (FP).
Table B.1:
Graph estimation accuracy for all examples in the simulations. The graphs are estimated with labeled data (L) after within-group centering, or with unlabeled data (U). The former estimate is compared with 𝒢, and the latter is compared with both 𝒢 and 𝒢̃. The results are averaged over 100 repetitions and the standard errors are provided in parentheses.
| Graph Type | Data | TP | FP | Size | True Size |
|---|---|---|---|---|---|
| Block Sparse | L | 51.54 (0.25) | 6.86 (0.48) | 58.4 (0.55) | |
| | U | 100 (0) | 82.96 (0.66) | 182.96 (0.66) | |
| | U | 176.48 (0.46) | 6.48 (0.38) | 182.96 (0.66) | |
| AR(3) | L | 468.02 (1.31) | 69.64 (1.24) | 537.66 (1.34) | |
| | U | 1178.04 (0.38) | 76.72 (0.87) | 1254.76 (1.01) | |
| | U | 1235.76 (0.75) | 19 (0.61) | 1254.76 (1.01) | |
| Random Sparse | L | 353.14 (2.02) | 69.34 (1.25) | 422.48 (1.73) | |
| | U | 814.52 (0.18) | 72.74 (1.04) | 887.26 (1.12) | |
| | U | 866.14 (0.87) | 21.12 (0.70) | 887.26 (1.12) | |
| Scale-Free | L | 374.92 (1.37) | 32.44 (0.92) | 407.36 (1.48) | |
| | U | 709.88 (0.73) | 103.5 (1.00) | 813.38 (1.18) | |
| | U | 799.08 (1.05) | 14.3 (0.53) | 813.38 (1.18) | |
Appendix B.2. Additional simulation results
The misclassification rate alone may not reflect the full performance of a classification model, especially when the classes are unbalanced. Thus we present receiver operating characteristic (ROC) curves for the classification models. Besides the balanced class setting as in the main text, we also consider an unbalanced class setting in which Class-0 accounts for 80% of the whole dataset. As we can see from Figures B.1 and B.2, our methods still outperform the other methods, with higher sensitivity at each specificity level.
Figure B.1:

ROC Curve under the balanced setting for the four examples. The proportion of Class-0 sample is 50%. The ROC curve is computed based on 100 repetitions.
Figure B.2:

ROC Curve under the unbalanced setting for the four examples. In particular, the proportion of Class-0 sample is 80%. The ROC curve is computed based on 100 repetitions.
Appendix C. Proofs to the theoretical results
Appendix C.1. Proof of Proposition 1
The random variable X can be represented as X = ξZ1 + (1 − ξ)Z2, where (i) ξ ~ Bin(1, π1) is a Bernoulli random variable and (ii) Z1 and Z2 are drawn from the two population components, respectively. Moreover, ξ, Z1, and Z2 are mutually independent. We have var(Z1) = var(Z2) = Σ, EZ1 = μ(1), and EZ2 = μ(2). Then E(X) = π1μ(1) + π2μ(2) and, by the law of total variance,

var(X) = E{var(X | ξ)} + var{E(X | ξ)} = Σ + var(ξ)(μ(1) − μ(2))(μ(1) − μ(2))⊺.
Thus the overall covariance matrix is var(X) = Σ + π1π2δδ⊺, where δ = μ(1) − μ(2).
Now we verify the form of var(X)−1, i.e., the overall precision matrix of the mixture distribution. Setting c = π1π2/(1 + π1π2δ⊺Σ−1δ), we have
(Σ + π1π2δδ⊺)(Σ−1 − cΣ−1δδ⊺Σ−1) = I + {π1π2 − c(1 + π1π2δ⊺Σ−1δ)}δδ⊺Σ−1 = I.
Denote β* = Σ−1δ. Then we have var(X)−1 = Σ−1 − cβ*β*⊺. □
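Proposition 1 can be checked numerically. The short Python sketch below uses arbitrary illustrative values of Σ, δ, and π1 and verifies that Σ−1 − cβ*β*⊺ inverts Σ + π1π2δδ⊺.

```python
import numpy as np

rng = np.random.default_rng(0)
p, pi1 = 5, 0.4
pi2 = 1.0 - pi1
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)          # a positive definite within-class covariance
delta = rng.standard_normal(p)           # mean difference vector mu1 - mu2

Sigma_overall = Sigma + pi1 * pi2 * np.outer(delta, delta)   # var(X) from Proposition 1

beta_star = np.linalg.solve(Sigma, delta)                    # beta* = Sigma^{-1} delta
c = pi1 * pi2 / (1.0 + pi1 * pi2 * (delta @ beta_star))
Omega_overall = np.linalg.inv(Sigma) - c * np.outer(beta_star, beta_star)

# The two expressions for the overall precision matrix should agree.
assert np.allclose(Omega_overall, np.linalg.inv(Sigma_overall))
```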
Appendix C.2. Proof of Theorem 1
Before the proof, we introduce a lemma from [7]; its proof is omitted.
Lemma 4. Let ξ1, … , ξn be independent random variables with mean zero. Suppose that there exist some t > 0 and B̄n > 0 such that E{ξ12 exp(t|ξ1|)} + ⋯ + E{ξn2 exp(t|ξn|)} ≤ B̄n2. Set Ct = t + t−1. Then, uniformly for 0 < x ≤ B̄n, P(ξ1 + ⋯ + ξn ≥ CtB̄nx) ≤ exp(−x2/2).
Denote , , and . By simple calculations, one can show that for all ,
where  is the centered feature vector. Thus the loss function of GSLDA in (4) is equivalent to . In the rest of the proof, we assume that the sample X has been centered. Then the GSLDA formulation (4) becomes
(C.1)
Under the assumption (A1), we can define
(C.2)
where  denotes the subgraph of  corresponding to A. If we can show that (i) all elements of  are non-zero, and (ii)  with  and  solves (C.1), then GSLDA estimation recovers all significant features accurately.
We first show statement (i). By Section 4.6 of [31], the formulation (C.2) is equivalent to where
Since this is a convex optimization problem, any solution {u(j) : j ∈ A} satisfies the KKT conditions [4]; namely, for all j ∈ A, either
or
where γ = Σj∈A u(j). Thus we have , and we can write as , where satisfies ||tA||∞ ≤ 1. We have
in which the second inequality holds for sufficiently large n because φξ1 ≤ 1 and ξ2 ≤ (1 − φξ1)−1φ2ξ1. If ξ1 ≤ ϵ and , then L1 = O(ϵ) + λτ*φ/2 > 0, which proves statement (i). By Lemma 4, statement (i) holds with probability at least 1 − 2s2 exp(−a1nϵ2/s2) − 2s exp(−a2nϵ2), for some positive constants a1 and a2.
Now we prove statement (ii). The formulation (C.1) is equivalent to , where
(C.3)
This is also a convex optimization problem, and the KKT conditions of formulation (C.3) state that, for all j ∈ {1, … , p}, either
(C.4)
or
(C.5)
where β = v(1) + ⋯ + v(p). Let v(j) = 0 for all j ∈ AC, and for all j ∈ A. Then and βAC = 0.
For j ∈ A, (C.4) holds owing to the definition of . For j ∈ AC, . Denote if and , then
By Lemma 4, statement (ii) holds with probability at least 1 − 2ps exp(−a1nϵ2/s2) − 2p exp(−a2nϵ2). By taking , the active set is recovered and  with probability at least 1 − 2s2 exp(−a1nϵ2/s2) − 2ps exp(−a1nϵ2/s2) − 2p exp(−a2nϵ2) = 1 − O(p−C1) for some C1 > 0. □
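For readers less familiar with conditions of the form (C.4)–(C.5), the LaTeX sketch below records the generic subgradient (KKT) conditions for a weighted group penalty. It assumes the penalty takes the form Σj τj‖v(j)‖2 with β = Σj v(j) (the exact groups and loss follow the definitions earlier in this appendix); it is a reference point, not a restatement of (C.4)–(C.5).

```latex
% Generic KKT conditions for \min_{v} L\big(\sum_j v^{(j)}\big) + \lambda \sum_j \tau_j \|v^{(j)}\|_2,
% where \nabla_j denotes the gradient restricted to the coordinates of group j
% (a sketch under the assumptions stated in the lead-in).
\[
  \text{for each group } j:\quad
  \begin{cases}
    \nabla_j L(\beta) + \lambda \tau_j \frac{v^{(j)}}{\|v^{(j)}\|_2} = 0, & v^{(j)} \neq 0,\\[1ex]
    \big\| \nabla_j L(\beta) \big\|_2 \le \lambda \tau_j, & v^{(j)} = 0,
  \end{cases}
  \qquad \beta = \textstyle\sum_{j} v^{(j)}.
\]
```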
Appendix C.3. Proof of Theorem 2
The proof uses the following lemma from [30].
Lemma 5. Denote a subspace of and its orthogonal complement. For a regularized estimation problem
where
- (i) R is a norm and is decomposable with respect to (), i.e., R(θ + η) = R(θ) + R(η) for all ;
- (ii) L is convex and differentiable, and satisfies the restricted strong convexity condition with curvature κL, i.e., for some θ*,  for all Δ such that .

Let λ ≥ 2R*{∇L(θ*)}, where R* denotes the dual norm of R. Then any solution to the problem satisfies
where .
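For orientation only, and only as we recall it from [30]: the conclusion of the corresponding result there bounds the estimation error, up to constants, in terms of λ, the curvature κL, and a subspace compatibility constant Ψ(𝕄̄). A schematic LaTeX rendering (not the exact display omitted above) is:

```latex
% Schematic form of the bound in [30], up to constants; offered only as a reminder.
\[
  \|\hat{\theta}_{\lambda} - \theta^{*}\|^{2} \;\lesssim\; \frac{\lambda^{2}}{\kappa_{L}^{2}}\,\Psi^{2}(\bar{\mathbb{M}}),
  \qquad
  \Psi(\bar{\mathbb{M}}) \;=\; \sup_{\theta \in \bar{\mathbb{M}} \setminus \{0\}} \frac{R(\theta)}{\|\theta\|}.
\]
```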
Proof. In the GSLDA formulation, the loss function is , and the regularization is . It has been shown in [31] that R is a norm and its dual norm is . When we take .
For some ϵ > 0 denote the event . Then by Lemma 4, . Under the event ,
We take where C2 > (a1 ∧ a2)−1. Then . Under the event with ϵ ≤ cσ for some c > 0, for sufficiently large n, we have for . Thus for all .
We take . Then and . Moreover,
Therefore, by Lemma 5, we have
with probability at least 1 − 2ps exp(−a1nϵ2) − 2p exp(−a2nϵ2) ≥ 1 − sp−C3, where C3 = C2(a1 ∧ a2) − 1 > 0. □
Appendix C.4. Proof of Theorem 3
We use the same notation as in the proofs above. Without loss of generality, we assume that μ(1) + μ(2) = 0; then μ(1) = δ/2, μ(2) = −δ/2, and . According to Proposition 2, we have
We will use the following property of the standard Gaussian distribution function [6]:
(C.6)
For k ∈ {1, 2}, let
Since , it suffices to verify the orders of r(k) and Δ.
According to Theorem 2,  with probability going to 1. Moreover, by the definition of β†, we have β† = {4/(4 + Δ)}β* and β†⊺Σβ† = 16Δ/(4 + Δ)2. Since
for sufficiently large n, we have
in which the first inequality holds because for sufficiently large n. Moreover, under we have,
Therefore,
Using the property (C.6), then we have
Since , , and , we have r(1)Δ1/2 → 0 and thus . Similarly, we can show , which completes the proof of the theorem. □
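To connect the error-rate calculations above with something computable: under the two-Gaussian model with μ(1) = δ/2, μ(2) = −δ/2, common covariance Σ, and equal class priors, the Class-1 misclassification probability of the linear rule that assigns x to Class 1 when w⊺x > 0 is Φ(−w⊺δ/{2(w⊺Σw)1/2}). The Python sketch below evaluates this quantity for illustrative inputs; all names and values are placeholders, and the final comparison only reflects the optimality of the Bayes direction β* = Σ−1δ.

```python
import numpy as np
from scipy.stats import norm

def conditional_error(w, delta, Sigma):
    """Class-1 error of the rule 'assign to Class 1 if w^T x > 0' under the
    two-Gaussian model with mu1 = delta/2, mu2 = -delta/2, common covariance
    Sigma, and equal class priors (illustrative setting only)."""
    margin = w @ delta / (2.0 * np.sqrt(w @ Sigma @ w))
    return norm.cdf(-margin)

# Illustrative check: the Bayes direction beta* versus a perturbed direction.
rng = np.random.default_rng(1)
p = 5
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)
delta = rng.standard_normal(p)
beta_star = np.linalg.solve(Sigma, delta)

err_bayes = conditional_error(beta_star, delta, Sigma)
err_noisy = conditional_error(beta_star + 0.5 * rng.standard_normal(p), delta, Sigma)
print(err_bayes, err_noisy)   # the Bayes direction should not do worse
```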
References
- [1] Bickel PJ, Levina E, Some theory for Fisher’s linear discriminant function, ‘naive Bayes’, and some alternatives when there are many more variables than observations, Bernoulli 10 (2004) 989–1010.
- [2] Bishop CM, Pattern Recognition and Machine Learning (Information Science and Statistics), Springer-Verlag, Secaucus, NJ, 2006.
- [3] Bondell HD, Reich BJ, Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR, Biometrics 64 (2008) 115–123.
- [4] Boyd S, Vandenberghe L, Convex Optimization, Cambridge University Press, Cambridge, 2004.
- [5] Cai D, He X, Han J, Semi-supervised discriminant analysis, in: 2007 IEEE 11th International Conference on Computer Vision, IEEE, pp. 1–7.
- [6] Cai T, Liu W, A direct estimation approach to sparse linear discriminant analysis, J. Amer. Statist. Assoc. 106 (2011) 1566–1577.
- [7] Cai T, Liu W, Luo X, A constrained ℓ1 minimization approach to sparse precision matrix estimation, J. Amer. Statist. Assoc. 106 (2011) 594–607.
- [8] Chen J, Chen Z, Extended Bayesian information criteria for model selection with large model spaces, Biometrika 95 (2008) 759–771.
- [9] Chen S, Witten DM, Shojaie A, Selection and estimation for mixed graphical models, Biometrika 102 (2014) 47–64.
- [10] Clemmensen L, Hastie T, Witten D, Ersbøll B, Sparse discriminant analysis, Technometrics 53 (2011) 406–413.
- [11] Fan J, Fan Y, High dimensional classification using features annealed independence rules, Ann. Statist. 36 (2008) 2605–2637.
- [12] Fan J, Feng Y, Tong X, A road to classification in high dimensional space: The regularized optimal affine discriminant, J. R. Stat. Soc. Ser. B Stat. Methodol. 74 (2012) 745–771.
- [13] Fisher RA, The use of multiple measurements in taxonomic problems, Ann. Eugenics 7 (1936) 179–188.
- [14] Friedman J, Hastie T, Tibshirani RJ, Sparse inverse covariance estimation with the graphical lasso, Biostatistics 9 (2008) 432–441.
- [15] Hand DJ, Classifier technology and the illusion of progress, Statist. Sci. 21 (2006) 1–14.
- [16] Hastie T, Tibshirani RJ, Buja A, Flexible discriminant analysis by optimal scoring, J. Amer. Statist. Assoc. 89 (1994) 1255–1270.
- [17] Hastie T, Tibshirani RJ, Friedman J, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer, New York, 2009.
- [18] Kim S, Pan W, Shen X, Network-based penalized regression with application to genomic data, Biometrics 69 (2013) 582–593.
- [19] Li C, Li H, Network-constrained regularization and variable selection for analysis of genomic data, Bioinformatics 24 (2008) 1175–1182.
- [20] Liu B, Shen X, Pan W, Semi-supervised spectral clustering with application to detect population stratification, Frontiers in Genetics 4 (2013) 215.
- [21] Liu Y, Yuan M, Reinforced multicategory support vector machines, J. Comput. Graph. Statist. 20 (2011) 901–919.
- [22] Luo S, Chen Z, Edge detection in sparse Gaussian graphical models, Comput. Statist. Data Anal. 70 (2014) 138–152.
- [23] Luo S, Chen Z, Sequential lasso cum EBIC for feature selection with ultra-high dimensional feature space, J. Amer. Statist. Assoc. 109 (2014) 1229–1240.
- [24] Mai Q, Yang Y, Zou H, Multiclass sparse discriminant analysis, arXiv preprint arXiv:1504.05845 (2015).
- [25] Mai Q, Zou H, A note on the connection and equivalence of three sparse linear discriminant analysis methods, Technometrics 55 (2013) 243–246.
- [26] Mai Q, Zou H, Yuan M, A direct approach to sparse discriminant analysis in ultra-high dimensions, Biometrika 99 (2012) 29–42.
- [27] Meier L, van de Geer S, Bühlmann P, The group lasso for logistic regression, J. R. Stat. Soc. Ser. B Stat. Methodol. 70 (2008) 53–71.
- [28] Meinshausen N, Bühlmann P, High-dimensional graphs and variable selection with the lasso, Ann. Statist. 34 (2006) 1436–1462.
- [29] Min W, Liu J, Zhang S, Network-regularized sparse logistic regression models for clinical risk prediction and biomarker discovery, IEEE/ACM Trans. Comput. Biol. Bioinform. 15 (2018) 944–953.
- [30] Negahban SN, Ravikumar P, Wainwright MJ, Yu B, A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers, Statist. Sci. 27 (2012) 538–557.
- [31] Obozinski G, Jacob L, Vert J-P, Group lasso with overlaps: The latent group lasso approach, arXiv preprint arXiv:1110.0413 (2011).
- [32] Pan W, Shen X, Penalized model-based clustering with application to variable selection, J. Machine Learn. Res. 8 (2007) 1145–1164.
- [33] Pan W, Xie B, Shen X, Incorporating predictor network in penalized regression with application to microarray data, Biometrics 66 (2010) 474–484.
- [34] Pang H, Liu H, Vanderbei R, The fastclime package for linear programming and large-scale precision matrix estimation in R, J. Machine Learn. Res. 15 (2014) 489–493.
- [35] Shao J, Wang Y, Deng X, Wang S, Sparse linear discriminant analysis by thresholding for high dimensional data, Ann. Statist. 39 (2011) 1241–1265.
- [36] Tibshirani RJ, Hastie T, Narasimhan B, Chu G, Diagnosis of multiple cancer types by shrunken centroids of gene expression, Proc. Natl. Acad. Sci. 99 (2002) 6567–6572.
- [37] Vanderbei RJ, Linear Programming: Foundations and Extensions, 4th ed., Springer, 2014.
- [38] Voorman A, Shojaie A, Witten D, Graph estimation with joint additive models, Biometrika 101 (2013) 85–101.
- [39] Witten DM, Tibshirani RJ, Penalized classification using Fisher’s linear discriminant, J. R. Stat. Soc. Ser. B Stat. Methodol. 73 (2011) 753–772.
- [40] Wu M, Zhu L, Feng X, Network-based feature screening with applications to genome data, Ann. Appl. Statist. 12 (2018) 1250–1270.
- [41] Wu MC, Zhang L, Wang Z, Christiani DC, Lin X, Sparse linear discriminant analysis for simultaneous testing for the significance of a gene set/pathway and gene selection, Bioinformatics 25 (2009) 1145–1151.
- [42] Yang S, Yuan L, Lai Y-C, Shen X, Wonka P, Ye J, Feature grouping and selection over an undirected graph, ACM, 2012, pp. 922–930.
- [43] Yang Y, Zou H, A fast unified algorithm for solving group-lasso penalize learning problems, Stat. Comput. 25 (2015) 1129–1141.
- [44] Yu G, Liu Y, Sparse regression incorporating graphical structure among predictors, J. Amer. Statist. Assoc. 111 (2016) 707–720.
- [45] Yuan M, Lin Y, Model selection and estimation in regression with grouped variables, J. R. Stat. Soc. Ser. B Stat. Methodol. 68 (2006) 49–67.
- [46] Yuan M, Lin Y, Model selection and estimation in the Gaussian graphical model, Biometrika 94 (2007) 19–35.
- [47] Zhang C, Liu Y, Multicategory large-margin unified machines, J. Machine Learn. Res. 14 (2013) 1349–1386.
- [48] Zhang C, Liu Y, Wang J, Zhu H, Reinforced angle-based multicategory support vector machines, J. Comput. Graph. Statist. 25 (2016) 806–825.
- [49] Zhang W, Wan Y-W, Allen GI, Pang K, Anderson ML, Liu Z, Molecular pathway identification using biological network-regularized logistic models, BMC Genomics 14 (2013) S7.
- [50] Zhao P, Yu B, On model selection consistency of lasso, J. Machine Learn. Res. 7 (2006) 2541–2563.
- [51] Zhao S, Shojaie A, A significance test for graph-constrained estimation, Biometrics 72 (2016) 484–493.
- [52] Zhou H, Pan W, Shen X, Penalized model-based clustering with unconstrained covariance matrices, Electron. J. Statist. 3 (2009) 1473–1496.
- [53] Zhu Y, Shen X, Pan W, Simultaneous grouping pursuit and feature selection over an undirected graph, J. Amer. Statist. Assoc. 108 (2013) 713–725.
- [53].Zhu Y, Shen X, Pan W, Simultaneous grouping pursuit and feature selection over an undirected graph, J. Amer. Statist. Assoc 108 (2013) 713–725. [DOI] [PMC free article] [PubMed] [Google Scholar]
