Network-Based Logistic Classification with an Enhanced L 1/2 Solver Reveals Biomarker and Subnetwork Signatures for Diagnosing Lung Cancer

Hai-Hui Huang; Yong Liang; Xiao-Ying Liu

doi:10.1155/2015/713953

. 2015 Jun 16;2015:713953. doi: 10.1155/2015/713953

Network-Based Logistic Classification with an Enhanced L _1/2 Solver Reveals Biomarker and Subnetwork Signatures for Diagnosing Lung Cancer

Hai-Hui Huang ¹, Yong Liang ^1,^*, Xiao-Ying Liu ¹

PMCID: PMC4488258 PMID: 26185761

Abstract

Identifying biomarker and signaling pathway is a critical step in genomic studies, in which the regularization method is a widely used feature extraction approach. However, most of the regularizers are based on L ₁-norm and their results are not good enough for sparsity and interpretation and are asymptotically biased, especially in genomic research. Recently, we gained a large amount of molecular interaction information about the disease-related biological processes and gathered them through various databases, which focused on many aspects of biological systems. In this paper, we use an enhanced L _1/2 penalized solver to penalize network-constrained logistic regression model called an enhanced L _1/2 net, where the predictors are based on gene-expression data with biologic network knowledge. Extensive simulation studies showed that our proposed approach outperforms L ₁ regularization, the old L _1/2 penalized solver, and the Elastic net approaches in terms of classification accuracy and stability. Furthermore, we applied our method for lung cancer data analysis and found that our method achieves higher predictive accuracy than L ₁ regularization, the old L _1/2 penalized solver, and the Elastic net approaches, while fewer but informative biomarkers and pathways are selected.

1. Introduction

Identifying molecular biomarker or signaling pathway involved in a phenotype is a particularly important problem in genomic studies. Logistic regression is a powerful discriminating method and has an explicit statistical interpretation which can obtain probabilities of classification regarding the class label information.

A key challenge in identifying diagnosis or prognosis biomarkers using the logistic regression model is that the number of observations is much smaller than the size of measured biomarkers in most of the genomic studies. Such limitation causes instability in the algorithms used to select gene marker. Regularization methods have been widely used in order to deal with this problem of high dimensionality. For example, Shevade and Keerthi proposed the sparse logistic regression based on the Lasso regularization [1, 2]. Meier et al. investigated logistic regression with group Lasso [3]. Usually, the Lasso type procedures are often called L ₁-norm type regularization methods. However, L ₁ regularization may yield inconsistent selections when applied to variable selection in some situations [4] and often introduces the extra bias in the estimation [5]. In many genomic studies, we need a sparser solution for interpretation and accurate outcomes, but L ₁ regularization has a gap to meet these requirements. Thus, a further improvement of regularization is urgently required. L _q (0 < q < 1) regularization can assuredly generate more sparse and precise solutions than L ₁ regularization. Moreover, L _1/2 penalty can be taken as a representative of L _q (0 < q < 1) penalty and has demonstrated many attractive properties which do not appear in some L ₁ regularization approaches, such as unbiasedness, sparsity, and oracle properties [6–8].

So far, we observed dense molecular interaction information about the disease-related biological processes and gathered it through databases focused on many aspects of biological systems. For example, BioGRID records collected various biological interactions from more than 43,468 publications [9]. These regulatory relationships are usually represented by a network. Combining these pieces of graphic information extracted from the biological process with an analysis of the gene-expression data had provided useful prior information to detective noise and removes confounding factors from biological data for several classification and regression models [10–14].

Inspired by the aforementioned methods and ideas, here, we define a network-constrained logistic regression model with L _1/2 penalty following the framework established by [11], where the predictors are based on the gene-expression data with biologic network knowledge. The proposed model is aimed at identifying some biomarkers and subnetworks regarding diseases. In order to achieve a better prediction, we use an enhanced half thresholding algorithm for L _1/2 regularization, which is more efficient than the old half thresholding approach in the literature [6, 15, 16].

The rest of the paper is organized as follows. In Section 2, we proposed a new version of the network-constrained logistic regression model with L _1/2 regularization. In Section 3, we presented an enhanced half thresholding method for L _1/2 regularization and the corresponding coordinate descent algorithm. In Section 4, we evaluated the performance of our proposed approach on the simulated data and presented the applications of the proposed methods to an analysis of lung cancer data. We concluded the paper with Section 5.

2. L _1/2 Penalized Network-Constrained Logistic Regression Model

Generally, assuming that dataset D has n samples, D = {(X ₁, y ₁), (X ₂, y ₂),…, (X _n, y _n)}, where X _i = (x _i1, x _i2,…, x _ip) is ith sample with p genes and y _i is the corresponding variable that takes a value of 0 or 1. Define a classifier f(x) = e ^x/(1 + e ^x) and the logistic regression is defined as

\begin{matrix} P (y_{i} = 1 ∣ X_{i}) = f ({X^{'}}_{i} β) = \frac{\exp ({X^{'}}_{i} β)}{1 + \exp ({X^{'}}_{i} β)}, \end{matrix}

(1)

where β = (β ₁,…, β _p) are the coefficients to be estimated. We can obtain β by minimizing the log-likelihood function of the logistic regression. Following [11], to combine biological network with an analysis of the gene microarray data, we used a Laplacian constraint approach here. Consider a graph G = (V, E), where V is the set of genes that meet p explanatory variables and E is the set of edges. If gene u and gene v are connected, then there is an edge between gene u and gene v, which is denoted by E _uv = 1; else E _uv = 0. w _uv denotes the weight of edge E _uv. The normalized Laplacian matrix L for G is defined by

\begin{matrix} L_{u v} = \{\begin{matrix} 1 - \frac{w_{u v}}{d_{u}} & if u = v, d_{u} \neq 0, \\ - \frac{w_{u v}}{\sqrt{d_{u} d_{v}}} & if u, v are  adjacent \\ 0 & otherwise, \end{matrix} \end{matrix}

(2)

where d _u and d _v are the degrees of genes u and v, respectively. The degrees of gene u (or v) describe the number of the edges that connected with u (or v). For λ ≥ 0, the network-constrained logistic regression model is presented as

\begin{matrix} L (λ, β) = - \sum_{i = 1}^{n} \{y_{i} \log [f ({X^{'}}_{i} β)] + (1 - y_{i}) \log [1 - f ({X^{'}}_{i} β)]\} + λ β^{T} L β, \end{matrix}

(3)

where the first term in (3) is the log-likelihood function of the logistic model and the second term is a network constraint based on the Laplacian matrix, which induces a smooth solution of β on the graph.

Directly computing (3) performs poorly for both prediction and biomarker selection purposes when the gene number p ≫ the sample size n. Therefore, the regularization approach is vitally needed. When adding a regularization term to (3), the sparse network-constrained logistic regression can be written as

\begin{matrix} L (λ_{1}, λ_{2}, β) = - \sum_{i = 1}^{n} \{y_{i} \log [f ({X^{'}}_{i} β)] + (1 - y_{i}) \log [1 - f ({X^{'}}_{i} β)]\} + λ_{1} \sum_{j = 1}^{p} P (β_{j}) \\ + λ_{2} β^{T} L β, \end{matrix}

(4)

where λ ₁ > 0 is a regularization parameter. In Zhang et al. [13], the authors used Lasso (L ₁) which has the regularization term P(β) = ∑_j=1 ^p|β _j| to penalize (4). However, the result of the Lasso type (L ₁) regularization is not good enough for interpretation, especially in genomic research. Besides this, L ₁ regularization is asymptotically biased [17, 18]. To improve the solution's sparsity and its predictive accuracy, we need to think beyond L ₁ regularization to L _q penalties. In mathematics, L _q (0 < q < 1) type regularization |β|_q = ∑|β|^q with the lower value of q would lead to better solutions with more sparsity and gives asymptotically unbiased estimates [17]. Moreover, L _1/2 penalty can be taken as a representative of L _q (0 < q < 1) penalty and has permitted an analytically expressive thresholding representation [6, 7]. Therefore, we proposed a novel L _1/2 net approach based on L _1/2 regularization to penalize the network-constrained logistic regression model, as shown in

\begin{matrix} L (λ_{1}, λ_{2}, β) = - \sum_{i = 1}^{n} \{y_{i} \log [f ({X^{'}}_{i} β)] + (1 - y_{i}) \log [1 - f ({X^{'}}_{i} β)]\} + λ_{1} {|β|}_{1 / 2} \\ + λ_{2} β^{T} L β, \end{matrix}

(5)

where |β|_1/2 = ∑_j=1 ^p|β _j|^1/2.

3. A Coordinate Descent Algorithm for the Network-Constrained Logistic Model with the Enhanced L _1/2 Thresholding Operator

L _1/2 penalty function is nonconvex, which raises numerical challenges in fitting the models. Recently, the coordinate descent algorithms [19] for solving nonconvex regularization models (SCAD [20], MCP [21]) have shown significant efficiency and convergence [22]. Since the computational burden increases only linearly with the feature number p, the coordinate descent algorithm can be a powerful tool for solving high-dimensional problems. Its standard procedure can be demonstrated as follows: for every coefficient β _j (j = 1,2,…, p), to partially optimize the target function with respect to β _j, and fix the remaining elements β _k(k = 1,2,…, p and k ≠ j) at their most recently updated values. The specific form of updating β depends on the thresholding operator of the penalty.

In this paper, we present an enhanced L _1/2 thresholding operator for the coordinate descent algorithm:

\begin{matrix} β_{j} \\ = Enhanced_Half (ω_{j}, λ) \\ = \{\begin{matrix} \frac{2}{3} ω_{j} (1 + \cos (\frac{2 (π - φ_{λ} (ω_{j}))}{3})) & if |ω_{j}| > \frac{\sqrt[3]{54}}{4} {(λ)}^{2 / 3} \\ 0 & otherwise, \end{matrix} \end{matrix}

(6)

where φ _λ(ω) = arccos⁡((λ/8)(|ω|/3)^−3/2), π = 3.14, $ω_{j} = \sum_{i = 1}^{n} x_{i j} (y_{i} - {\tilde{y}}_{i}^{(j)})$ , and ${\tilde{y}}_{i}^{(j)} = \sum_{k \neq j} x_{i k}^{} β_{k}$ as the partial residual for fitting β _j.

Remark. This enhanced L _1/2 thresholding operator $(\sqrt[3]{54} / 4) {(λ)}^{2 / 3}$ outperforms the old L _1/2 thresholding (3/4)(λ)^2/3 introduced in [6, 15, 16]. We know that the quantity of the regularization solutions depends seriously on the value of the regularization parameter λ. Based on this enhanced L _1/2 thresholding operator, when λ is chosen by some efficient strategies for the parameter tuning, such as cross validation, the convergence of algorithm (6) is proved [7].

The Laplacian matrix L is nonnegative definite; thus, it can be written as L = SS ^T by Cholesky decomposition. Following C. Li and H. Li [11] approach, (4) can be expressed as

\begin{matrix} L (λ_{1}, λ_{2}, β) = L (γ, β^{*}) = - \sum_{i = 1}^{n} \{y_{i}^{*} \log [f ({X^{*}}_{i} β^{*})] + (1 - y_{i}^{*}) \log [1 - f ({X^{*}}_{i} β^{*})]\} + \sum_{j = 1}^{p} {γ |β_{j}^{*}|}^{1 / 2}, \end{matrix}

(7)

where ${X^{*}}_{(n + p) \times p} = {(1 + λ_{2})}^{- 1 / 2} (\begin{matrix} X \\ \sqrt{λ_{2}} S^{T} \end{matrix})$ , $Y_{(n + p)}^{*} = (\begin{matrix} Y \\ 0 \end{matrix})$ , $β^{*} = \sqrt{1 + λ_{2}} β$ , and γ is the regularization parameter and can be expressed as $γ = λ_{1} / \sqrt{1 + λ_{2}}$ .

One-term Taylor series expansion for (7) can be written as

\begin{matrix} L (γ, β^{*}) \approx \frac{1}{2 n} \sum_{i = 1}^{n} {(Z_{i} - {X^{*}}_{i} β^{*})}^{'} W_{i} (Z_{i} - {X^{*}}_{i} β^{*}) \\ + \sum_{j = 1}^{p} P ({β^{*}}_{j}), \end{matrix}

(8)

where $Z_{i} = {X^{*}}_{i} \tilde{β^{*}} + (y_{i}^{*} - f ({X^{*}}_{i} \tilde{β^{*}})) / f ({X^{*}}_{i} \tilde{β^{*}}) (1 - f ({X^{*}}_{i} \tilde{β^{*}})$ is the estimated response and $W_{i} = f ({X^{*}}_{i} \tilde{β^{*}}) (1 - f ({X^{*}}_{i} \tilde{β^{*}})$ is the weight for the estimated response. $f ({X^{*}}_{i} \tilde{β^{*}}) = \exp ({X^{*}}_{i} \tilde{β^{*}}) / (1 + \exp ({X^{*}}_{i} \tilde{β^{*}}))$ is the evaluated value under the current parameters. Thus, we can redefine the partial residual for fitting current $\tilde{β^{*}}$ as ${\overset{ˇ}{Z}}_{i}^{(j)} = \sum_{i}^{n} W_{i} ({\tilde{Z}}_{i} - \sum_{k \neq j} x_{i k}^{*} {\tilde{β^{*}}}_{k})$ and $ω_{j} = \sum_{i = 1}^{n} x_{i j}^{*} (Z_{i} - {\overset{ˇ}{Z}}_{i}^{(j)})$ . The procedure of the coordinate descent algorithm for L _1/2 penalized network-constrained logistic model is described as follows.

Algorithm 1 (the coordinate descent algorithm for L _1/2 penalized network-constrained logistic model). —

We consider the following.

Step 1. Initialize all β _j(m) ← 0 (j = 1,2,…, p) and y ^∗, X ^∗, and set m ← 0, γ chosen by cross validation.

Step 2. Calculate Z(m) and W(m) and approximate the loss function (8) based on the current β(m).

Step 3. Update each β _j(m) and cycle over j = 1,…, p, until β _j(m) does not change.

Step 3.1. Compute Z _i ^(j)(m) ← ∑_i=1 ⁿ W _i(m)(Z _i(m) − ∑_k≠j x _ik ^∗ β _k(m)) and $ω_{j} (m) \leftarrow \sum_{i = 1}^{n} x_{i j} (Z_{i} (m) - {\overset{ˇ}{Z}}_{i}^{(j)} (m))$ .

Step 3.2. Update β _j(m) ← Enhanced_Half(ω _j(m), γ).

Step 4. Let m ← m + 1, β(m + 1) ← β(m).

If β(m) dose not converge, then repeat Steps 2 and 3.

4. Simulation and Application

4.1. Analyses of Simulated Data

We evaluate the performance of four methods: the network-constrained logistic regression models with L ₁ regularization (L ₁ net), L _1/2 regularization with old thresholding value (3/4)(λ)^2/3 (L _1/2 net) and with the enhanced thresholding value $(\sqrt[3]{54} / 4) {(λ)}^{2 / 3}$ (enhanced L _1/2 net), and the Elastic net regularization approach (Elastic net). We first simulated the graph structure to mimic gene regulatory network: assuming that the graph consists of 200 independent transcription factors (TFs) and each TF regulates 10 unlike genes, so there are a total of 2200 variables, X = (x ₁, x ₂,…, x _p), p = 2200. The training and the independent test data sets include the sample sizes of 100, respectively. Each TF x _n and its regulated genes x _m were generated by the normal distribution N(0,1). We set the correlation rate between x _n and its regulated gene x _m as 0.75, x _m = (1 − 0.75) × x _m + (0.75) × x _n. The binary responder y _i (1 ≤ i ≤ 100), which is associated with the matrix X of TFs and their regulated genes, is calculated based on the following formula and rule:

\begin{matrix} y_{i} = 1 (Label 1), \\ if P (y_{i} = 1 ∣ X_{i}) = \frac{\exp (X_{i} β + ϵ)}{1 + \exp (X_{i} β + ϵ)} ⩾ 0.5; else y_{i} = 0 (Label 0), \end{matrix}

(9)

where $β = (2, \underset{10}{\underset{︸}{2 / \sqrt{5}, \dots, 2 / \sqrt{5}}}, - 2, \underset{10}{\underset{︸}{- 2 / \sqrt{5}, \dots, - 2 / \sqrt{5}}}, 4$ , $\underset{10}{\underset{︸}{4 / \sqrt{5}, \dots, 4 / \sqrt{5}}}, - 4, \underset{10}{\underset{︸}{- 4 / \sqrt{5}, \dots, - 4 / \sqrt{5}}}, 0, \dots, 0)$ for Model 1, and ε ~ N(0, σ _e ²).

Model 2 was defined similar to Model 1, except that we considered the case when the TF can have positive and negative effects on its regulated genes at the same time:

\begin{matrix} β = (2, \frac{- 2}{\sqrt{5}}, \frac{- 2}{\sqrt{5}}, \frac{- 2}{\sqrt{5}}, \underset{7}{\underset{︸}{\frac{2}{\sqrt{5}}, \dots, \frac{2}{\sqrt{5}}}}, - 2, \frac{2}{\sqrt{5}}, \frac{2}{\sqrt{5}}, \frac{2}{\sqrt{5}}, \underset{7}{\underset{︸}{\frac{- 2}{\sqrt{5}}, \dots, \frac{- 2}{\sqrt{5}}}}, 4, \frac{- 4}{\sqrt{5}}, \frac{- 4}{\sqrt{5}}, \frac{- 4}{\sqrt{5}}, \underset{7}{\underset{︸}{\frac{4}{\sqrt{5}}, \dots, \frac{4}{\sqrt{5}}}}, 4, \frac{4}{\sqrt{5}}, \frac{4}{\sqrt{5}}, \frac{4}{\sqrt{5}}, \underset{7}{\underset{︸}{\frac{- 4}{\sqrt{5}}, \dots, \frac{- 4}{\sqrt{5}}}}, 0, \dots, 0) . \end{matrix}

(10)

In these two models, the 10-fold cross validation approach was conducted on the training datasets to tune the regularization parameters of the enhanced L _1/2 net, L _1/2 net, and L ₁ net. Both penalized parameters for L ₁ and ridge regularization in the Elastic net were tuned by the 10-fold cross validation on the two-dimensional parameter surfaces. We repeated the simulations over 100 times and then computed the misclassification error, the sensitivity, and the specificity averagely for each net model on the test datasets.

Table 1 summarizes the simulation results from each regularization net model. In general, our proposed enhanced L _1/2 net model achieved the smallest misclassification errors in Models 1 (9.22%) and 2 (10.76%) compared with the other regularization methods including the old L _1/2 thresholding method (9.85% for Model 1 and 10.83% for Model 2), L ₁ net (11.81% for Model 1 and 13.21% for Model 2), and the Elastic net (13.12% for Model 1 and 14.14% for Model 2). Meanwhile, the enhanced L _1/2 net resulted in the highest sensitivity in Model 1 (98.5%) compared with the other methods. Moreover, the enhanced L _1/2 net obtained the best specificity in Model 2 (98.7%) amongst the other approaches. To sum up, the enhanced L _1/2 net outperforms the other three algorithms in terms of prediction accuracy, sensitivity, and specificity.

Table 1.

Simulation results of the enhanced L _1/2 net, L _1/2 net, L ₁ net, and Elastic net, respectively.

Model	Misclassification errors (%)				Sensitivity (%)				Specificity (%)
Model	Eh_L _1/2	L _1/2	L ₁	Elastic	Eh_L _1/2	L _1/2	L ₁	Elastic	Eh_L _1/2	L _1/2	L ₁	Elastic
1	9.22	9.85	11.81	13.12	0.985	0.971	0.968	0.873	0.969	0.970	0.962	0.981
1	(0.36)	(0.31)	(0.41)	(0.12)	(0.00)	(0.00)	(0.02)	(0.00)	(0.00)	(0.01)	(0.01)	(0.00)

2	10.76	10.83	13.21	14.14	0.939	0.939	0.943	0.835	0.987	0.981	0.987	0.980
2	(0.33)	(0.36)	(0.24)	(0.23)	(0.00)	(0.00)	(0.01)	(0.00)	(0.02)	(0.01)	(0.01)	(0.00)

Open in a new tab

Simulation results (averaged over 100 runs) for comparison of misclassification errors, sensitivity, and specificity used the enhanced L _1/2 net, L _1/2 net, L ₁ net, and the Elastic net, respectively. The standard errors are given in parentheses.

4.2. Analysis of Lung Cancer

In this section, we merged the protein-protein interaction (PPI) network (see http://thebiogrid.org/) with a lung cancer (LC) gene-expression dataset [23] to demonstrate the performance of our proposed enhanced L _1/2 net method. The gene-expression dataset contains the expression profiles of 22284 genes for 107 patients, in which 58 had lung cancer. To test the generalization ability of the proposed method, we divided the dataset into the training set (sample size n = 70; 38 LC, 32 non-LC) which covered 2/3 samples of the dataset and the test set (sample size n = 37; 20 LC, 17 non-LC) which covered the other 1/3 specimens of the dataset. The 10-fold cross validation approach was conducted on the training dataset to tune the regularization parameters. By combining the gene-expression data with the PPI network, the final PPI network includes 8619 genes and 28293 edges.

Figures 1 –4 display the solution paths of the four regularization net methods for the LC dataset in one sample run. Here, x-axis displays the values of the running lambda (the running lambda of L ₁ penalty in the Elastic net approach), and x-axis at the top (degrees of freedom) means the number of nonzero coefficients of beta. y-axis is the values of the coefficients beta which measure the gene importance. The predictive model builds from the training set and then tests its predictive performance on the test set. The detailed results were represented in Table 2.

The solution paths of the enhanced L _1/2 net for the lung cancer dataset in one sample run.

The solution paths of L _1/2 net for the lung cancer dataset in one sample run.

The solution paths of L ₁ net for the lung cancer dataset in one sample run.

The solution paths of the Elastic net for the lung cancer dataset in one sample run.

Table 2.

The results of the enhanced L _1/2 net, L _1/2 net, L ₁ net, and Elastic net on LC dataset, respectively.

	Selected genes	Connected genes	Connected edges	Cross validation error	Test error
Eh_L _1/2 net	171	54	41	6/70	5/37
L _1/2 net	193	61	47	6/70	6/37
L ₁ net	500	150	121	7/70	6/37
Elastic	636	337	510	6/70	6/37

Open in a new tab

Results of analysis of LC gene expression dataset by four procedures, including the number of genes selected, the number of linked PPI network genes, the number of linked PPI network edges, the CV error, and test errors.

As shown in Table 2, the enhanced L _1/2 net selected the fewest number of genes and edges compared to L _1/2 net, L ₁ net, and the Elastic net. Meanwhile, the predictive performance of the enhanced L _1/2 net outperforms the other three regularization net algorithms.

To further evaluate the performance of the enhanced L _1/2 net procedure, we report its capacity of identifying the biomarkers related to lung cancer. NK2 homeobox 1 (Nkx2-1) protein regulates transcription of genes specific for lung. It is used as a biomarker to determine lung cancer in anatomic pathology. It also has a critical role in maintaining lung tumor cells [24, 25]. Epidermal growth factor receptor (EGFR) is known to play a key role in cell proliferation and apoptosis. EGFR overexpression and activity could result in tumor growth and progression [26] and somatic mutations within the tyrosine kinase domain of EGFR, which have been identified in a subset of lung adenocarcinoma [27, 28]. The enhanced L _1/2 net (Figure 5) and L _1/2 net successfully identified these two important biomarkers for LC. However, neither L ₁ net nor the Elastic net selected them both.

Subnetworks identified by the enhanced L _1/2 net for lung cancer datasets (only those genes that are linked on the PPI network are plotted).

Except to identify these two significant biomarkers (EGFR and Nkx2-1), the enhanced L _1/2 net also selected several pathways that were associated with lung cancer. For example, one of the subnetworks includes genes involving molecular proliferation (e.g., genes ARF4, EGFR, DCN, BRCA1, and ITIH5). As these gene express significantly and continuously, it promotes lung cancer progression. On the other hand, this group is linked to ENO1. We are unable to get a clear testimony to sustain this relationship by looking at PPI database. However, a recent report [29] has demonstrated that ENO1 is the promising biomarker that may provide more diagnostic efficacy for lung cancer. This link implies a functional relationship and suggests the important role of ENO1 in lung cancer.

All these results reveal that the enhanced L _1/2 net is more reliable than L _1/2 net, L ₁ net, and the Elastic net approaches for selecting key markers from high-dimensional genomic data. Another advantage of our proposed method is that it has the ability to recognize novel and potential relationships with biologic significance. It is mentionable that our proposed method is inclined to identify fewer but more informative genes (or edges) than L ₁ net and the Elastic net approaches in genomic data and that means the proposed method has allowed the researcher to more easily concentrate on the key targets for functional studies or downstream applications.

5. Conclusions

In biological molecular research, especially for cancer, the analysis of combining biological pathway information with gene-expression data may play an important role to search for new targets for drug design. In this paper, we use the enhanced L _1/2 solver to penalized network-constrained logistic regression model to integrate lung cancer gene-expression with protein-to-protein interaction network. We develop the corresponding coordinate descent algorithm as a novel biomarker selection approach. This algorithm is extremely fast and easy to implement. Both simulation and real genomic data studies showed that the enhanced L _1/2 net is a ranking procedure compared with L _1/2 net (using the old thresholding operator), L ₁ net, and the Elastic net in the selection of biomarker and subnetwork.

We successfully identified several important clinical biomarkers and subnetwork that are driving lung cancer. The proposed method has provided new information to investigators in biological studies and can be the efficient tool for identifying cancer related biomarker and subnetwork.

Supplementary Material

“Sub-networks identified by the L_1 net and the Elastic net for lung cancer datasets (only those genes that are linked on the PPI network are plotted). Nodes colored based on higher (red) to lower (green) coefficients in the model.”

713953.f1.zip^{(282.8KB, zip)}

Acknowledgment

This work was supported by the Macau Science and Technology Development Funds (Grant no. 099/2013/A3) of Macau SAR of China.

Conflict of Interests

The authors declare no conflict of interests.

References

1.Shevade S. K., Keerthi S. S. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics. 2003;19(17):2246–2253. doi: 10.1093/bioinformatics/btg308. [DOI] [PubMed] [Google Scholar]
2.Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Methodological. 1996;58(1):267–288. [Google Scholar]
3.Meier L., van de Geer S., Bühlmann P. The group Lasso for logistic regression. Journal of the Royal Statistical Society. Series B. Statistical Methodology. 2008;70(1):53–71. doi: 10.1111/j.1467-9868.2007.00627.x. [DOI] [Google Scholar]
4.Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101(476):1418–1429. doi: 10.1198/016214506000000735. [DOI] [Google Scholar]
5.Meinshausen N., Yu B. Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics. 2009;37(1):246–270. doi: 10.1214/07-aos582. [DOI] [Google Scholar]
6.Xu Z. B., Zhang H., Wang Y., Chang X. Y., Liang Y. L1/2 regularization. Science in China Series F. 2010;40(3):1–11. [Google Scholar]
7.Xu Z., Chang X., Xu F., Zhang H. L 1/2 regularization: a thresholding representation theory and a fast solver. IEEE Transactions on Neural Networks and Learning Systems. 2012;23(7):1013–1027. doi: 10.1109/tnnls.2012.2197412. [DOI] [PubMed] [Google Scholar]
8.Zeng J., Lin S., Wang Y., et al. L 1/2 regularization: convergence of iterative half thresholding algorithm. IEEE Transactions on Signal Processing. 2014;62(9):2317–2329. [Google Scholar]
9.Stark C., Breitkreutz B.-J., Reguly T., Boucher L., Breitkreutz A., Tyers M. BioGRID: a general repository for interaction datasets. Nucleic acids research. 2006;34(supplement 1):D535–D539. doi: 10.1093/nar/gkj109. [DOI] [PMC free article] [PubMed] [Google Scholar]
10.Chuang H.-Y., Lee E., Liu Y.-T., Lee D., Ideker T. Network-based classification of breast cancer metastasis. Molecular Systems Biology. 2007;3, article 140 doi: 10.1038/msb4100180. [DOI] [PMC free article] [PubMed] [Google Scholar]
11.Li C., Li H. Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics. 2008;24(9):1175–1182. doi: 10.1093/bioinformatics/btn081. [DOI] [PubMed] [Google Scholar]
12.Tian Z., Hwang T., Kuang R. A hypergraph-based learning algorithm for classifying gene expression and arrayCGH data with prior knowledge. Bioinformatics. 2009;25(21):2831–2838. doi: 10.1093/bioinformatics/btp467. [DOI] [PubMed] [Google Scholar]
13.Zhang W., Wan Y.-W., Allen G. I., Pang K., Anderson M. L., Liu Z. Molecular pathway identification using biological network-regularized logistic models. BMC Genomics. 2013;14(8, article S7) doi: 10.1186/1471-2164-14-s8-s7. [DOI] [PMC free article] [PubMed] [Google Scholar]
14.Sun H., Lin W., Feng R., Li H. Network-regularized high dimensional Cox regression for Analysis of Genomic data. Statistica Sinica. 2014;24:1433–1459. doi: 10.5705/ss.2012.317. [DOI] [PMC free article] [PubMed] [Google Scholar]
15.Liang Y., Liu C., Luan X.-Z., et al. Sparse logistic regression with a L1/2 penalty for gene selection in cancer classification. BMC Bioinformatics. 2013;14(1, article 198) doi: 10.1186/1471-2105-14-198. [DOI] [PMC free article] [PubMed] [Google Scholar]
16.Liu C., Liang Y., Luan X.-Z., et al. The L1/2 regularization method for variable selection in the Cox model. Applied Soft Computing Journal. 2014;14:498–503. doi: 10.1016/j.asoc.2013.09.006. [DOI] [Google Scholar]
17.Knight K., Fu W. Asymptotics for Lasso-type estimators. Annals of Statistics. 2000;28(5):1356–1378. doi: 10.1214/aos/1015957397. [DOI] [Google Scholar]
18.Malioutov D., Çetin M., Willsky A. S. A sparse signal reconstruction perspective for source localization with sensor arrays. IEEE Transactions on Signal Processing. 2005;53(8):3010–3022. doi: 10.1109/TSP.2005.850882. [DOI] [Google Scholar]
19.Friedman J., Hastie T., Tibshirani R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software. 2010;33(1):1–22. [PMC free article] [PubMed] [Google Scholar]
20.Fan J., Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96(456):1348–1360. doi: 10.1198/016214501753382273. [DOI] [Google Scholar]
21.Zhang C.-H. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics. 2010;38(2):894–942. doi: 10.1214/09-AOS729. [DOI] [Google Scholar]
22.Breheny P., Huang J. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Annals of Applied Statistics. 2011;5(1):232–253. doi: 10.1214/10-aoas388. [DOI] [PMC free article] [PubMed] [Google Scholar]
23.Landi M. T., Dracheva T., Rotunno M., et al. Gene expression signature of cigarette smoking and its role in lung adenocarcinoma development and survival. PLoS ONE. 2008;3(2) doi: 10.1371/journal.pone.0001651.e1651 [DOI] [PMC free article] [PubMed] [Google Scholar]
24.Li Y., Eggermont K., Vanslembrouck V., Verfaillie C. M. NKX2-1 activation by SMAD2 signaling after definitive endoderm differentiation in human embryonic stem cell. Stem Cells and Development. 2013;22(9):1433–1442. doi: 10.1089/scd.2012.0620. [DOI] [PMC free article] [PubMed] [Google Scholar]
25.Weir B. A., Woo M. S., Getz G., et al. Characterizing the cancer genome in lung adenocarcinoma. Nature. 2007;450(7171):893–898. doi: 10.1038/nature06358. [DOI] [PMC free article] [PubMed] [Google Scholar]
26.Zhang H., Berezov A., Wang Q., et al. ErbB receptors: from oncogenes to targeted cancer therapies. The Journal of Clinical Investigation. 2007;117(8):2051–2058. doi: 10.1172/jci32278. [DOI] [PMC free article] [PubMed] [Google Scholar]
27.Paez J. G., Jänne P. A., Lee J. C., et al. EGFR mutations in lung cancer: correlation with clinical response to gefitinib therapy. Science. 2004;304(5676):1497–1500. doi: 10.1126/science.1099314. [DOI] [PubMed] [Google Scholar]
28.Chowdhuri S. R., Xi L., Pham T. H.-T., et al. EGFR and KRAS mutation analysis in cytologic samples of lung adenocarcinoma enabled by laser capture microdissection. Modern Pathology. 2012;25(4):548–555. doi: 10.1038/modpathol.2011.184. [DOI] [PMC free article] [PubMed] [Google Scholar]
29.Yu L., Shen J., Mannoor K., Guarnera M., Jiang F. Identification of ENO1 as a potential sputum biomarker for early stage lung cancer by shotgun proteomics. Clinical Lung Cancer. 2014;15(5):372–378. doi: 10.1016/j.cllc.2014.05.003. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

713953.f1.zip^{(282.8KB, zip)}

[B1] 1.Shevade S. K., Keerthi S. S. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics. 2003;19(17):2246–2253. doi: 10.1093/bioinformatics/btg308. [DOI] [PubMed] [Google Scholar]

[B2] 2.Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Methodological. 1996;58(1):267–288. [Google Scholar]

[B3] 3.Meier L., van de Geer S., Bühlmann P. The group Lasso for logistic regression. Journal of the Royal Statistical Society. Series B. Statistical Methodology. 2008;70(1):53–71. doi: 10.1111/j.1467-9868.2007.00627.x. [DOI] [Google Scholar]

[B4] 4.Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101(476):1418–1429. doi: 10.1198/016214506000000735. [DOI] [Google Scholar]

[B5] 5.Meinshausen N., Yu B. Lasso-type recovery of sparse representations for high-dimensional data. The Annals of Statistics. 2009;37(1):246–270. doi: 10.1214/07-aos582. [DOI] [Google Scholar]

[B6] 6.Xu Z. B., Zhang H., Wang Y., Chang X. Y., Liang Y. L1/2 regularization. Science in China Series F. 2010;40(3):1–11. [Google Scholar]

[B7] 7.Xu Z., Chang X., Xu F., Zhang H. L 1/2 regularization: a thresholding representation theory and a fast solver. IEEE Transactions on Neural Networks and Learning Systems. 2012;23(7):1013–1027. doi: 10.1109/tnnls.2012.2197412. [DOI] [PubMed] [Google Scholar]

[B8] 8.Zeng J., Lin S., Wang Y., et al. L 1/2 regularization: convergence of iterative half thresholding algorithm. IEEE Transactions on Signal Processing. 2014;62(9):2317–2329. [Google Scholar]

[B9] 9.Stark C., Breitkreutz B.-J., Reguly T., Boucher L., Breitkreutz A., Tyers M. BioGRID: a general repository for interaction datasets. Nucleic acids research. 2006;34(supplement 1):D535–D539. doi: 10.1093/nar/gkj109. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B10] 10.Chuang H.-Y., Lee E., Liu Y.-T., Lee D., Ideker T. Network-based classification of breast cancer metastasis. Molecular Systems Biology. 2007;3, article 140 doi: 10.1038/msb4100180. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B11] 11.Li C., Li H. Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics. 2008;24(9):1175–1182. doi: 10.1093/bioinformatics/btn081. [DOI] [PubMed] [Google Scholar]

[B12] 12.Tian Z., Hwang T., Kuang R. A hypergraph-based learning algorithm for classifying gene expression and arrayCGH data with prior knowledge. Bioinformatics. 2009;25(21):2831–2838. doi: 10.1093/bioinformatics/btp467. [DOI] [PubMed] [Google Scholar]

[B13] 13.Zhang W., Wan Y.-W., Allen G. I., Pang K., Anderson M. L., Liu Z. Molecular pathway identification using biological network-regularized logistic models. BMC Genomics. 2013;14(8, article S7) doi: 10.1186/1471-2164-14-s8-s7. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B14] 14.Sun H., Lin W., Feng R., Li H. Network-regularized high dimensional Cox regression for Analysis of Genomic data. Statistica Sinica. 2014;24:1433–1459. doi: 10.5705/ss.2012.317. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B15] 15.Liang Y., Liu C., Luan X.-Z., et al. Sparse logistic regression with a L1/2 penalty for gene selection in cancer classification. BMC Bioinformatics. 2013;14(1, article 198) doi: 10.1186/1471-2105-14-198. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B16] 16.Liu C., Liang Y., Luan X.-Z., et al. The L1/2 regularization method for variable selection in the Cox model. Applied Soft Computing Journal. 2014;14:498–503. doi: 10.1016/j.asoc.2013.09.006. [DOI] [Google Scholar]

[B17] 17.Knight K., Fu W. Asymptotics for Lasso-type estimators. Annals of Statistics. 2000;28(5):1356–1378. doi: 10.1214/aos/1015957397. [DOI] [Google Scholar]

[B18] 18.Malioutov D., Çetin M., Willsky A. S. A sparse signal reconstruction perspective for source localization with sensor arrays. IEEE Transactions on Signal Processing. 2005;53(8):3010–3022. doi: 10.1109/TSP.2005.850882. [DOI] [Google Scholar]

[B19] 19.Friedman J., Hastie T., Tibshirani R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software. 2010;33(1):1–22. [PMC free article] [PubMed] [Google Scholar]

[B20] 20.Fan J., Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96(456):1348–1360. doi: 10.1198/016214501753382273. [DOI] [Google Scholar]

[B21] 21.Zhang C.-H. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics. 2010;38(2):894–942. doi: 10.1214/09-AOS729. [DOI] [Google Scholar]

[B22] 22.Breheny P., Huang J. Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. Annals of Applied Statistics. 2011;5(1):232–253. doi: 10.1214/10-aoas388. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B23] 23.Landi M. T., Dracheva T., Rotunno M., et al. Gene expression signature of cigarette smoking and its role in lung adenocarcinoma development and survival. PLoS ONE. 2008;3(2) doi: 10.1371/journal.pone.0001651.e1651 [DOI] [PMC free article] [PubMed] [Google Scholar]

[B24] 24.Li Y., Eggermont K., Vanslembrouck V., Verfaillie C. M. NKX2-1 activation by SMAD2 signaling after definitive endoderm differentiation in human embryonic stem cell. Stem Cells and Development. 2013;22(9):1433–1442. doi: 10.1089/scd.2012.0620. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B25] 25.Weir B. A., Woo M. S., Getz G., et al. Characterizing the cancer genome in lung adenocarcinoma. Nature. 2007;450(7171):893–898. doi: 10.1038/nature06358. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B26] 26.Zhang H., Berezov A., Wang Q., et al. ErbB receptors: from oncogenes to targeted cancer therapies. The Journal of Clinical Investigation. 2007;117(8):2051–2058. doi: 10.1172/jci32278. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B27] 27.Paez J. G., Jänne P. A., Lee J. C., et al. EGFR mutations in lung cancer: correlation with clinical response to gefitinib therapy. Science. 2004;304(5676):1497–1500. doi: 10.1126/science.1099314. [DOI] [PubMed] [Google Scholar]

[B28] 28.Chowdhuri S. R., Xi L., Pham T. H.-T., et al. EGFR and KRAS mutation analysis in cytologic samples of lung adenocarcinoma enabled by laser capture microdissection. Modern Pathology. 2012;25(4):548–555. doi: 10.1038/modpathol.2011.184. [DOI] [PMC free article] [PubMed] [Google Scholar]

[B29] 29.Yu L., Shen J., Mannoor K., Guarnera M., Jiang F. Identification of ENO1 as a potential sputum biomarker for early stage lung cancer by shotgun proteomics. Clinical Lung Cancer. 2014;15(5):372–378. doi: 10.1016/j.cllc.2014.05.003. [DOI] [PMC free article] [PubMed] [Google Scholar]

PERMALINK

Network-Based Logistic Classification with an Enhanced L _1/2 Solver Reveals Biomarker and Subnetwork Signatures for Diagnosing Lung Cancer

Hai-Hui Huang

Yong Liang

Xiao-Ying Liu

Abstract

1. Introduction

2. L _1/2 Penalized Network-Constrained Logistic Regression Model

3. A Coordinate Descent Algorithm for the Network-Constrained Logistic Model with the Enhanced L _1/2 Thresholding Operator

Algorithm 1 (the coordinate descent algorithm for L _1/2 penalized network-constrained logistic model). —