Scientific Reports. 2020 Dec 17;10:22125. doi: 10.1038/s41598-020-79028-0

LogSum + L2 penalized logistic regression model for biomarker selection and cancer classification

Xiao-Ying Liu 1, Sheng-Bing Wu 1, Wen-Quan Zeng 1, Zhan-Jiang Yuan 1, Hong-Bo Xu 1
PMCID: PMC7747646  PMID: 33335163

Abstract

Biomarker selection and cancer classification play an important role in knowledge discovery using genomic data. Successful identification of gene biomarkers and biological pathways can significantly improve the accuracy of diagnosis and help machine learning models achieve better performance in classifying different types of cancer. In this paper, we propose a LogSum + L2 penalized logistic regression model and use a coordinate descent algorithm to solve it. The results of simulations and real experiments indicate that the proposed method is highly competitive among several state-of-the-art methods, achieving excellent performance in group feature selection and classification problems.

Subject terms: Computational biology and bioinformatics, Mathematics and computing

Introduction

With the development of DNA microarray technology [1,2], biological researchers can analyze the expression levels of thousands of genes simultaneously. Many studies have shown that microarray data can be used to classify different types of cancer, including how long the incubation period is and which drugs are effective in the diagnosis and treatment process.

From a biological point of view [3], only a small number of genes (biomarkers) strongly indicate the target cancer, while the other genes are unrelated to the disease. Data from unrelated genes may therefore introduce noise and make it harder for machine learning approaches to find the pathogenic genes that cause the disease. Moreover, from a machine learning perspective, the large number of genes (features) combined with few samples in such datasets may cause overfitting [4] and have a negative impact on classification performance. Due to the importance of these issues, effective gene (biomarker) selection methods are needed to help classify different cancer types and improve prediction accuracy.

In recent years, many methods for gene selection in microarray datasets have been developed; they can generally be divided into three categories: filters, wrappers, and embedded methods. Filter methods [5-8] evaluate genes based on discriminative power without considering their regulation correlations with other genes. The main disadvantage of filter methods is that they examine each gene separately, treating genes as independent and thereby ignoring possible combined and grouping effects among genes. This is a common problem with statistical methods such as the t-test, which also examines each gene individually.

Wrapper methods [9-11] utilize feature assessment measures based on learning performance to select subsets of genes. Generally, they can acquire a small number of related genes that notably improve learning ability. In some cases, the results of wrapper methods are better than those of filter methods. However, the main drawback of wrapper methods is their high computational cost.

A third set of feature selection approaches is the embedded methods [12-26], which perform feature selection as part of the learning procedure in a single process. At similar learning performance, embedded methods are computationally more efficient than wrapper approaches. Hence, embedded methods have recently attracted a lot of attention in the literature. Regularization methods are important embedded techniques, which can perform feature selection and model training simultaneously. Many regularization methods have been proposed, such as Lasso [12], SCAD [13], adaptive Lasso [14], MCP [15], Lq (0 < q < 1) [16], L1/2 [17,18], and LogSum [19]. These methods perform well for independent feature selection. When the features are highly correlated, regularization methods that emphasize the grouping effect can be used to select groups of relevant features, such as group Lasso [20], Elastic net [21], Fused Lasso [22], OSCAR [23], adaptive Elastic net [24], SCAD-L2 [25], and L1/2 + L2 [26].

On the other hand, many machine learning models have been used to analyze microarray gene expression data for cancer classification. For example, Furey et al. used support vector machines (SVMs) to classify cell and tissue types [27]. Medjahed et al. applied K-nearest neighbors (K-NN) to the diagnosis and classification of breast cancer [28]. Meanwhile, some researchers used logistic regression with optimization methods for binary cancer classification [29-33]. However, the traditional logistic regression model has two obvious shortcomings:

  1. Feature selection problem.

    All or most of the feature coefficients obtained by fitting the logistic regression model are non-zero, i.e. almost all of the features appear related to the classification target and the solution is not sparse. However, in many practical problems the key factors affecting the model are often only a few. This non-sparseness of logistic models increases computational complexity and hinders the practical interpretation of the model.

  2. Overfitting problem.

    Logistic regression models can often obtain good precision on the training data, but on test data outside the training set, the classification accuracy is often not ideal. In fact, not only logistic regression but many other data analysis models are affected by overfitting. It has become one of the hot research topics in statistics, machine learning, and other fields.

In recent years, there has been growing interest in applying regularization techniques to logistic regression models to address the above two shortcomings. For example, Tibshirani and Friedman et al. [34,35] proposed sparse logistic regression based on the Lasso regularization and coordinate descent methods. Algamal et al. [36,37] proposed the adaptive Lasso and the adjusted adaptive elastic net for gene selection in high-dimensional cancer classification. Like sparse logistic regression with the L1 regularization, Cawley and Talbot [30] investigated sparse logistic regression with Bayesian regularization. Liang et al. [38] investigated the sparse logistic regression model with the L1/2 penalty for gene selection in cancer classification.

Inspired by the methods mentioned above, in this paper we propose a LogSum + L2 penalized logistic regression model. The main contributions of this paper are as follows.

  1. Our proposed method can not only select sparse features (biomarkers), but also identify groups of relevant features (gene pathways). A coordinate descent algorithm is used to solve the LogSum + L2 penalized logistic regression model.

  2. We also evaluate the capability of our proposed method and compare its performance with other regularization methods. The results of simulations and real experiments indicate that the proposed method is highly competitive among several state-of-the-art methods.

The rest of this paper is organized as follows. The "Related works" section introduces the related work. The "Methods" section presents the LogSum + L2 penalized logistic regression model and its optimization algorithm. The "Experimental results and discussion" section analyzes the results on simulated and real data. The "Discussion and conclusion" section concludes this paper.

Related works

Sparse penalized logistic regression

We focus on binary classification using logistic regression (LR), a statistical method for modeling a binary classification problem. Suppose we have n samples and p genes. X and y are the gene expression matrix and the dependent variable, respectively: x_ij denotes the value of gene j for the ith sample, and y_i is the corresponding variable taking a value of 0 or 1, where y_i = 0 indicates that the ith sample is in Class 1 and y_i = 1 indicates that it is in Class 2. We define a classifier f(x) = e^x / (1 + e^x) such that for any input x with class label y, f(x) predicts y correctly. The LR model is given as follows:

P(y_i = 1 \mid X_i) = f(X_i\beta) = \frac{e^{X_i\beta}}{1 + e^{X_i\beta}}    (1)

In Eq. (1), \beta = (\beta_0, \beta_1, \dots, \beta_p) are the coefficients to be estimated; note that \beta_0 is the intercept. The negative log-likelihood function corresponding to Eq. (1) is defined as:

l(\beta) = -\sum_{i=1}^{n} \big\{ y_i \log[f(X_i\beta)] + (1 - y_i)\log[1 - f(X_i\beta)] \big\}    (2)

We obtain the coefficients \beta by minimizing Eq. (2). In cancer classification problems with high-dimensional, low-sample-size data (p \gg n), directly solving the logistic model (2) leads to overfitting. To address this problem, we add a regularization term to (2); the sparse logistic regression can then be modeled as:

\beta = \arg\min \Big\{ l(\beta) + \lambda \sum_{j=1}^{p} p(\beta_j) \Big\}    (3)

where l(β) is the loss function, p(β) is the penalty function, and λ>0 is a control parameter.
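To make the objective in Eq. (3) concrete, here is a minimal Python sketch of the penalized negative log-likelihood. The function names, and the convention that the first column of X is an all-ones intercept column (so the intercept is left unpenalized), are our own assumptions, not part of the paper.

```python
import numpy as np

def logistic(z):
    # f(z) = e^z / (1 + e^z)
    return 1.0 / (1.0 + np.exp(-z))

def penalized_loss(beta, X, y, penalty, lam):
    """Eq. (3): negative log-likelihood l(beta) plus lambda * sum_j p(beta_j).

    Assumes X[:, 0] is an all-ones intercept column, so beta[0] is unpenalized.
    """
    p_hat = logistic(X @ beta)
    guard = 1e-12  # avoid log(0) at saturated predictions
    nll = -np.sum(y * np.log(p_hat + guard) + (1 - y) * np.log(1 - p_hat + guard))
    return nll + lam * np.sum(penalty(beta[1:]))
```

For example, `penalized_loss(beta, X, y, np.abs, 0.1)` evaluates the Lasso-penalized loss with \lambda = 0.1.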

A coordinate descent algorithm for different thresholding operators

The coordinate descent algorithm is a "one-at-a-time" approach [40]. Before considering the coordinate descent algorithm for the nonlinear logistic regularization, we first introduce the linear regression case. The objective function of linear regression is as follows:

\min \frac{1}{2n}\,\|y - X\beta\|^2 + P_\lambda(\beta)    (4)

where y = (y_1, \dots, y_n)^T is the vector of n response variables, X_i = (x_{i1}, x_{i2}, \dots, x_{ip}) is the ith input variable with dimensionality p, and y_i is the corresponding response. \|\cdot\| denotes the L2-norm.

The "one-at-a-time" coordinate descent algorithm solves for \beta_j while the other coefficients \beta_{k\neq j} (the coefficients that remain after the jth element \beta_j is removed) are held fixed. Equation (4) can then be rewritten as:

R(\beta) = \arg\min \frac{1}{2n} \sum_{i=1}^{n}\Big(y_i - \sum_{k\neq j} x_{ik}\beta_k - x_{ij}\beta_j\Big)^2 + \lambda\sum_{k\neq j} P(\beta_k) + \lambda P(\beta_j)    (5)

In Eq. (5), k indexes the features other than the jth feature.

The first order derivative at βj can be estimated as:

\frac{\partial R}{\partial \beta_j} = \sum_{i=1}^{n} -x_{ij}\Big(y_i - \sum_{k\neq j} x_{ik}\beta_k - x_{ij}\beta_j\Big) + \lambda P'(\beta_j) = 0    (6)

We define \tilde{y}_i^{(j)} = \sum_{k\neq j} x_{ik}\beta_k as the part of the fit excluding \beta_j, \tilde{r}_i^{(j)} = y_i - \tilde{y}_i^{(j)}, and w_j = \sum_{i=1}^{n} x_{ij}\tilde{r}_i^{(j)}, where \tilde{r}_i^{(j)} represents the partial residual with respect to the jth feature.

To account for the correlation among features, the Elastic Net (LEN) [21] was proposed, which emphasizes a grouping effect. The LEN penalty function is given as follows:

P(\beta) = (1 - a)\,\frac{1}{2}\|\beta\|_{L_2}^2 + a\,\|\beta\|_{L_1}    (7)

The LEN penalty is a combination of the L1 penalty (a = 1) and the ridge penalty (a = 0). With this penalty, Eq. (6) is rewritten as follows:

\frac{\partial R}{\partial \beta_j} = \sum_{i=1}^{n} -x_{ij}\Big(y_i - \sum_{k\neq j} x_{ik}\beta_k - x_{ij}\beta_j\Big) + \lambda(1 - a)\beta_j + \lambda a = 0    (8)

Zou and Hastie proposed the univariate solution [21] for a LEN penalized regression coefficient as follows:

\beta_j = f_{LEN}(w_j, \lambda, a) = \frac{S(w_j, \lambda a)}{1 + \lambda(1 - a)}    (9)

where S(w_j, \lambda a) is the soft-thresholding operator; when a = 1 it reduces to the operator for the pure L1 penalty. Eq. (9) can be divided into three cases as follows:

\beta_j = \mathrm{Soft}(w_j, \lambda) = \begin{cases} w_j + \lambda & \text{if } w_j < -\lambda \\ w_j - \lambda & \text{if } w_j > \lambda \\ 0 & \text{if } -\lambda \le w_j \le \lambda \end{cases}    (10)
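Equations (9) and (10) translate directly into code. A minimal Python sketch (the function names are ours):

```python
import numpy as np

def soft_threshold(w, lam):
    # Eq. (10): S(w, lambda), shrink w toward zero by lambda
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def f_len(w, lam, a):
    # Eq. (9): soft thresholding at level lambda*a, then ridge shrinkage
    return soft_threshold(w, lam * a) / (1.0 + lam * (1.0 - a))
```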

Fan and Li proposed the SCAD penalty [13], which produces a sparse set of solutions and approximately unbiased estimates for large coefficients. Its penalty function is as follows:

p_{\lambda,a}(\beta) = \begin{cases} \lambda|\beta| & \text{if } |\beta| \le \lambda \\ \dfrac{a\lambda|\beta| - \frac{1}{2}(\beta^2 + \lambda^2)}{a - 1} & \text{if } \lambda < |\beta| < a\lambda \\ \dfrac{\lambda^2(a^2 - 1)}{2(a - 1)} & \text{if } |\beta| \ge a\lambda \end{cases}    (11)

Additionally, the SCAD thresholding operator is given as follows:

\beta_j = f_{SCAD}(w_j, \lambda, a) = \begin{cases} S(w_j, \lambda) & \text{if } |w_j| \le 2\lambda \\ \dfrac{S\!\big(w_j, \frac{a\lambda}{a-1}\big)}{1 - \frac{1}{a-1}} & \text{if } 2\lambda < |w_j| \le a\lambda \\ w_j & \text{if } |w_j| > a\lambda \end{cases}    (12)

Like the SCAD penalty, Zhang proposed the minimax concave penalty (MCP) [15]. Its penalty function is:

p_{\lambda,\gamma}(\beta) = \begin{cases} \lambda|\beta| - \dfrac{\beta^2}{2\gamma} & \text{if } |\beta| \le \gamma\lambda \\ \dfrac{1}{2}\gamma\lambda^2 & \text{if } |\beta| > \gamma\lambda \end{cases}    (13)

And the MCP thresholding operator is given as follows:

\beta_j = f_{MCP}(w_j, \lambda, \gamma) = \begin{cases} \dfrac{S(w_j, \lambda)}{1 - \frac{1}{\gamma}} & \text{if } |w_j| \le \gamma\lambda \\ w_j & \text{if } |w_j| > \gamma\lambda \end{cases}    (14)

In Eq. (14), \gamma > 1 is an empirical tuning parameter. Minimal sketches of the SCAD and MCP operators are given below.
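The following Python sketch implements Eqs. (12) and (14), repeating `soft_threshold` for self-containment. The default a = 3.7 is the conventional SCAD choice, an assumption on our part rather than a value stated in the paper.

```python
import numpy as np

def soft_threshold(w, lam):
    # S(w, lambda) from Eq. (10)
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

def f_scad(w, lam, a=3.7):
    # Eq. (12): SCAD thresholding operator (a > 2)
    if abs(w) <= 2 * lam:
        return soft_threshold(w, lam)
    if abs(w) <= a * lam:
        return soft_threshold(w, a * lam / (a - 1)) / (1.0 - 1.0 / (a - 1))
    return w

def f_mcp(w, lam, gamma):
    # Eq. (14): MCP operator; gamma > 1 keeps the factor 1 - 1/gamma positive
    if abs(w) <= gamma * lam:
        return soft_threshold(w, lam) / (1.0 - 1.0 / gamma)
    return w
```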

Xu et al. proposed the L1/2 regularization [17], whose penalized problem can be written as:

\min \frac{1}{2n}\,\|y - X\beta\|^2 + \lambda\sum_{j=1}^{p}|\beta_j|^{1/2}    (15)

The univariate half thresholding operator for an L1/2 penalized linear regression coefficient is then given as follows:

\beta_j = \mathrm{Half}(w_j, \lambda) = \begin{cases} \dfrac{2}{3}w_j\Big(1 + \cos\dfrac{2(\pi - \phi_\lambda(w_j))}{3}\Big) & \text{if } |w_j| > \dfrac{3}{4}\lambda^{2/3} \\ 0 & \text{otherwise} \end{cases}    (16)

In Eq. (16), \phi_\lambda(w) = \arccos\Big(\dfrac{\lambda}{8}\big(\dfrac{|w|}{3}\big)^{-3/2}\Big).

To account for the correlation of genes, Huang et al. proposed the HLR regularization [26]; Eq. (15) is extended to:

\min \frac{1}{2n}\,\|y - X\beta\|^2 + \lambda\sum_{j=1}^{p}\Big[a|\beta_j|^{1/2} + (1 - a)\beta_j^2\Big]    (17)

And the univariate half thresholding operator for the HLR penalized linear regression coefficient is as follows:

\beta_j = \mathrm{HLR}(w_j, \lambda) = \frac{\mathrm{Half}(w_j, \lambda a)}{1 + \lambda(1 - a)}    (18)
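A minimal sketch of Eqs. (16) and (18) follows. The arccos form of \phi_\lambda is our reconstruction of the garbled formula from the L1/2 thresholding literature, so treat it as an assumption.

```python
import numpy as np

def half_threshold(w, lam):
    # Eq. (16): univariate L1/2 half thresholding operator
    if abs(w) <= 0.75 * lam ** (2.0 / 3.0):
        return 0.0
    phi = np.arccos((lam / 8.0) * (abs(w) / 3.0) ** (-1.5))
    return (2.0 / 3.0) * w * (1.0 + np.cos(2.0 * (np.pi - phi) / 3.0))

def f_hlr(w, lam, a):
    # Eq. (18): half thresholding at level lambda*a, then ridge shrinkage
    return half_threshold(w, lam * a) / (1.0 + lam * (1.0 - a))
```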

Theoretically, the L0 regularization produces the sparsest solutions, but solving it is an NP-hard problem. Therefore, Candes et al. [19] proposed the LogSum penalty, which approximates the L0 regularization much more closely. The penalized problem with the LogSum regularization can be written as follows:

\min \frac{1}{2n}\,\|y - X\beta\|^2 + \lambda\sum_{j=1}^{p}\log(|\beta_j| + \varepsilon)    (19)

where \varepsilon > 0 should be set arbitrarily small to make the LogSum penalty closely resemble the L0-norm. Equation (19) has a local minimum [39]:

f_{LogSum}(w_j, \lambda, \varepsilon) = D(w_j, \lambda, \varepsilon) = \begin{cases} \mathrm{sign}(w_j)\,\dfrac{c_1 + \sqrt{c_2}}{2} & \text{if } c_2 > 0 \\ 0 & \text{if } c_2 \le 0 \end{cases}    (20)

where \lambda > 0, 0 < \varepsilon < \lambda, c_1 = |w_j| - \varepsilon, and c_2 = c_1^2 - 4(\lambda - |w_j|\varepsilon).
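Eq. (20) is a short closed form; a direct Python transcription (our naming):

```python
import numpy as np

def f_logsum(w, lam, eps):
    # Eq. (20): c1 = |w| - eps, c2 = c1^2 - 4*(lambda - |w|*eps)
    c1 = abs(w) - eps
    c2 = c1 ** 2 - 4.0 * (lam - abs(w) * eps)
    if c2 > 0:
        return np.sign(w) * (c1 + np.sqrt(c2)) / 2.0
    return 0.0
```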

Methods

LogSum + L2 penalized logistic regression model

In this paper, we propose the LogSum + L2 penalized logistic regression model for feature group selection. The LogSum + L2 penalized problem can be written as follows:

\min \frac{1}{2n}\,\|y - X\beta\|^2 + \lambda\sum_{j=1}^{p}\Big[\lambda_1\log(|\beta_j| + \varepsilon) + \lambda_2\beta_j^2\Big]    (21)

where \|y - X\beta\|^2 is the loss function, (y, X) is a data set, \varepsilon > 0 is a constant, and \lambda > 0, \lambda_1 \ge 0, \lambda_2 \ge 0 are regularization parameters that control the complexity of the penalty function.

Figure 1 shows two-dimensional contour plots of the penalty functions of the L1, LEN, HLR, and LogSum + L2 approaches. It demonstrates that the LogSum + L2 penalty is non-convex for the given parameters \lambda_1 and \lambda_2 in Eq. (21).

Figure 1. Contour plots (two-dimensional) for the regularization methods.

The LogSum + L2 thresholding operator is given as follows:

\beta_j = f_{LogSum+L_2}(w_j, \lambda, \varepsilon) = \begin{cases} \mathrm{sign}(w_j)\,\dfrac{|w_j| - (1+2\lambda_2)\varepsilon + \sqrt{\big(|w_j| + (1+2\lambda_2)\varepsilon\big)^2 - 4\lambda_1(1+2\lambda_2)}}{2(1+2\lambda_2)} & \text{if } |w_j| > \lambda \\ 0 & \text{otherwise} \end{cases}    (22)

where \lambda = 2\sqrt{\lambda_1(1+2\lambda_2)} - (1+2\lambda_2)\varepsilon and \lambda_1 + \lambda_2 = 1.
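A minimal Python sketch of the operator in Eq. (22) (our naming); note that the discriminant is guaranteed positive whenever |w| exceeds the threshold:

```python
import numpy as np

def f_logsum_l2(w, lam1, lam2, eps):
    # Eq. (22): LogSum + L2 thresholding operator, with lam1 + lam2 = 1
    c = 1.0 + 2.0 * lam2
    threshold = 2.0 * np.sqrt(lam1 * c) - c * eps   # lambda in Eq. (22)
    if abs(w) <= threshold:
        return 0.0
    disc = (abs(w) + c * eps) ** 2 - 4.0 * lam1 * c  # > 0 above the threshold
    return np.sign(w) * (abs(w) - c * eps + np.sqrt(disc)) / (2.0 * c)
```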

The proof of Eq. (22) is given as follows:

Consider a regression model of the following form:

y = X\beta + e    (23)

where the response y \in R^n, the predictors X = (x_1, x_2, \dots, x_p) \in R^{n\times p}, and the error terms e = (e_1, e_2, \dots, e_n) are i.i.d. with mean 0 and variance \sigma^2.

The LogSum + L2 regularization can be expressed as:

l_{LogSum+L_2}(\beta; \lambda_1, \lambda_2) = \frac{1}{2}\|y - X\beta\|^2 + \sum_{j=1}^{p}\Big[\lambda_1\log(|\beta_j| + \varepsilon) + \lambda_2\beta_j^2\Big]    (24)

Its first partial derivative with respect to \beta_k is given as follows:

\frac{\partial l_{LogSum+L_2}}{\partial \beta_k} = \sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\Big)(-x_{ik}) + \frac{\lambda_1\,\mathrm{sign}(\beta_k)}{|\beta_k| + \varepsilon} + 2\lambda_2\beta_k
= \sum_{i=1}^{n}\Big(y_i - \sum_{j\neq k} x_{ij}\beta_j - x_{ik}\beta_k\Big)(-x_{ik}) + \frac{\lambda_1\,\mathrm{sign}(\beta_k)}{|\beta_k| + \varepsilon} + 2\lambda_2\beta_k
= \sum_{i=1}^{n}\Big(y_i - \sum_{j\neq k} x_{ij}\beta_j\Big)(-x_{ik}) + \sum_{i=1}^{n} x_{ik}^2\,\beta_k + \frac{\lambda_1\,\mathrm{sign}(\beta_k)}{|\beta_k| + \varepsilon} + 2\lambda_2\beta_k
= \sum_{i=1}^{n}\Big(y_i - \sum_{j\neq k} x_{ij}\beta_j\Big)(-x_{ik}) + \beta_k + \frac{\lambda_1\,\mathrm{sign}(\beta_k)}{|\beta_k| + \varepsilon} + 2\lambda_2\beta_k    (25)

The last step in Eq. (25) uses the condition that the design matrix X is orthonormal, so \sum_{i=1}^{n} x_{ik}^2 = 1. By setting the first partial derivative equal to zero, we obtain the kth element \hat\beta_k of the estimator.

We first consider the case \beta_k > 0. Let r_i^{(k)} = y_i - \sum_{j\neq k} x_{ij}\beta_j and w_k = \sum_{i=1}^{n} r_i^{(k)} x_{ik}. Setting the first partial derivative \frac{\partial l_{LogSum+L_2}}{\partial \beta_k} = 0, we have:

-w_k + \beta_k + \frac{\lambda_1}{\beta_k + \varepsilon} + 2\lambda_2\beta_k = 0    (26)

and Eq. (26) is equivalent to:

(1 + 2\lambda_2)\beta_k^2 - \big[w_k - (1+2\lambda_2)\varepsilon\big]\beta_k - w_k\varepsilon + \lambda_1 = 0    (27)

Let

\Delta = \big[w_k - (1+2\lambda_2)\varepsilon\big]^2 - 4(1+2\lambda_2)(\lambda_1 - w_k\varepsilon) = \big[w_k + (1+2\lambda_2)\varepsilon\big]^2 - 4\lambda_1(1+2\lambda_2)

We discuss the solutions of Eq. (27) according to the value of Δ.

  1. If \Delta < 0, Eq. (27) has no real root.

  2. If \Delta = 0, Eq. (27) has the unique root \hat\beta_k = \dfrac{w_k - (1+2\lambda_2)\varepsilon}{2(1+2\lambda_2)}.

  3. If \Delta > 0, Eq. (27) has two roots; in this case
    \big(w_k + (1+2\lambda_2)\varepsilon\big)^2 > 4\lambda_1(1+2\lambda_2),
    i.e., w_k + (1+2\lambda_2)\varepsilon > 2\sqrt{\lambda_1(1+2\lambda_2)}.

Therefore, when w_k \ge 2\sqrt{\lambda_1(1+2\lambda_2)} - (1+2\lambda_2)\varepsilon, we take the larger root as the estimator:

\hat\beta_k = \frac{w_k - (1+2\lambda_2)\varepsilon + \sqrt{\big(w_k + (1+2\lambda_2)\varepsilon\big)^2 - 4\lambda_1(1+2\lambda_2)}}{2(1+2\lambda_2)}    (28)

For \beta_k < 0, the estimator is obtained in a similar way. Finally, we obtain the thresholding function of the LogSum + L2 regularization as in Eq. (22).

Following the different thresholding operators, we also discuss three properties that a coefficient estimator should satisfy, as shown in Fig. 2:

  1. Unbiasedness: the resulting estimator is nearly unbiased when the true unknown parameter is large, to avoid unnecessary modeling bias;

  2. Sparsity: the resulting estimator is a thresholding rule, which automatically sets small estimated coefficients to zero to reduce model complexity;

  3. Continuity: the resulting estimator is continuous to avoid instability in model prediction.

Figure 2. Exact solution of (a) L1, (b) LEN, (c) HLR, and (d) LogSum + L2 in an orthogonal design.

Figure 2 shows the four regularization methods (L1, LEN, HLR, and LogSum + L2 penalties) with an orthogonal design matrix in the regression model. The estimators of L1 and LEN are biased, whereas the HLR penalty is asymptotically unbiased. Similar to the HLR method, the LogSum + L2 approach also performs better than L1 and LEN in terms of unbiasedness. All four regularization methods fulfil the requirements of sparsity and continuity.

A coordinate descent algorithm for the LogSum + L2 model

Inspired by Liang et al. [38], Eq. (3) is linearized by a one-term Taylor series expansion:

L(\beta, \lambda) \approx \frac{1}{2n}\sum_{i=1}^{n} W_i\,(Z_i - X_i\beta)^2 + \lambda\sum_{j=1}^{p}\Big[\lambda_1\log(|\beta_j| + \varepsilon) + \lambda_2\beta_j^2\Big]    (30)

where \varepsilon > 0, Z_i = X_i\tilde\beta + \frac{Y_i - f(X_i\tilde\beta)}{f(X_i\tilde\beta)(1 - f(X_i\tilde\beta))} is the estimated response, W_i = f(X_i\tilde\beta)(1 - f(X_i\tilde\beta)) is the weight, and f(X_i\tilde\beta) = \frac{\exp(X_i\tilde\beta)}{1 + \exp(X_i\tilde\beta)}. Redefine the partial residual for fitting the current \tilde\beta_j as \tilde{Z}_i^{(j)} = \sum_{k\neq j} x_{ik}\tilde\beta_k and w_j = \sum_{i=1}^{n} W_i x_{ij}(Z_i - \tilde{Z}_i^{(j)}). Pseudocode of the coordinate descent algorithm for the LogSum + L2 penalized logistic regression model is shown in Algorithm 1 (Fig. 3); a minimal sketch follows.
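The sketch below is our re-implementation of Algorithm 1 under this reading of Eq. (30), not the authors' released code. It assumes `f_logsum_l2` from the sketch after Eq. (22) is in scope, and normalizes w_j by the weighted norm of the jth column, a factor that disappears under the paper's orthonormal-design assumption.

```python
import numpy as np

def cd_logsum_l2(X, y, lam1, lam2, eps=1e-3, max_iter=100, tol=1e-4):
    """Coordinate descent for LogSum + L2 penalized logistic regression.

    Outer loop: IRLS quadratic approximation of the logistic loss (Eq. (30)).
    Inner loop: each beta_j is refit with the LogSum + L2 operator (Eq. (22)).
    """
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(max_iter):
        beta_old = beta.copy()
        eta = X @ beta
        f = 1.0 / (1.0 + np.exp(-eta))            # f(X beta~)
        W = np.clip(f * (1.0 - f), 1e-5, None)    # IRLS weights W_i
        Z = eta + (y - f) / W                     # working response Z_i
        for j in range(p):
            # partial residual Z_i - Z~_i^(j); recomputed for clarity
            r = Z - (X @ beta - X[:, j] * beta[j])
            w_j = np.sum(W * X[:, j] * r) / np.sum(W * X[:, j] ** 2)
            beta[j] = f_logsum_l2(w_j, lam1, lam2, eps)
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta
```

In practice one would maintain a running residual instead of recomputing X @ beta inside the inner loop; the version above trades speed for readability.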

Figure 3. Flowchart of the coordinate descent algorithm for the LogSum + L2 penalized logistic regression model.

Experimental results and discussion

Analysis of simulated data

In this section, we analyze the performance of the proposed method (the LogSum + L2 penalized logistic regression model) by simulation. We compare the proposed method with three other methods: logistic regression with the L1, LEN, and HLR regularizations. We simulate data from the true model:

\log\frac{y}{1-y} = X\beta + \sigma\varepsilon, \quad \varepsilon \sim N(0, 1)

where X \sim N(0, 1), \varepsilon is an independent random error, and \sigma is a parameter that controls the signal-to-noise ratio. Two scenarios are presented here; in each, the dimension of features is 1000. The details of the two scenarios are as follows.

  1. In Scenario 1, the dataset consists of 200 observations, we set \sigma = 0.3, and we simulate the group feature situation:
    \beta = (\underbrace{2, 2, 2, 2, 2}_{5}, \underbrace{0, \dots, 0}_{995});
    x_i = \rho \times x_1 + (1 - \rho) \times x_i, \quad i = 2, 3, 4, 5;
    where \rho is the correlation coefficient of the group features.

    In this example, there is one set of related features. An ideal sparse regression method should select the 5 real features and set the other 995 noise features to zero.

  2. In Scenario 2, we set \sigma = 0.4, the dataset consists of 400 observations, and two feature groups are defined:
    \beta = (\underbrace{2, 2, 2, 2, 2, 1.5, -2, 1.7, 3, -2.5}_{10}, \underbrace{3, \dots, 3}_{10}, \underbrace{0, \dots, 0}_{980});
    x_i = \rho \times x_1 + (1 - \rho) \times x_i, \quad i = 2, 3, \dots, 10;
    x_i = \rho \times x_{11} + (1 - \rho) \times x_i, \quad i = 12, 13, \dots, 20;

In this example, there are two sets of related group features. An ideal penalized logistic regression method should select the 20 real features and set the other 980 noise features to zero. A minimal sketch of this data-generating process is given below.
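The following Python sketch generates Scenario 1 data under our reading of the true model; since \log\frac{y}{1-y} = X\beta + \sigma\varepsilon yields continuous probabilities, we sample Bernoulli class labels from them, which is an assumption on our part.

```python
import numpy as np

def simulate_scenario1(n=200, p=1000, rho=0.2, sigma=0.3, seed=0):
    # Scenario 1: features 2..5 correlated with feature 1; 995 noise features
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    for i in range(1, 5):
        X[:, i] = rho * X[:, 0] + (1 - rho) * X[:, i]
    beta = np.zeros(p)
    beta[:5] = 2.0
    logit = X @ beta + sigma * rng.standard_normal(n)
    prob = 1.0 / (1.0 + np.exp(-logit))
    y = rng.binomial(1, prob)   # Bernoulli labels: our reading of the model
    return X, y, beta
```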

In this experiment, we set the feature correlation coefficient \rho to 0.2 and 0.6, respectively, to observe test accuracy under different correlation levels. The L1 and LEN approaches were executed with Glmnet (http://web.stanford.edu/~hastie/glmnet_matlab/, MATLAB version 2014-a). We use ten-fold cross-validation (CV) to optimize the regularization (tuning) parameters, which balance the tradeoff between data fit and model complexity, for the L1, LEN, HLR, and LogSum + L2 approaches.

At the beginning, we randomly divided the datasets into training and test sets. In our experiments, approximately 70% of the samples are used as training sets, and the rest as test sets. We repeated the simulations 30 times for each penalty method and computed the mean classification accuracy, sensitivity, and specificity on the training and test datasets, respectively. To evaluate the quality of the selected features for the regularization approaches, the sensitivity and specificity of the feature selection performance [39] are defined as follows:

\mathrm{TN} := \|\bar\beta \cdot \bar{\hat\beta}\|_0, \quad \mathrm{FP} := \|\bar\beta \cdot \hat\beta\|_0
\mathrm{FN} := \|\beta \cdot \bar{\hat\beta}\|_0, \quad \mathrm{TP} := \|\beta \cdot \hat\beta\|_0
\beta\text{-Sensitivity} := \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \quad \beta\text{-Specificity} := \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}}

where \cdot is the element-wise product, \|\cdot\|_0 counts the number of non-zero elements in a vector, and \bar\beta and \bar{\hat\beta} apply the logical "not" operator to the vectors \beta and \hat\beta (marking their zero positions). A minimal sketch of these selection metrics is given below.
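The definitions above reduce to counting (non)zero positions of the estimate against the truth, as in this Python sketch (our naming):

```python
import numpy as np

def beta_selection_metrics(beta_true, beta_hat):
    # TP/FN/TN/FP count (non)zero positions of beta_hat against beta_true,
    # mirroring the element-wise-product definitions above
    true_nz = beta_true != 0
    est_nz = beta_hat != 0
    tp = np.sum(true_nz & est_nz)
    fn = np.sum(true_nz & ~est_nz)
    tn = np.sum(~true_nz & ~est_nz)
    fp = np.sum(~true_nz & est_nz)
    return tp / (tp + fn), tn / (tn + fp)
```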

The training results of the different methods on the simulated datasets are reported in Table 1. As can be seen, for all scenarios, our proposed LogSum + L2 procedure generally achieves higher or comparable classification performance relative to the L1, LEN, and HLR methods. For example, in Scenario 1 with ρ = 0.6, our proposed method attained 97.86% accuracy, 95.38% sensitivity, and 100% specificity, with accuracy and sensitivity more than 6% higher than those of the other methods. Moreover, in both Scenario 1 and Scenario 2, and for both ρ = 0.2 and ρ = 0.6, the LogSum + L2 method always shows the highest training accuracy. In summary, across the different scenarios and values of ρ, the LogSum + L2 penalized logistic regression model is always the best.

Table 1.

Training results of different methods on the simulated datasets.

ρ = 0.2
Method     Scenario  Accuracy        Sensitivity     Specificity     AUC
L1         1         90.00% (1.85%)  91.30% (1.12%)  88.73% (2.14%)  97.12% (0.53%)
L1         2         98.78% (0.37%)  99.82% (0.01%)  97.45% (0.49%)  98.12% (0.32%)
LEN        1         87.14% (2.28%)  86.96% (2.97%)  87.32% (3.17%)  95.08% (0.93%)
LEN        2         99.16% (0.12%)  99.78% (0.03%)  97.33% (0.52%)  97.89% (0.35%)
HLR        1         94.29% (0.59%)  95.65% (0.62%)  92.96% (1.26%)  98.84% (0.21%)
HLR        2         98.65% (0.35%)  99.82% (0.01%)  98.31% (0.38%)  98.53% (0.33%)
LogSum+L2  1         100% (0%)       100% (0%)       100% (0%)       100% (0%)
LogSum+L2  2         99.50% (0.01%)  99.96% (0.01%)  99.42% (0.01%)  99.27% (0.06%)

ρ = 0.6
Method     Scenario  Accuracy        Sensitivity     Specificity     AUC
L1         1         91.43% (1.35%)  87.69% (2.61%)  94.67% (0.92%)  97.37% (0.31%)
L1         2         98.65% (0.26%)  98.76% (0.13%)  97.24% (0.29%)  98.16% (0.28%)
LEN        1         85.71% (2.03%)  69.23% (2.84%)  100% (0%)       96.04% (0.48%)
LEN        2         97.76% (0.31%)  97.84% (0.20%)  98.86% (0.22%)  98.07% (0.34%)
HLR        1         90.71% (1.76%)  87.69% (1.48%)  93.33% (0.81%)  97.58% (0.40%)
HLR        2         98.65% (0.23%)  99.12% (0.04%)  98.21% (0.26%)  98.54% (0.32%)
LogSum+L2  1         97.86% (0.21%)  95.38% (0.62%)  100% (0%)       100% (0%)
LogSum+L2  2         99.23% (0.02%)  99.30% (0.02%)  99.10% (0.02%)  98.97% (0.09%)
Numbers in parentheses are the standard deviations and the best results are highlighted in bold.

Table 2 shows the test results of the different methods on the simulated datasets. The performance of the LogSum + L2 penalized logistic regression model is still the best among the four methods. In Scenario 1, the LogSum + L2 approach shows similar values for ρ = 0.2 and ρ = 0.6, whereas in Scenario 2 its sensitivity differs considerably between the two ρ values, while its accuracy and specificity change much less.

Table 2.

Test results of different methods on the simulated datasets.

ρ = 0.2
Method     Scenario  Accuracy        Sensitivity     Specificity     AUC
L1         1         75.00% (3.82%)  78.31% (2.83%)  74.19% (3.64%)  86.76% (1.63%)
L1         2         71.67% (3.19%)  78.57% (2.88%)  63.50% (4.31%)  80.80% (2.37%)
LEN        1         78.33% (3.15%)  79.54% (2.68%)  87.10% (2.63%)  84.43% (1.87%)
LEN        2         66.67% (4.18%)  75.00% (3.07%)  53.13% (6.16%)  76.23% (3.49%)
HLR        1         80.00% (1.86%)  79.63% (2.66%)  87.10% (2.52%)  88.65% (1.31%)
HLR        2         65.00% (4.03%)  71.43% (3.92%)  58.76% (5.83%)  77.23% (3.46%)
LogSum+L2  1         85.00% (1.55%)  86.21% (1.43%)  89.87% (1.96%)  93.99% (0.62%)
LogSum+L2  2         76.67% (3.17%)  81.00% (3.17%)  78.13% (3.05%)  83.82% (2.29%)

ρ = 0.6
Method     Scenario  Accuracy        Sensitivity     Specificity     AUC
L1         1         68.33% (4.01%)  62.07% (4.65%)  70.97% (3.62%)  82.76% (1.94%)
L1         2         58.33% (4.92%)  59.09% (4.83%)  57.89% (5.52%)  65.67% (4.46%)
LEN        1         71.67% (3.61%)  55.17% (5.18%)  77.42% (2.95%)  81.76% (2.43%)
LEN        2         56.67% (5.32%)  63.64% (4.78%)  44.74% (8.03%)  59.93% (5.04%)
HLR        1         73.33% (3.33%)  58.62% (4.96%)  80.65% (2.31%)  86.76% (1.88%)
HLR        2         55.00% (5.57%)  59.09% (5.02%)  52.63% (5.24%)  52.27% (5.71%)
LogSum+L2  1         85.00% (1.73%)  82.76% (2.04%)  87.10% (1.78%)  92.10% (0.50%)
LogSum+L2  2         70.00% (2.83%)  69.09% (3.11%)  76.32% (2.71%)  71.77% (4.32%)
Numbers in parentheses are the standard deviations and the best results are highlighted in bold.

Table 3 shows the feature selection performance (β-sensitivity and β-specificity) of all competing regularization methods. The results are broadly similar to those in the previous two tables: for each value of ρ, the LogSum + L2 penalized logistic regression model recovers the greatest number of true features, with the highest β-sensitivity and β-specificity. Comparing the two values of ρ, the selection performance at ρ = 0.2 is consistently better than at ρ = 0.6.

Table 3.

Results of β-sensitivity, β-specificity obtained by four methods. (Numbers in parentheses are the standard deviations and the best results are highlighted in bold).

ρ = 0.2
Method     Scenario  β-Sensitivity   β-Specificity
L1         1         73.45% (2.95%)  99.90% (0.01%)
L1         2         71.53% (2.94%)  95.81% (0.52%)
LEN        1         73.16% (2.26%)  99.95% (0.01%)
LEN        2         71.31% (2.32%)  76.65% (2.73%)
HLR        1         74.62% (3.05%)  99.95% (0.01%)
HLR        2         73.15% (2.89%)  95.45% (1.92%)
LogSum+L2  1         82.57% (2.58%)  99.95% (0.01%)
LogSum+L2  2         80.11% (2.74%)  99.60% (0.01%)

ρ = 0.6
Method     Scenario  β-Sensitivity   β-Specificity
L1         1         64.18% (3.56%)  99.70% (0.01%)
L1         2         62.43% (4.62%)  95.00% (0.73%)
LEN        1         65.36% (3.63%)  99.95% (0.01%)
LEN        2         63.34% (4.13%)  76.00% (3.04%)
HLR        1         65.41% (3.81%)  99.90% (0.01%)
HLR        2         63.62% (4.51%)  95.96% (0.65%)
LogSum+L2  1         73.50% (2.92%)  99.85% (0.01%)
LogSum+L2  2         72.31% (3.86%)  99.24% (0.01%)

Analysis of real data

We use three publicly available lung cancer microarray datasets, downloaded from GEO (https://www.ncbi.nlm.nih.gov/geo/). Detailed information about them is given below:

  1. GSE10072: Series GSE10072 is a gene expression signature of cigarette smoking and its role in lung adenocarcinoma development and survival. Tobacco smoking causes 90% of lung cancer cases, but the molecular changes that lead to cancer development and affect survival are still unclear.

  2. GSE19188: Series GSE19188 is a gene expression dataset for early-stage non-small-cell lung carcinoma (NSCLC). Its 156 tumor and normal samples aggregate into the expected groups. A prognostic signature of 17 genes showed the best correlation with post-surgery survival time.

  3. GSE19804: Series GSE19804 is a dataset from a genome-wide screen of transcriptional modulation in non-smoking female lung cancer patients in Taiwan. Although smoking is a major risk factor for lung cancer, only 7% of female lung cancer patients in Taiwan have a history of smoking, much lower than the proportion among white women. To explain this phenomenon, researchers extracted RNA from paired tumor and normal tissues for gene expression analysis. This dataset and its reports comprehensively analyze the molecular characteristics of lung cancer in non-smoking women in Taiwan.

The GSE10072 dataset contains 22,284 microarray gene expression probes, while GSE19188 and GSE19804 both contain 54,675. As with the simulated data, we randomly divide each dataset so that 70% of the samples become training samples and 30% become test samples. A brief summary of these datasets is given in Table 4.

Table 4.

Three publicly available lung cancer gene expression datasets.

Dataset   No. of probes  Classes (Class1/Class2)  No. of samples (Class1/Class2)
GSE10072 22,284 Normal/Lung Cancer 107 (49/58)
GSE19188 54,675 Normal/Lung Cancer 156 (88/91)
GSE19804 54,675 Normal/Lung Cancer 120 (60/60)

Table 5 reports the average training and test accuracies obtained by the different variable selection methods on the three datasets. It is easy to see that the LogSum + L2 penalized logistic regression model performs better than the other three approaches. For example, in terms of training accuracy on the GSE10072 dataset, the LogSum + L2 approach reached 99.43%, while the other three methods achieved 98.32%, 99.04%, and 98.21%, respectively. On the GSE19188 dataset, the test accuracy of the LogSum + L2 method is 75%, against 51.46%, 47.56%, and 46.19% for the other three methods. Regarding the number of selected genes, the LogSum + L2 penalized logistic regression model always selects the fewest genes, while the HLR approach selects the most.

Table 5.

Training and test accuracy and number of selected genes of three lung cancer datasets in four methods.

Data      Method     Training accuracy  Test accuracy   No. selected genes
GSE10072  L1         98.32% (0.14%)     95.12% (0.31%)  23 (1.97)
GSE10072  HLR        99.04% (0.04%)     98.4% (0.17%)   72 (8.45)
GSE10072  LEN        98.21% (0.16%)     92.1% (0.94%)   11 (1.32)
GSE10072  LogSum+L2  99.43% (0.02%)     99.15% (0.08%)  7 (0.82)
GSE19188  L1         97.11% (0.21%)     51.46% (6.05%)  72 (9.33)
GSE19188  HLR        98.33% (0.09%)     47.56% (7.41%)  121 (10.34)
GSE19188  LEN        96.3% (0.28%)      46.19% (5.23%)  17 (2.03)
GSE19188  LogSum+L2  99.25% (0.01%)     75% (3.44%)     10 (1.21)
GSE19804  L1         99.05% (0.02%)     95.2% (0.61%)   37 (4.32)
GSE19804  HLR        99.05% (0.02%)     94.6% (0.64%)   70 (7.73)
GSE19804  LEN        97.14% (0.22%)     96.6% (0.58%)   9 (1.03)
GSE19804  LogSum+L2  99.41% (0.01%)     98.45% (0.23%)  6 (0.82)
Numbers in parentheses are the standard deviations and the best results are highlighted in bold.

To find the common gene signatures selected by the different methods, we used the VENNY software to generate Venn diagrams. As shown in Fig. 4, we consider the gene signatures commonly selected by the logistic regression model with the L1, LEN, HLR, and LogSum + L2 regularizations to be the signatures most relevant to lung cancer. Many genes selected by the LogSum + L2 penalized logistic regression model do not appear in the results of the other three regularization methods: the LogSum + L2 approach selects 5, 6, and 3 unique genes from the GSE10072, GSE19188, and GSE19804 datasets, respectively. This means that, compared with the other three regularization methods, the LogSum + L2 penalized logistic regression model can find different genes and pathways related to lung cancer.

Figure 4. Venn diagram analysis of the results of the L1, LEN, HLR, and LogSum + L2 regularization methods.

Figures 5, 6 and 7 show the interaction networks of all the features selected by the LogSum + L2 penalized logistic regression model. The integrative networks among these selected features were generated with cBioPortal from public lung cancer datasets. Circles with thick borders represent the selected genes, while the remaining circles are gradient color-coded according to their alteration frequencies in the databases. Hexagons represent target drugs; those in yellow represent drugs approved by the FDA. Links connecting selected genes indicate regulation correlations with a group effect.

Figure 5. Maximum integrative network of features selected by the LogSum + L2 penalized logistic regression model in the GSE10072 dataset.

Figure 6. Maximum integrative network of features selected by the LogSum + L2 penalized logistic regression model in the GSE19188 dataset.

Figure 7. Maximum integrative network of features selected by the LogSum + L2 penalized logistic regression model in the GSE19804 dataset.

In the GSE10072 dataset (Fig. 5), we find the gene EGFR, which has been confirmed as an important target gene of NSCLC [40]. It belongs to the ERBB receptor tyrosine kinase family, which includes other genes such as HER2, HER3, and HER4. Owing to the observed patterns of oncogenic mutation of EGFR and HER2, many studies report them as an attractive option for targeted therapy in patients with NSCLC.

As shown in Fig. 6, three important genes (TUBB1, PRKD1, and STK11) were selected from the GSE19188 dataset, with PRKD1 and STK11 showing a regulation correlation with a group effect. In fact, many drugs have been developed to target TUBB1, and several studies report that PRKD1 and STK11 significantly influence patients' survival rates across all tumors [41].

As shown in Fig. 7, four important genes (EPCAM, SMC3, HIST1H2BL, and LMNA) and their regulation correlations with a group effect were selected from the GSE19804 dataset. Many studies report that the epithelial cell adhesion molecule (EPCAM), a tumor-associated calcium signal transducer, behaves as a true oncogene, and the relationship between EPCAM and NSCLC has been studied [42].

Table 6 summarizes the genes selected by the LogSum + L2 penalized logistic regression model. At the beginning of the experiments, the genes are identified by probe set ID; we therefore converted probe set IDs to gene symbols using the DAVID website (https://david.ncifcrf.gov). According to the experimental results, the LogSum + L2 penalized logistic regression model can find some unique genes that are not identified by the other regularization models but are significantly related to the disease. Therefore, we believe that the LogSum + L2 penalized logistic regression model can accurately and efficiently identify cancer-related genes.

Table 6.

Genes selected by the LogSum + L2 penalized logistic regression model for the different datasets.

Probe_ID Gene symbol Gene name
Dataset: GSE10072
201839_s_at EPCAM Epithelial cell adhesion molecule (EPCAM)
200685_at SRSF11 Serine and arginine rich splicing factor 11(SRSF11)
204600_at EPHB3 EPH receptor B3(EPHB3)
205297_s_at CD79B CD79b molecule (CD79B)
202932_at YES1 YES proto-oncogene 1, Src family tyrosine kinase (YES1)
201983_s_at EGFR Epidermal growth factor receptor (EGFR)
201596_x_at KRT18 Keratin 18(KRT18)
Dataset: GSE19188
204292_x_at STK11 Serine/threonine kinase 11(STK11)
205880_at PRKD1 Protein kinase D1(PRKD1)
208694_at PRKDC Protein kinase, DNA-activated, catalytic polypeptide (PRKDC)
205868_s_at PTPN11 Protein tyrosine phosphatase, non-receptor type 11(PTPN11)
214250_at NUMA1 Nuclear mitotic apparatus protein 1(NUMA1)
231657_s_at CCDC74A Coiled-coil domain containing 74A(CCDC74A)
220939_s_at DPP8 Dipeptidyl peptidase 8(DPP8)
210704_at FEZ2 Fasciculation and elongation protein zeta 2(FEZ2)
208601_s_at TUBB1 Tubulin beta 1 class VI(TUBB1)
207660_at DMD Dystrophin (DMD)
Dataset: GSE19804
1553655_at CDC20B Cell division cycle 20B(CDC20B)
201839_s_at EPCAM Epithelial cell adhesion molecule (EPCAM)
1552370_at C4ORF33 Chromosome 4 open reading frame 33(C4orf33)
1556925_at SMC3 Structural maintenance of chromosomes 3(SMC3)
207611_at HIST1H2BL Histone cluster 1 H2B family member l(HIST1H2BL)
1554600_s_at LMNA Lamin A/C(LMNA)

Discussion and conclusion

Successful identification of gene biomarkers and biological pathways can significantly improve the accuracy of diagnosis and help machine learning models achieve better performance in classifying different types of cancer. Many researchers have used logistic regression with optimization methods for binary cancer classification. However, the traditional logistic regression model has two obvious shortcomings: feature selection and overfitting problems. In this paper, we proposed the LogSum + L2 penalized logistic regression model. Our proposed method can not only select sparse features (biomarkers), but also identify groups of relevant features (gene pathways). A coordinate descent algorithm is used to solve the LogSum + L2 penalized logistic regression model. We also evaluated the capability of our proposed method and compared its performance with other regularization methods. The results of simulations and real experiments indicate that the proposed method is highly competitive among several state-of-the-art methods. A disadvantage of the proposed method is that its three regularization parameters need to be tuned by k-fold cross-validation.

In recent years, a growing number of associations between microRNAs (miRNAs) and human diseases have been identified. Based on the accumulating biological data, many computational models for inferring potential miRNA-disease associations have been developed [43-46]. As a future direction of our research, we will apply the proposed LogSum + L2 penalized logistic regression model to identify non-coding RNA biomarkers of human complex diseases.

Author contributions

X.Y.L. and S.B.W. conceived the conception, designed and developed the method, acquired and analyzed the data and result. W.Q.Z., Z.J.Y. and H.B.X. wrote, reviewed and revised the manuscript. X.Y.L. is the correspondence author. All authors have read and approved the final manuscript.

Funding

This work is supported in part by the Key Project for University of the Department of Education of Guangdong Province of China Funds (Natural) under Grant.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  1. Guyon I, Weston J, Barnhill S, Vapnik V. Gene selection for cancer classification using support vector machines. Mach. Learn. 2002;46(1-3):389-422. doi: 10.1023/A:1012487302797.
  2. Heller MJ. DNA microarray technology: Devices, systems, and applications. Annu. Rev. Biomed. Eng. 2002;4(1):129-153. doi: 10.1146/annurev.bioeng.4.020702.153438.
  3. Greenbaum D, Colangelo C, Williams K, Gerstein M. Comparing protein abundance and mRNA expression levels on a genomic scale. Genome Biol. 2003;4(9):1-8. doi: 10.1186/gb-2003-4-9-117.
  4. Hawkins DM. The problem of overfitting. J. Chem. Inf. Comput. Sci. 2004;44(1):1-12. doi: 10.1021/ci0342472.
  5. Dudoit S, Fridlyand J, Speed TP. Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Stat. Assoc. 2002;97(457):77-87. doi: 10.1198/016214502753479248.
  6. Li T, Zhang C, Ogihara M. A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics. 2004;20(15):2429-2437. doi: 10.1093/bioinformatics/bth267.
  7. Lee JW, Lee JB, Park M, Song SH. An extensive comparison of recent classification tools applied to microarray data. Comput. Stat. Data Anal. 2005;48(4):869-885. doi: 10.1016/j.csda.2004.03.017.
  8. Ding C, Peng H. Minimum redundancy feature selection from microarray gene expression data. J. Bioinform. Comput. Biol. 2005;3(02):185-205. doi: 10.1142/S0219720005001004.
  9. Monari G, Dreyfus G. Withdrawing an example from the training set: An analytic estimation of its effect on a non-linear parameterised model. Neurocomputing. 2000;35(1-4):195-201. doi: 10.1016/S0925-2312(00)00325-8.
  10. Rivals I, Personnaz L. MLPs (mono-layer polynomials and multi-layer perceptrons) for nonlinear modeling. J. Mach. Learn. Res. 2003;3:1383-1398.
  11. Liu XY, Liang Y, Wang S, Yang ZY, Ye HS. A hybrid genetic algorithm with wrapper-embedded approaches for feature selection. IEEE Access. 2018;6:22863-22874. doi: 10.1109/ACCESS.2018.2818682.
  12. Guyon I, Elisseeff A. An introduction to variable and feature selection. J. Mach. Learn. Res. 2003;3:1157-1182.
  13. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001;96(456):1348-1360. doi: 10.1198/016214501753382273.
  14. Zhang HH, Lu W. Adaptive Lasso for Cox's proportional hazards model. Biometrika. 2007;94(3):691-703. doi: 10.1093/biomet/asm037.
  15. Zhang CH. Nearly unbiased variable selection under minimax concave penalty. Ann. Stat. 2010;38(2):894-942. doi: 10.1214/09-AOS729.
  16. Rosset S, Zhu J. Piecewise linear regularized solution paths. Ann. Stat. 2007;35:1012-1030. doi: 10.1214/009053606000001370.
  17. Xu Z, Zhang H, Wang Y, Chang X, Liang Y. L1/2 regularization. Sci. China Inf. Sci. 2010;53(6):1159-1169. doi: 10.1007/s11432-010-0090-0.
  18. Xu Z, Chang X, Xu F, Zhang H. L1/2 regularization: A thresholding representation theory and a fast solver. IEEE Trans. Neural Netw. Learn. Syst. 2012;23(7):1013-1027. doi: 10.1109/TNNLS.2012.2197412.
  19. Candes EJ, Wakin MB, Boyd SP. Enhancing sparsity by reweighted L1 minimization. J. Fourier Anal. Appl. 2008;14(5-6):877-905. doi: 10.1007/s00041-008-9045-x.
  20. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc.: Ser. B (Stat. Methodol.) 2006;68(1):49-67. doi: 10.1111/j.1467-9868.2005.00532.x.
  21. Zou H, Hastie T. Regularization and variable selection via the elastic net. J. R. Stat. Soc.: Ser. B (Stat. Methodol.) 2005;67(2):301-320. doi: 10.1111/j.1467-9868.2005.00503.x.
  22. Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. Ann. Stat. 2004;32(2):407-499. doi: 10.1214/009053604000000067.
  23. Fan J, Li R. Variable selection for Cox's proportional hazards model and frailty model. Ann. Stat. 2002;30:74-99. doi: 10.1214/aos/1015362185.
  24. Zou H, Zhang HH. On the adaptive elastic-net with a diverging number of parameters. Ann. Stat. 2009;37(4):1733. doi: 10.1214/08-AOS625.
  25. Zeng L, Xie J. Group variable selection via SCAD-L2. Statistics. 2014;48(1):49-66. doi: 10.1080/02331888.2012.719513.
  26. Huang HH, Liu XY, Liang Y. Feature selection and cancer classification via sparse logistic regression with the hybrid L1/2+2 regularization. PLoS ONE. 2016;11(5):e0149675. doi: 10.1371/journal.pone.0149675.
  27. Furey TS, Cristianini N, Duffy N, Bednarski DW, Schummer M, Haussler D. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics. 2000;16(10):906-914. doi: 10.1093/bioinformatics/16.10.906.
  28. Medjahed SA, Saadi TA, Benyettou A. Breast cancer diagnosis by using k-nearest neighbor with different distances and classification rules. Int. J. Comput. Appl. 2013;62(1):1-5.
  29. Zhou X, Liu KY, Wong ST. Cancer classification and prediction using logistic regression with Bayesian gene selection. J. Biomed. Inform. 2004;37(4):249-259. doi: 10.1016/j.jbi.2004.07.009.
  30. Cawley GC, Talbot NL. Gene selection in cancer classification using sparse logistic regression with Bayesian regularization. Bioinformatics. 2006;22(19):2348-2355. doi: 10.1093/bioinformatics/btl386.
  31. Algamal ZY, Lee MH. A two-stage sparse logistic regression for optimal gene selection in high-dimensional microarray data classification. Adv. Data Anal. Classif. 2019;13(3):753-771. doi: 10.1007/s11634-018-0334-1.
  32. Algamal Z. An efficient gene selection method for high-dimensional microarray data based on sparse logistic regression. Electron. J. Appl. Stat. Anal. 2017;10(1):242-256.
  33. Shevade SK, Keerthi SS. A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics. 2003;19(17):2246-2253. doi: 10.1093/bioinformatics/btg308.
  34. Tibshirani R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc.: Ser. B (Methodol.) 1996;58(1):267-288.
  35. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 2010;33(1):1. doi: 10.18637/jss.v033.i01.
  36. Algamal ZY, Lee MH. Penalized logistic regression with the adaptive LASSO for gene selection in high-dimensional cancer classification. Expert Syst. Appl. 2015;42(23):9326-9332. doi: 10.1016/j.eswa.2015.08.016.
  37. Algamal ZY, Lee MH. Regularized logistic regression with adjusted adaptive elastic net for gene selection in high dimensional cancer classification. Comput. Biol. Med. 2015;67:136-145. doi: 10.1016/j.compbiomed.2015.10.008.
  38. Liang Y, Liu C, Luan XZ, Leung KS, Chan TM, Xu ZB, Zhang H. Sparse logistic regression with a L1/2 penalty for gene selection in cancer classification. BMC Bioinform. 2013;14(1):198. doi: 10.1186/1471-2105-14-198.
  39. Xia LY, Wang YW, Meng DY, Yao XJ, Chai H, Liang Y. Descriptor selection via log-sum regularization for the biological activities of chemical structure. Int. J. Mol. Sci. 2018;19(1):30. doi: 10.3390/ijms19010030.
  40. Jänne PA, Yang JCH, Kim DW, Planchard D, Ohe Y, Ramalingam SS, Haggstrom D. AZD9291 in EGFR inhibitor-resistant non-small-cell lung cancer. N. Engl. J. Med. 2015;372(18):1689-1699. doi: 10.1056/NEJMoa1411817.
  41. Nath A, Chan C. Genetic alterations in fatty acid transport and metabolism genes are associated with metastatic progression and poor prognosis of human cancers. Sci. Rep. 2016;6:18669. doi: 10.1038/srep18669.
  42. Pak MG, Shin DH, Lee CH, Lee MK. Significance of EpCAM and TROP2 expression in non-small cell lung cancer. World J. Surg. Oncol. 2012;10(1):53. doi: 10.1186/1477-7819-10-53.
  43. Chen X, Wang L, Qu J, Guan NN, Li JQ. Predicting miRNA-disease association based on inductive matrix completion. Bioinformatics. 2018;34(24):4256-4265. doi: 10.1093/bioinformatics/bty503.
  44. Chen X, Xie D, Zhao Q, You ZH. MicroRNAs and complex diseases: From experimental results to computational models. Brief. Bioinform. 2019;20(2):515-539. doi: 10.1093/bib/bbx130.
  45. Chen X, Yin J, Qu J, Huang L. MDHGI: Matrix Decomposition and Heterogeneous Graph Inference for miRNA-disease association prediction. PLoS Comput. Biol. 2018;14(8):e1006418. doi: 10.1371/journal.pcbi.1006418.
  46. Chen X, Yan CC, Zhang X, You ZH. Long non-coding RNAs and complex diseases: From experimental results to computational models. Brief. Bioinform. 2017;18(4):558-576. doi: 10.1093/bib/bbw060.
