Abstract
Precise cancer subtype and/or stage prediction is instrumental for cancer diagnosis, treatment and management. However, most of the existing methods based on genomic profiles suffer from issues such as overfitting, high computational complexity and selected features (i.e., genes) not directly related to forecast precision. These deficiencies are largely due to the nature of “high dimensionality and small sample size” inherent in molecular data, and such a nature is often deemed as an obstacle to the application of deep learning, e.g., deep neural networks (DNNs), to biomedicine and cancer research. In this paper, we propose a DNN-based algorithm coupled with a new embedded feature selection technique, named Dropfeature-DNNs, to address these issues. Dropfeature-DNNs can discard some irrelevant features (i.e., genes) when training DNNs, and we formulate Dropfeature-DNNs as an iterative AUC optimization problem. As such, an “optimal” feature subset that contains meaningful genes for accurate tumor subtype and/or stage prediction can be obtained when the AUC optimization converges in the training stage. Since the feature subset and AUC optimizations are synchronous with the training phase of DNNs, model complexity and computational cost are simultaneously reduced. Rigorous feature subset convergence analysis and error bound inference provide a solid theoretical foundation for the proposed method. Extensive empirical comparisons to benchmark methods further demonstrate the efficacy of Dropfeature-DNNs in cancer subtype and/or stage prediction using HDSS gene expression data from multiple cancer types.
Index Terms: Deep neural networks, cancer subtype, tumor stage, high dimensionality, Dropfeature-DNNs
1. Introduction
Cancer is one of the leading causes of death and a major public health problem worldwide. The American Cancer Society estimates that in 2020 there will be approximately 1.8 million new cancer cases diagnosed and 606,520 cancer deaths in the United States [1]. Compared to other diseases, cancer is a collection of heterogeneous diseases. For example, seven molecular subtypes of prostate cancer, each characterized by cancer-driving gene fusions or genetic mutations, have been reported recently [2]. These subtypes affect treatment response, patient prognosis and recurrence. In addition, a patient’s treatment plan and survival outlook also depend on how much cancer is within the body (tumor size) and how far it has spread at the time of diagnosis, which is known as tumor staging [3], [4], [5]. Numerous clinical trials have shown that accurate cancer subtyping and/or tumor staging promotes the selection of patients who benefit most from specific therapies and the design of novel targeted drugs [6], [7], [8], [9]. Therefore, the ability to precisely stratify or classify cancer patients based on the specific characteristics of their cancer type and/or tumor stage is the first vital step toward understanding cancer etiology and developing personalized cancer treatment and therapies.
With the advent of high-throughput sequencing, gene expression data has been widely used to complement the traditional clinical methods in accurately identifying and predicting tumor subtypes and pathological stages of cancers [10], [11], [12]. This practice brings considerable benefits to cancer diagnosis, management and therapy. However, classification of cancer patients based on genomic data is far from trivial. It not only demands the careful preparation, collection and measurement of many tumor samples, but also requires novel and sophisticated computational and statistical approaches to pinpoint relevant groups of samples as well as biomarkers that enable detection of these groups in the clinic. Over the past decades, numerous pertinent investigations and medical practices have been performed [13], [14], [15], [16], [17], [18], [19]. In most of these studies, traditional statistical and machine learning methods, including the paired-samples test algorithm (PST) [14], unsupervised hierarchical clustering [15], and Case Based Reasoning (CBR) with gradient boosting based feature selection [18], have been employed. These methods, although effective in certain contexts, are generally limited in dealing with gene expression data, which is often characterized by high dimensionality and small sample size (HDSS) and by multiple biological categories. For example, the aforementioned methods trained on HDSS genomic data often exhibit serious overfitting or deteriorated and unstable performance due to high variance in the model itself and insufficient training on limited samples.
Deep learning (DL) represents a set of revolutionary machine learning techniques and is becoming the mainstream method in the fields of computer vision [20], natural language processing [21], and biomedical informatics [22], [23], [24], [25], [26], [27], among others. The main advantages of deep learning include high-level feature abstraction, automatic discovery of intricate structure from large amounts of data, and state-of-the-art performance [28]. Nevertheless, the promising performance achieved by deep learning often requires large numbers of samples, making this collection of methods not readily applicable to HDSS genomic data. To address this issue, several studies have combined feature selection techniques with deep learning methods. For example, Li et al. [29] develop a deep feature selection model to choose input features for multiclass data, where an elastic net is utilized in the input layer for feature selection. However, the optimization of elastic net regularization significantly prolongs the training phase of deep learning. Ibrahim et al. [30] use deep belief networks (DBNs) and unsupervised active learning to select genes/miRNAs based on expression profiles. However, for high-dimensional data, this method underperforms even the traditional Recursive Feature Elimination (RFE) algorithm [13], [31]. Fakoor et al. [32] propose an unsupervised feature learning approach for cancer diagnosis and classification using gene expression data. Principal component analysis (PCA) is utilized for dimensionality reduction and deep autoencoders (DAEs) are implemented for high-level feature extraction, which makes the model somewhat difficult to interpret. More recently, Liu et al. [33] propose a deep neural pursuit (DNP) method to handle HDSS data. However, this method obtains unstable feature selection results on extremely high-dimensional data (e.g., approximately 20,000 features).
The limitations and undesired results of the aforementioned methods inspired us to develop a new feature selection technique that can leverage the power of deep learning to achieve better prediction performance on HDSS data.
In this paper, we propose a DNN-based algorithm coupled with a novel embedded feature selection technique, named Dropfeature-DNNs, to handle HDSS gene expression data often encountered by tumor subtype and/or stage prediction tasks. Dropfeature-DNNs can heuristically reduce the original feature space or drop some irrelevant features (i.e., genes) when calculating the gradients for training deep neural networks (DNNs). The main idea of Dropfeature-DNNs is to iteratively determine an “optimal” feature subset with the goal of maximizing DNNs’ classification AUC in the training stage. As such, a binary Dropfeature vector ϵt, which can dynamically track the selected AUC-relevant features, is defined and integrated into the input layer of DNNs. Dropfeature-DNNs is a simple yet effective method to handle HDSS genomic data. This method, on one hand, can substantially alleviate overfitting via its explicit and dynamic dimensionality reduction mechanism. On the other hand, through iteratively feeding DNNs with features truly contributing to the model’s forecast precision, Dropfeature-DNNs is able to not only decrease the variance of gradients due to the limited sample size, but also capture the non-linear interactions of such features via abstract high-level representations, leading to better generalization performance. Moreover, theoretical analyses show that Dropfeature-DNNs has a high chance to achieve the “optimal” feature subset. Empirically, such a feature subset indexed by ϵt can be obtained over multiple training iterations as soon as the AUC optimization converges. Since these two processes are completely synchronous with the training phase of DNNs, Dropfeature-DNNs is expected to have a lower model complexity, reduced computational cost and high interpretability, which are also exhibited in the experimental studies. 
Extensive empirical comparisons to benchmark methods further demonstrate the efficacy of Dropfeature-DNNs in cancer subtype and/or stage prediction using HDSS gene expression data from different cancer types.
The rest of this paper is organized as follows. We summarize the related work in Section 2. We elaborate on the proposed Dropfeature-DNNs method in Section 3. The experimental results are presented in Section 4. The conclusion is stated in Section 5.
2. Related Work
In this section, we summarize the conventional and deep feature selection methods for high-dimensional data analysis, with a focus on methods particularly developed for handling the HDSS data.
In general, feature selection methods can be classified into three categories: filters, wrappers and embedded methods. The filter methods (e.g., RELIEF [34], Fisher score [35]) focus on the intrinsic properties of data and are independent of the construction of classifiers. Conversely, wrappers (e.g., SVM-RFE [31]) utilize classifiers to score a subset of features, while embedded methods (e.g., HSIC-LASSO [36] (LASSO for short), GBFS [37]) incorporate the feature selection process directly into the learning process. However, these methods demand large samples for model training. Although some of them (e.g., HSIC-LASSO [36] and GBFS [37]) fit into the HDSS setting, they are two-stage algorithms that separate feature selection from the classification process.
Recent deep learning based methods provide a creative way to incorporate regularization techniques for deep feature selection in an end-to-end learning paradigm [38]. For example, an elastic net regularization term is added to the loss function of a deep model to identify the important features of regulatory events [29]. The deep models with group sparse regularization, group LASSO and ℓ0 regularization proposed in [39], [40] and [41], respectively, offer a new procedure for feature selection in large-scale classification problems. Chen et al. [42] develop a novel deep learning method named DeepType to identify cancer subtypes using high-dimensional genomic data. DeepType performs joint supervised classification, unsupervised clustering and dimensionality reduction through cross-entropy minimization and ℓ2,1-norm regularization. However, these methods generally still require large sample sizes, and the optimization of the loss function somewhat suffers from high variance.
In addition to the regularization techniques, some deep learning methods also incorporate more advanced approaches for deep feature selection. For example, the concrete autoencoder introduced in [43] for global feature selection identifies a subset of the most informative features and simultaneously trains a neural network to reconstruct the input data from the selected features. The main deficiency of this method is that the number of selected features needs to be determined in advance. Deep feature selection using paired-input nonlinear knockoffs (DeepPINK) [44] is proposed to increase the interpretability and reproducibility of DNNs. As a variant of DeepPINK [44], deep group-feature selection using knockoffs (Deep-gKnock) [45] is developed to train DNNs for model interpretation and dimension reduction. However, the selected features bounded by the controlled error rate in DeepPINK and group-wise false discovery rate in Deep-gKnock are not sparse for the model’s interpretability, leading to some irrelevant features being chosen. In addition, DNP [33] is deliberately proposed to solve the HDSS problem by taking the average over multiple dropouts to calculate gradients with low variance. Although somewhat similar to DNP, Dropfeature-DNNs is a more efficient and compelling method to classify HDSS genomic data using a simple yet effective dynamic feature selection procedure and three-way synchronization.
3. Proposed Method
In this section, we present the Dropfeature-DNNs method and its implementation algorithms. Moreover, two lemmas and two theorems are provided to exhibit the convergence and error bound of Dropfeature-DNNs.
3.1. Dropfeature-DNNs: The Method
Given a gene expression matrix X ∈ ℝ^{n×d}, where n is the number of samples and d is the number of genes or features, we design a feature selection mechanism named Dropfeature to randomly choose and update features for DNNs training by iteratively optimizing the classification AUC. Let ϵ denote a Dropfeature vector of d dimensions, i.e., ϵ = [ϵ1, …, ϵj, …, ϵd]^T ∈ {0, 1}^d (ϵj = 1 indicates that the j-th feature is selected; otherwise it is dropped). The updated features for the next round of DNNs training are obtained by xi ⊙ ϵ, where xi is the i-th row of X and the operator ⊙ denotes element-wise multiplication.
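As a minimal illustration of the masking step, the sketch below applies a binary Dropfeature vector to each row of a toy expression matrix by element-wise multiplication. The function name `apply_dropfeature` and all numeric values are hypothetical and serve only to illustrate the operation xi ⊙ ϵ.

```python
from typing import List


def apply_dropfeature(X: List[List[float]], eps: List[int]) -> List[List[float]]:
    """Return the element-wise product of every row x_i with the binary mask eps."""
    if any(len(row) != len(eps) for row in X):
        raise ValueError("each sample must have as many features as eps has entries")
    return [[value * keep for value, keep in zip(row, eps)] for row in X]


# Toy matrix with n = 2 samples and d = 4 genes (values are made up).
X = [[0.5, 1.2, 3.0, 0.1],
     [2.0, 0.0, 1.5, 0.7]]
eps = [1, 0, 1, 0]  # keep genes 1 and 3, drop genes 2 and 4
masked = apply_dropfeature(X, eps)
```

Dropped genes contribute zeros to the input layer, so the weights attached to them receive no gradient from the loss term.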
Fig. 1 outlines the flow of the proposed Dropfeature-DNNs method. Specifically, by initializing ϵ1, we train a feedforward DNN on a small feature subset of X given by xi ⊙ ϵ1. Through the back-propagation algorithm [46] with the overall loss minimized, we update the parameters of DNNs and obtain a classification AUC score. Then, according to the proposed Dropfeature-DNNs algorithm, we heuristically update ϵ2 to achieve an improved AUC score. This process continues until no significant improvement is achieved in the AUC. In this way, we can continuously feed the DNNs with the heuristically selected genes over multiple training iterations. As such, Dropfeature-DNNs is able to reduce the variance when calculating gradients of DNNs and to further alleviate overfitting and insufficient training.
Fig. 1:
The overview of the proposed Dropfeature-DNNs method to classify cancer subtypes and/or tumor stages using HDSS gene expression data.
Let 𝒟 denote the joint distribution of (X, Y) and 𝒟_X denote the marginal distribution of X, where Y ∈ {0, 1}^{n×m} is the m-class label matrix of n samples with each yi belonging to one class. With the Dropfeature mechanism, the objective function for optimizing DNNs is defined as follows.
(1) |
where f(xi ⊙ ϵ; θ) denotes the prediction function determined by DNNs, θ denotes the parameters of DNNs, Θ is the parameter space, and ℓ(·) is the cross-entropy loss function that measures the inconsistency between the prediction and y. R(·) is the regularization term (e.g., in our experiments) for avoiding overfitting, and η is the regularization rate. Eq. (1) aims to optimize the expected loss for empirical risk minimization.
We use the stochastic gradient descent (SGD) algorithm [47] to minimize the objective function in Eq. (1), and the update rule of θ can be expressed by
(2) |
where t is the current iteration index.
Specifically, the update rule for iteratively solving Eq. (1) can be expressed by
(3) |
where , , and .
Moreover, Dropfeature-DNNs with L hidden layers has the following architecture
(4) |
(5) |
(6) |
(7) |
where θ = {Win, W1, …, WL, Wout, bin, b1, …, bL, bout}, Zout and Zl are the output and hidden neurons with the corresponding weight matrices Wout, Wl, Win, and bias vectors bout, bl, bin (l = 1, 2, …, L). σ(·) is the activation function such as ReLU, sigmoid, or tanh. softmax(·) is the softmax function converting values of the output layer into predicted probabilities,
(8) |
where represents the true label vector of the k-th class (k = 1, 2, …, m), which is the k-th basis vector of .
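The layer-wise computation described above can be sketched as follows, assuming a standard fully connected forward pass: the masked input passes through sigmoid hidden layers, the output layer is converted to probabilities by softmax, and the loss is the usual cross-entropy against a one-hot label. The function names and the exact indexing of the weights are assumptions of this sketch and may differ in detail from Eqs. (4)–(8).

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def softmax(z: Sequence[float]) -> List[float]:
    shift = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def dense(W: Matrix, b: Sequence[float], z: Sequence[float]) -> List[float]:
    """Affine map W z + b, with one row of W per output neuron."""
    return [sum(w * v for w, v in zip(row, z)) + bias for row, bias in zip(W, b)]


def forward(x: Sequence[float], eps: Sequence[int],
            hidden: List[Tuple[Matrix, List[float]]],
            W_out: Matrix, b_out: List[float]) -> List[float]:
    """Class probabilities for one sample after masking its features with eps."""
    z = [xi * e for xi, e in zip(x, eps)]          # masked input x ⊙ eps
    for W, b in hidden:                            # hidden layers with sigmoid activation
        z = [sigmoid(v) for v in dense(W, b, z)]
    return softmax(dense(W_out, b_out, z))         # output layer


def cross_entropy(probs: Sequence[float], one_hot: Sequence[int]) -> float:
    """Cross-entropy between predicted probabilities and a one-hot label vector."""
    return -sum(y * math.log(p) for p, y in zip(probs, one_hot) if y)


# Toy network with all-zero weights: 3 genes, one hidden layer of 2 units, 3 classes.
hidden = [([[0.0] * 3, [0.0] * 3], [0.0, 0.0])]
probs = forward([1.0, 2.0, 3.0], [1, 0, 1], hidden, [[0.0, 0.0]] * 3, [0.0] * 3)
loss = cross_entropy(probs, [0, 1, 0])
```

With all-zero weights the network cannot distinguish the classes, so it returns the uniform distribution and the loss equals log 3, a useful sanity check before training.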
3.2. Dropfeature-DNNs: The Algorithms
We summarize the overall learning process of Dropfeature-DNNs in Algorithm 1, which invokes the proposed Dropfeature mechanism presented in Algorithm 2 to update the feature subset for the next-round of DNNs training.
Specifically, consider the full set of d genes. A greedy algorithm could be used to identify a subset of these genes that yields a classifier with better predictive performance than the full set. However, it is infeasible to train DNNs on every subset, as there are 2^d such subsets. We therefore propose the following strategy. We start with a random feature subset for DNNs training and calculate the AUC score A obtained by a learner on it. We then use the random walking strategy [48] to update the feature subset and the corresponding feature index set ϵ′. Specifically, starting from the current state S = 0, we update the following three states with equal probabilities using random walking: (1) state S = 1: we choose a random neighbor of the current subset by removing a subset of features g from it and simultaneously adding a subset of features g′ (|g| = |g′| ≥ 1) taken from the features outside it, and then compute the AUC score A1 obtained by the learner on this neighbor; (2) state S = 2: we choose a random neighbor by removing a subset of features g from the current subset and compute its AUC score A2; and (3) state S = 3: we choose a random neighbor by adding a subset of features g′ from outside the current subset and compute its AUC score A3. We then compare these AUC scores and update the feature subset as follows. If Ai = max{A, A1, A2, A3}, we move to the i-th neighbor and proceed with the search from there. If A = max{A, A1, A2, A3} and Ai = max{A1, A2, A3}, we stay with the current feature subset with probability ui or move to the i-th neighbor with probability (1 − ui), where ui = exp{−c(A − Ai)} and c > 0 is a model parameter. This step ensures that the method does not get stuck in a local minimum when optimizing DNNs. After that, we reinitialize the current state to S = 0 and repeat the above process until no significant improvement is achieved in the AUC score. Algorithm 2 summarizes the update procedure of Dropfeature-DNNs.
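The random-walk update can be sketched as below for the single-feature case |g| = |g′| = 1. This is one reading of the procedure, not the authors' implementation: it evaluates a single randomly chosen state per step, it applies the stay/move rule exactly as worded above (stay with probability ui), and it keeps at least one feature selected. The names `dropfeature_step` and `toy_auc` are hypothetical, and `toy_auc` is a made-up stand-in for training a DNN and measuring its AUC.

```python
import math
import random
from typing import Callable, FrozenSet


def dropfeature_step(current: FrozenSet[int], d: int,
                     auc_of: Callable[[FrozenSet[int]], float],
                     c: float, rng: random.Random) -> FrozenSet[int]:
    """One random-walk update of a non-empty feature subset with |g| = |g'| = 1."""
    selected = sorted(current)
    unselected = sorted(set(range(d)) - current)
    state = rng.choice([1, 2, 3])                 # three states, equal probability
    if state == 2 and len(selected) <= 1:
        return current                            # assumption: never empty the subset
    if state in (1, 3) and not unselected:
        return current                            # nothing left to add
    neighbor = set(current)
    if state in (1, 2):                           # states 1 and 2 remove one feature
        neighbor.remove(rng.choice(selected))
    if state in (1, 3):                           # states 1 and 3 add one feature
        neighbor.add(rng.choice(unselected))
    candidate = frozenset(neighbor)
    a_cur, a_new = auc_of(current), auc_of(candidate)
    if a_new >= a_cur:
        return candidate                          # neighbor attains the maximum AUC
    stay_prob = math.exp(-c * (a_cur - a_new))    # u_i as worded in the text
    return current if rng.random() < stay_prob else candidate


INFORMATIVE = frozenset({0, 1, 2})                # hypothetical "relevant" genes


def toy_auc(subset: FrozenSet[int]) -> float:
    """Made-up score: rewards informative genes, mildly penalizes the others."""
    return 0.5 + 0.15 * len(subset & INFORMATIVE) - 0.01 * len(subset - INFORMATIVE)


rng = random.Random(0)
subset = frozenset({5, 6})
for _ in range(200):
    subset = dropfeature_step(subset, 8, toy_auc, c=5.0, rng=rng)
```

In practice `auc_of` would involve (re)training the network on the candidate subset, which is where almost all of the cost lies; caching scores for previously visited subsets is a natural optimization under that assumption.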
Algorithm 1.
Dropfeature-DNNs: Overall Procedure for Training DNNs with Dropfeature
Input: Initial parameters (i.e., weight matrices and bias vectors) of DNNs, θ0; input data, x1, x2,…, xn; input target classes, y1, y2,…, yn; activation functions, σ(·); prediction function, f(·); loss function, ℓ(·); learning rate, λ; regularization rate, η. | |
Output: The learned set of parameters, θT+1. | |
1: | Initialize a feature subset and the corresponding feature index set ϵ0; |
2: | for t = 1, 2,…, T do |
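3: | Execute Algorithm 2 and get the updated feature index set ϵt; |
4: | Compute the derivative gt of the objective in Eq. (1) with respect to θ at θt, using the masked inputs xi ⊙ ϵt; |
5: | Update the parameter set, θt+1 = θt − λgt; |
6: | end for |
7: | return θT+1. |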
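The outer loop of Algorithm 1 can be illustrated with a deliberately simplified sketch. The assumptions are substantial and are made only to keep the example short: a shallow logistic model replaces the DNN, full-batch gradient descent replaces SGD, the regularizer is taken to be the squared ℓ2 norm, and the callable `update_mask` is a hypothetical stand-in for Algorithm 2.

```python
import math
from typing import Callable, List


def train_with_dropfeature(
    X: List[List[float]], y: List[int], eps0: List[int],
    update_mask: Callable[[List[int], List[float]], List[int]],
    T: int, lr: float, eta: float,
) -> List[float]:
    """Masked gradient descent for a shallow logistic model (illustrative only)."""
    d = len(eps0)
    theta = [0.0] * d
    eps = list(eps0)
    for _ in range(T):
        eps = update_mask(eps, theta)                 # stand-in for Algorithm 2
        grad = [2.0 * eta * w for w in theta]         # gradient of eta * ||theta||^2 (assumed R)
        for x_i, y_i in zip(X, y):
            z = [v * e for v, e in zip(x_i, eps)]     # masked input x_i ⊙ eps
            p = 1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(theta, z))))
            for j in range(d):
                grad[j] += (p - y_i) * z[j] / len(X)  # averaged logistic-loss gradient
        theta = [w - lr * g for w, g in zip(theta, grad)]  # theta_{t+1} = theta_t - lr * g_t
    return theta


# Toy data: gene 0 separates the two classes, gene 1 is noise (values are made up).
X = [[-1.0, 0.3], [-0.8, -0.2], [0.9, 0.1], [1.1, -0.4]]
y = [0, 0, 1, 1]
theta = train_with_dropfeature(X, y, [1, 0], lambda eps, th: eps, T=50, lr=0.5, eta=0.01)
```

Because gene 1 is masked throughout, its weight never leaves zero, which illustrates how dropping a feature removes it from the gradient computation.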
3.3. Dropfeature-DNNs: A Theoretical Perspective
We perform runtime analysis and convergence proof for the Dropfeature algorithm. In Algorithm 2, each run of the loop (Lines 4–24) takes O(d) time. This can be reduced to O(log d) by keeping the selected and unselected feature sets as balanced trees. We provide the following two lemmas and two theorems to exhibit the convergence and error bound of Dropfeature-DNNs. Theorems 3.2 and 3.4 indicate, respectively, that Dropfeature-DNNs achieves the “optimal” feature subset with high probability and maximizes the classification AUC with a solid theoretical bound. In the following, we assume that |g| = |g′| = 1, that is, only one feature is added to and/or removed from the feature subset. Similar results can be derived if |g| = |g′| ≥ 2.
We can conceive of a directed graph G(V, E) for the feature selection problem as follows: each subset of the d features is a node in G. From the node , there will be edges to its neighbors. In Algorithm 2, for the node , there are three types of neighbors. Let be a neighbor of . If is of the first state, then the number of features in and will be the same. Thus, there are such neighbors. If is of the second state, then has one more feature than . Finally, if is of the third state, then has one fewer feature than . Let and be the number of neighbors in of the second and third states, respectively. Similarly, we have and . As a result, it follows that there will be ≤ 3d neighbors for the node in the graph G(V, E).
Algorithm 2.
Dropfeature for Training DNNs
Input: The current feature subset; the feature index set, ϵ; the learning algorithm. | |
Output: The updated feature subset and its index set ϵ′. | |
1: | begin |
2: | Run the learning algorithm on the current feature subset and compute its AUC A; |
3: | Initialize the current state S = 0; |
4: | repeat |
5: | Use random walking to update the following three states with equal probabilities; |
6: | if state S = 1 then |
7: | Remove g and add g′ to get the first neighbor subset and ϵ1; |
8: | Run the learning algorithm on this neighbor and compute AUC A1; |
9: | else if state S = 2 then |
10: | Remove g to get the second neighbor subset and ϵ2; |
11: | Run the learning algorithm on this neighbor and compute AUC A2; |
12: | else if state S = 3 then |
13: | Add g′ to get the third neighbor subset and ϵ3; |
14: | Run the learning algorithm on this neighbor and compute AUC A3; |
15: | end if |
16: | if (Ai = max {A, A1, A2, A3}) then |
17: | Move to the i-th neighbor subset, set ϵ′ = ϵi and A = Ai, and search from it; |
18: | else if (A = max {A, A1, A2, A3}, and Ai = max {A1, A2, A3}) then |
19: | With probability ui, search from the current subset and keep ϵ′ = ϵ and A unchanged; |
20: | or |
21: | With probability 1 − ui, search from the i-th neighbor subset and set ϵ′ = ϵi and A = Ai; |
22: | end if |
23: | Reinitialize the current state S = 0; |
24: | until no significant improvement is achieved in AUC score. |
25: | return the updated feature subset and ϵ′. |
26: | end |
Algorithm 2 starts from a random node in G and performs a random walk in this graph. The next node to be visited is drawn from one of the three states with equal probability and depends only on the current node. We can thus model the Dropfeature process as a time-homogeneous Markov chain.
Note that for two nodes and in G, there is a directed path from to and hence G is strongly connected. For the node in G, let be the classification accuracy using the feature subset corresponding to . If is the reference node, and is a neighbor of and is of state S = i (i = 1, 2, 3), then the transition probability from to is given by:
(9) |
and .
The feature subset selected by Dropfeature-DNNs will converge if the underlying Markov chain is in the “optimal” feature subset at least once. Let be the starting node of the Markov chain and let be the “optimal” feature subset. Then there is a path from to of length ≤ d.
Let . Also, let the degree and diameter of G(V, E) be N and D, respectively. Clearly, N ≤ 3d and D ≤ d. Note that, if is a neighbor of , then . We can derive that Dropfeature-DNNs has a high chance to achieve the “optimal” feature subset as summarized in Lemma 3.1 and Theorem 3.2. The detailed proofs are provided in Supplementary files S1 and S2, respectively.
Lemma 3.1. If is a state in V , then the expected number of steps before the “optimal” feature subset is visited starting from is .
Theorem 3.2. For any integer m ≥ 1, the feature subset selected by Dropfeature-DNNs converges to the “optimal” feature subset in time ≤ 2m(3N exp{cΔ})^D with probability ≥ 1 − 2^{−m}, independent of the start state.
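As a small worked example of how the confidence parameter m in Theorem 3.2 might be chosen, suppose a success probability of at least 1 − δ is required, where the target failure probability δ is a hypothetical symbol introduced here only for illustration:

```latex
1 - 2^{-m} \ge 1 - \delta
\;\iff\; m \ge \log_2 \frac{1}{\delta},
\qquad \text{e.g. } \delta = 0.01 \;\Rightarrow\; m = \lceil \log_2 100 \rceil = 7,
\quad 1 - 2^{-7} = 0.9921875 .
```

Since m enters the time bound only through the leading factor 2m, tightening the confidence in this way increases the bound linearly in m.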
The representation capability of DNNs can be reflected by the broad learning with a shallow architecture [49]. Therefore, we can restrict the learning process of Dropfeature-DNNs to shallow learning, i.e., f(xi ⊙ ϵ; θ) = θ^T(xi ⊙ ϵ). Furthermore, the update rule for iteratively solving the objective function can be expressed by
(10) |
We then derive Lemma 3.3 and Theorem 3.4 for the error bound of Dropfeature-DNNs. The detailed proofs can be located in Supplementary files S3 and S4, respectively.
Lemma 3.3. Let θt+1 = θt − λgt and θ1 = 0, then for any ∥θ*∥2 ≤ R, we have
(11) |
Theorem 3.4. Let be the expected risk of θ, and assume that and the loss function is C-Lipschitz continuous. For any ∥θ*∥2 ≤ R, by appropriately choosing λ, we have
(12) |
where is taking expectation over the randomness of (xi, yi, ϵt), i = 1, 2, …, n, and t ∈ [T] = {1, 2, …, T}.
4. Experimental Results
In this section, we empirically demonstrate the efficacy of Dropfeature-DNNs through a series of comparative studies, functional enrichment analysis, running time comparison and parameter sensitivity analysis.
4.1. Datasets and Evaluation Metrics
Seven microarray and RNA-Seq gene expression datasets of different cancer types are used in the empirical comparison. We downloaded the microarray datasets (i.e., ALL, GCM, NCI Tumors, and Lung Cancer) from CPD and GEMS, and the TCGA RNA-Seq datasets (i.e., Colorectal Adenocarcinoma (COAD), Uterine Corpus Endometrial Carcinoma (ENCA), and Prostate Adenocarcinoma (PRAD)) from cBioPortal. Table 1 summarizes the characteristics of these datasets, all of which suffer from the HDSS problem. To evaluate the performance of all competing methods in these multiclass classification scenarios, we adopt the standard evaluation metrics, i.e., sensitivity, specificity, and Area Under the ROC Curve (AUC).
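For reference, the three metrics can be computed in a one-vs-rest fashion as sketched below. How the per-class values are aggregated across classes is not specified in this section, so the one-vs-rest treatment is an assumption of this sketch; the function names are hypothetical, and each class is assumed to occur at least once among both the positives and the negatives.

```python
from typing import Sequence, Tuple


def sens_spec(y_true: Sequence[int], y_pred: Sequence[int], cls: int) -> Tuple[float, float]:
    """One-vs-rest sensitivity and specificity for class `cls`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    tn = sum(t != cls and p != cls for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)


def binary_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy predictions for a 3-class problem (values are made up).
y_true = [0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 2]
sens1, spec1 = sens_spec(y_true, y_pred, cls=1)

# Toy scores for "class k versus the rest".
auc = binary_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

The pairwise definition of AUC used here is equivalent to the area under the empirical ROC curve and avoids constructing the curve explicitly.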
TABLE 1:
The average AUC scores [%] with the corresponding standard deviations of competing algorithms on all datasets, together with the dataset attributes (#sample, #feature, #class). The best results are highlighted in bold and the symbol • indicates that Dropfeature-DNNs obtains a better AUC than the corresponding method according to a pairwise t-test at the 95% confidence level.
Dataset | #sample | #feature | #class | DNNs | LASSO-SVM | GBFS-SVM | LASSO-DNNs | GBFS-DNNs | SNMF | CNN | Dropout | DNP | Dropfeature-DNNs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ALL | 248 | 12,558 | 6 (subtype) | 73.5±2.6• | 77.6±2.4• | 79.1±1.8• | 78.3±3.2• | 82.6±3.5• | 86.3±2.8• | 83.5±1.8• | 84.8±2.1• | 85.4±2.3• | 89.1±1.8 |
GCM | 198 | 16,063 | 14 (subtype) | 59.4±4.8• | 66.8±3.6• | 70.5±2.6• | 68.2±3.9• | 74.5±2.2• | 73.6±2.9• | 72.8±1.6• | 74.6±2.0• | 72.5±2.0• | 77.6±2.4 |
NCI Tumors | 174 | 12,533 | 11 (type) | 65.7±4.2• | 67.5±2.3• | 69.4±3.2• | 72.3±4.3• | 73.6±3.0• | 75.4±2.5 | 73.9±1.3• | 74.3±1.6• | 74.8±1.8• | 76.1±2.9 |
Lung Cancer | 203 | 12,600 | 5 (subtype) | 83.0±4.6• | 84.9±3.2• | 86.8±2.5• | 88.7±3.3• | 90.6±2.1• | 92.1±2.7• | 90.3±2.0• | 92.5±2.2• | 93.3±2.1• | 96.2±2.8 |
COAD | 358 | 20,531 | 4 (stage) | 63.8±1.6• | 60.1±2.2• | 61.5±3.3• | 66.3±4.8• | 72.5±3.1 | 75.9±3.0 | 72.8±2.5 | 73.5±1.8 | 74.2±1.9 | 73.3±2.5 |
ENCA | 232 | 20,484 | 4 (subtype) | 62.5±5.3• | 65.6±4.5• | 68.8±3.4• | 71.9±4.4• | 75.0±5.3• | 79.1±3.2• | 76.1±2.1• | 78.0±1.9• | 78.4±2.5• | 81.3±3.7 |
ENCA | 331 | 20,484 | 4 (stage) | 73.6±3.1• | 72.1±3.7• | 75.6±2.1• | 73.7±3.0• | 79.2±2.4• | 84.2±1.8• | 82.6±1.8• | 83.0±1.8• | 83.3±1.6• | 86.5±2.3 |
PRAD | 376 | 20,502 | 8 (subtype) | 71.7±3.9• | 72.8±1.8• | 74.1±4.0• | 75.6±2.4• | 78.5±1.5• | 80.3±1.9• | 79.6±1.8• | 80.8±1.7• | 81.0±1.8• | 82.3±1.5 |
4.2. Competing Algorithms
We compare Dropfeature-DNNs with multiple baseline methods: DNNs [46], LASSO [36], GBFS [37], SNMF [50], CNN [51], Dropout [52], and DNP [33]. Among them, LASSO [36] and GBFS [37] are two representative non-linear feature selection methods; SNMF [50] is a semi-supervised non-negative matrix factorization method; and DNNs [46], CNN [51], Dropout [52] and DNP [33] are deep learning based methods. In particular, DNP [33] is designed for handling HDSS data. In the implementation, we combine LASSO [36] and GBFS [37] with an SVM classifier (i.e., LASSO-SVM and GBFS-SVM using the RBF kernel), since SVM with the One-vs-Rest strategy is superior to other methods for multi-class classification [53], [54]. For more thorough comparisons, we also couple LASSO [36] and GBFS [37] with DNNs (i.e., LASSO-DNNs and GBFS-DNNs). In the experiments, we implement SNMF, CNN, DNP, and Dropfeature-DNNs in MATLAB. The MATLAB codes of DNNs, LASSO, GBFS, and Dropout are directly obtained from [46], [36], [37], and [55], respectively. The parameter values of all baselines are determined based upon the recommendations of the corresponding methods.
4.3. Experimental Settings
We report the average classification performance of all methods over one hundred replication runs. In each run, we employ five-fold cross-validation with approximately 48% of the data for training, 12% for validation, and 40% for testing. Through cross-validation in preliminary studies, we design a deep neural network with four layers: the number of neurons in the first and second layers is 1,000, and the number of neurons in the third and fourth layers is 100. We adopt the sigmoid activation function in each hidden layer and use the softmax function in the output layer. The weights of DNNs are updated through back-propagation and all layers are retrained using a global learning rate of 0.001, a regularization rate of 0.01, a Dropout [52] rate of 0.5, and c = 1, as selected by grid search. We set the maximal number of epochs for training DNNs to 100 and the number of iterations to 1,000. ADAM [56] is used as the optimizer to update the parameters of all the DNNs based methods. All the experiments are conducted using MATLAB R2017b on a Windows machine with a 3.70 GHz Intel(R) CPU and 64.0 GB RAM.
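One reading that reproduces the stated percentages is to hold out 40% of the samples for testing and run five-fold cross-validation on the remaining 60%, so that four folds (0.8 × 60% = 48%) are used for training and one fold (0.2 × 60% = 12%) for validation. The sketch below follows that reading; it is an inference from the numbers, not a description of the authors' code, and the function name `split_indices` is hypothetical.

```python
import random
from typing import List, Tuple


def split_indices(n: int, test_frac: float = 0.4, k: int = 5,
                  seed: int = 0) -> Tuple[List[int], List[Tuple[List[int], List[int]]]]:
    """Hold out a test set, then build k (train, validation) folds from the rest."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(n * test_frac)
    test, rest = idx[:n_test], idx[n_test:]
    folds = [rest[i::k] for i in range(k)]           # k nearly equal-sized folds
    splits = []
    for i in range(k):
        val = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        splits.append((train, val))
    return test, splits


test_idx, cv_splits = split_indices(250)
train_idx, val_idx = cv_splits[0]
```

For n = 250 this yields 100 test, 120 training and 30 validation samples, that is, 40%, 48% and 12% of the data.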
4.4. MSE of Dropfeature-DNNs
To determine the parameters of Dropfeature-DNNs and alleviate overfitting, we track the mean squared error (MSE) of Dropfeature-DNNs across epochs in the training and validation stages and identify the best training epoch for each dataset. For example, Fig. 2 presents the MSE of Dropfeature-DNNs across epochs on the “ALL” and “ENCA (stage)” datasets. After the 27th and 54th training epochs, the validation MSE curves on “ALL” and “ENCA (stage)”, respectively, are almost stable, whereas the training MSE still decreases, which indicates the onset of overfitting. As such, we choose the 27th and 54th epochs as the best training epochs of Dropfeature-DNNs on the “ALL” and “ENCA (stage)” datasets, respectively. A closer examination shows that at the 27th (or 54th) epoch, Dropfeature-DNNs obtains 95.6% (or 91.5%) training and 93.3% (or 88.4%) validation AUC scores on the “ALL” (or “ENCA (stage)”) dataset. We use the same procedure to determine the best training epoch for Dropfeature-DNNs on the other datasets.
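The epoch selection described above is done by inspecting the curves. A simple automated stand-in, given here only as a hypothetical rule, is to return the first epoch after which the validation MSE changes by less than a tolerance for several consecutive epochs; the function name `stable_epoch`, the tolerance and the patience are assumptions of this sketch.

```python
from typing import Sequence


def stable_epoch(val_mse: Sequence[float], tol: float = 1e-3, patience: int = 3) -> int:
    """Return the 1-based first epoch from which the validation MSE changes by less
    than `tol` over `patience` consecutive steps; fall back to the last epoch."""
    for e in range(len(val_mse) - patience):
        window = val_mse[e:e + patience + 1]
        if all(abs(window[i + 1] - window[i]) < tol for i in range(patience)):
            return e + 1
    return len(val_mse)


# Made-up validation curve that flattens from the fourth epoch onward.
curve = [0.50, 0.30, 0.20, 0.15, 0.15, 0.15, 0.15]
best = stable_epoch(curve)
```

Stopping at the point where the validation curve flattens while the training error keeps falling is the usual early-stopping heuristic, which matches the reasoning given in the text.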
Fig. 2:
The dynamic MSE performance of Dropfeature-DNNs in both training and validation stages across different training epochs.
4.5. AUC, Sensitivity, and Specificity Comparisons
Table 1 presents the average AUC scores of the compared algorithms on seven genomic datasets. Dropfeature-DNNs achieves the highest AUC on all datasets except “COAD”, where SNMF leads. This indicates that our proposed method is effective in cancer subtype and/or tumor stage prediction using HDSS data. The benefits of Dropfeature-DNNs can be ascribed to its embedded mechanism for maximizing the classification AUC over multiple iterations while training DNNs. SNMF leads the other baselines and is competitive with Dropfeature-DNNs, as it is powered by semi-supervised learning and NMF. Compared to the conventional methods (i.e., LASSO-SVM and GBFS-SVM), the deep learning based methods are more capable of accurately predicting various cancer subtypes and/or tumor stages on these HDSS datasets. This is particularly the case for the “Lung Cancer” and “ENCA (stage)” datasets with larger sets of genes but fewer samples. Specifically, with respect to LASSO-SVM and GBFS-SVM, DNP (13.3% and 9.7%), Dropout (13.0% and 9.5%), CNN (11.3% and 7.8%), LASSO-DNNs (8.3% and 4.6%), and GBFS-DNNs (9.8% and 6.5%) achieve higher AUC scores, as indicated by the corresponding percent increases listed in parentheses. In addition, DNP achieves slightly higher AUC scores than Dropout and CNN since it averages over multiple dropouts for DNNs training. Although SNMF, DNP, Dropout and CNN obtain promising results, Dropfeature-DNNs is still superior to them, achieving a 3.4% higher AUC score.
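The percent increases quoted above appear to be consistent with relative increases in the AUC averaged over the eight rows of Table 1. That reading is an inference from the reported numbers, not something stated explicitly in the text; the sketch below checks it for DNP using the values copied from Table 1, and the helper names are hypothetical.

```python
from typing import Dict, List

# Mean AUC [%] per row of Table 1, in the order ALL, GCM, NCI Tumors, Lung Cancer,
# COAD, ENCA (subtype), ENCA (stage), PRAD.
AUC: Dict[str, List[float]] = {
    "LASSO-SVM": [77.6, 66.8, 67.5, 84.9, 60.1, 65.6, 72.1, 72.8],
    "GBFS-SVM":  [79.1, 70.5, 69.4, 86.8, 61.5, 68.8, 75.6, 74.1],
    "DNP":       [85.4, 72.5, 74.8, 93.3, 74.2, 78.4, 83.3, 81.0],
}


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


def relative_increase(method: str, baseline: str) -> float:
    """Percent increase of the mean AUC of `method` over that of `baseline`."""
    return 100.0 * (mean(AUC[method]) / mean(AUC[baseline]) - 1.0)


dnp_vs_lasso = relative_increase("DNP", "LASSO-SVM")
dnp_vs_gbfs = relative_increase("DNP", "GBFS-SVM")
```

Under this reading the two values round to 13.3% and 9.7%, matching the figures reported for DNP.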
We also report the sensitivity and specificity of each algorithm in Table 2, where a higher sensitivity (or specificity) corresponds to fewer false negatives (or false positives). In terms of sensitivity, Dropfeature-DNNs leads on every dataset, suggesting that our proposed method produces fewer false negatives. For specificity, Dropfeature-DNNs outperforms the baselines on all the datasets except “NCI Tumors”, “Lung Cancer”, and “COAD”, indicating that our proposed method tends to produce fewer false positives. Statistically, Dropfeature-DNNs achieves a significantly higher sensitivity (by 10.6%) and specificity (by 7.8%) than the baseline methods.
TABLE 2:
The average sensitivity/specificity (sens./spec.) scores of the competing algorithms on all datasets. The best results are highlighted in bold, and the symbol • indicates that Dropfeature-DNNs obtains a better sensitivity/specificity than the corresponding method according to a pairwise t-test at the 95% confidence level.
Dataset/Method (sens./spec. [%]) | DNNs | LASSO-SVM | GBFS-SVM | LASSO-DNNs | GBFS-DNNs | SNMF | CNN | Dropout | DNP | Dropfeature-DNNs |
---|---|---|---|---|---|---|---|---|---|---|
ALL | 70.5•/74.6• | 75.2•/78.7• | 76.6•/81.3• | 77.4•/79.8• | 80.2•/84.0• | 87.8/86.4• | 82.3•/83.5• | 85.8•/84.7• | 86.3•/85.7• | 88.6/90.8 |
GCM | 55.8•/57.3• | 64.6•/70.4• | 71.2•/67.5• | 69.5•/65.7• | 71.3•/72.6• | 71.9•/74.4• | 70.3•/71.2• | 71.0•/73.6• | 70.4•/75.2• | 73.8/78.5 |
NCI Tumors | 63.4•/66.5• | 63.8•/69.6• | 65.7•/69.3• | 70.1•/72.8 | 71.8•/70.3• | 75.1•/72.3 | 72.1•/71.3 | 72.5•/73.8 | 73.6•/70.1• | 78.6/71.2 |
Lung Cancer | 81.5•/85.6• | 80.8•/86.7• | 84.2•/88.5• | 87.9•/86.4• | 87.8•/90.9• | 91.5•/92.6 | 88.5•/89.3• | 90.6•/92.0 | 92.1•/91.5 | 94.7/92.2 |
COAD | 60.8•/62.3• | 58.7•/61.3• | 59.1•/61.8• | 63.2•/64.9• | 72.4•/70.5 | 73.5/76.8 | 70.4•/71.1 | 71.8•/70.3 | 72.3•/68.4 | 74.6/69.1 |
ENCA (subtype) | 59.4•/61.7• | 62.6•/67.3• | 64.6•/63.2• | 69.2•/71.8• | 72.6/73.9• | 70.5•/71.3• | 71.5•/69.6• | 72.0•/71.3• | 71.4•/69.5• | 73.6/75.9 |
ENCA (stage) | 69.7•/71.8• | 70.3•/72.5• | 71.6•/74.3• | 70.5•/75.6• | 73.4•/75.8• | 78.2•/77.6• | 75.5•/76.3• | 75.2•/78.8• | 76.6•/79.1• | 81.5/83.8 |
PRAD | 68.1•/69.2• | 68.8•/70.5• | 71.3•/75.1• | 73.6•/76.8• | 75.2•/77.3• | 75.8•/78.4• | 74.4•/73.6• | 76.0•/78.4• | 76.4•/80.2• | 80.6/82.9 |
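To make the two measures in Table 2 concrete, the following is a minimal sketch of how sensitivity and specificity can be computed for a multi-class problem. It assumes a one-vs-rest treatment of each class, which is a common convention but is not stated in the text; the function name `per_class_sens_spec` and the toy labels are hypothetical and are not taken from the experiments.

```python
from typing import Dict, Sequence, Tuple


def per_class_sens_spec(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Dict[int, Tuple[float, float]]:
    """Return {class: (sensitivity, specificity)} using a one-vs-rest view.

    Assumption: each class is scored against all other classes pooled
    together; the paper does not specify its multi-class convention.
    """
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for c in labels:
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        tn = sum(1 for t, p in pairs if t != c and p != c)  # true negatives
        sens = tp / (tp + fn) if (tp + fn) else float("nan")
        spec = tn / (tn + fp) if (tn + fp) else float("nan")
        scores[c] = (sens, spec)
    return scores


# Toy labels for illustration only (not data from the paper).
toy_true = [0, 0, 0, 1, 1, 2]
toy_pred = [0, 0, 1, 1, 1, 0]
toy_scores = per_class_sens_spec(toy_true, toy_pred)
```

Fewer false negatives raise the first value of each pair and fewer false positives raise the second, which is the reading applied to Table 2.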
4.6. Confusion Matrix and ROC Curve Comparisons
In addition to the aforementioned quantitative evaluations, we also examine the confusion matrices and ROC curves generated by each method on all the datasets in a single replication run. Take the “ALL” dataset as an example. Fig. 3 presents the corresponding results of all the methods for the six-subtype classification (i.e., subtype 1: BCR-ABL; subtype 2: E2A-PBX1; subtype 3: Hyperdiploid; subtype 4: MLL; subtype 5: T-ALL; and subtype 6: TEL-AML1). Several observations can be summarized as follows. First, although LASSO-SVM, GBFS-SVM and GBFS-DNNs can accurately classify most of the majority cancer subtypes (i.e., subtypes 3, 5 and 6), they fail to correctly classify some of the minority ones (i.e., subtypes 1, 2 and 4). For example, LASSO-SVM and GBFS-DNNs misclassify all the patients within subtype 1, and GBFS-SVM misses all patients in subtype 4. In contrast, LASSO-DNNs, SNMF, Dropout, CNN, DNP and Dropfeature-DNNs perform better in minority subtype classification. Second, compared to LASSO-DNNs, SNMF, DNP, Dropout and CNN, Dropfeature-DNNs achieves much higher precision in classifying most of the majority subtypes, though its prediction accuracy on minority subtypes is slightly inferior to that of CNN and Dropout. Third, the ROC curves achieved by Dropfeature-DNNs for subtypes 2–6 consistently enclose those achieved by the other methods, as Dropfeature-DNNs tends to obtain lower False Positive Rates and higher True Positive Rates. Fig. 4 further quantitatively validates the efficacy of Dropfeature-DNNs, as it consistently achieves higher subtype-specific AUC scores than the competing methods.
Fig. 3:
Confusion matrices and ROC curves of all methods obtained on the “ALL” dataset for the six-subtype classification.
Fig. 4:
Subtype-specific AUC scores achieved by all the methods on the “ALL” dataset.
Similarly, Fig. 5 presents the confusion matrices and ROC curves obtained by all the methods on “ENCA (stage)” for the prediction of four tumor stages. It is evident that although almost all the methods can accurately classify the majority classes (i.e., stages I and III), three methods (i.e., DNNs, LASSO-SVM, and GBFS-SVM) fail to classify samples belonging to the minority classes (i.e., stages II and IV). Meanwhile, LASSO-DNNs, GBFS-DNNs, SNMF, CNN, Dropout, and DNP misclassify most of the patients belonging to stages III and IV as stage I. Dropfeature-DNNs, on the other hand, obtains much higher precision than LASSO-DNNs, GBFS-DNNs and DNP in classifying patients belonging to stages III and IV. This property is very desirable, as misclassifying patients in a late tumor stage would delay diagnosis and treatment, leading to increased morbidity. As shown by the stage-specific (i.e., class-specific) ROC curves, Dropfeature-DNNs also consistently outperforms the baselines by achieving relatively lower False Positive Rates and higher True Positive Rates. The stage-specific AUC comparison presented in Fig. 6 further demonstrates the merits of Dropfeature-DNNs.
Fig. 5:
Confusion matrices and ROC curves of all methods obtained on the “ENCA (stage)” dataset for the prediction of four tumor stages.
Fig. 6:
Stage-specific AUC scores achieved by all the methods on the “ENCA (stage)” dataset.
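The subtype-specific and stage-specific AUC scores discussed in this section can be illustrated with a short sketch. It assumes that each class is scored one-vs-rest from the predicted class probabilities, and it uses the rank-based definition of AUC (the probability that a positive sample receives a higher score than a negative one, with ties counted as one half). The function names and the toy probabilities are hypothetical; the text does not describe the exact procedure used to produce Figs. 4 and 6.

```python
from typing import List, Sequence


def binary_auc(scores: Sequence[float], is_positive: Sequence[bool]) -> float:
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(
    prob_rows: Sequence[Sequence[float]], y_true: Sequence[int], n_classes: int
) -> List[float]:
    """Class-specific AUCs: class c is the positive class, all others negative."""
    return [
        binary_auc([row[c] for row in prob_rows], [t == c for t in y_true])
        for c in range(n_classes)
    ]


# Toy predicted probabilities for three classes (illustration only).
toy_probs = [
    [0.7, 0.2, 0.1],
    [0.4, 0.5, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
    [0.5, 0.2, 0.3],
]
toy_labels = [0, 0, 1, 2, 2]
toy_aucs = one_vs_rest_auc(toy_probs, toy_labels, n_classes=3)
```

The pairwise loop is quadratic in the number of samples, which is acceptable for small-sample data of the kind considered here; a sort-based rank formula would be preferable for larger cohorts.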
4.7. AUC Comparisons with Varied Selected Features
Fig. 7 presents the AUC curves of all the methods as the number of selected features is varied in the range of [100, 10,000] and [500, 20,000] on the “ALL” and “ENCA (stage)” datasets, respectively. Overall, we observe that Dropfeature-DNNs consistently achieves higher AUC scores than the other seven (deep) feature selection methods. Specifically, on the “ALL” dataset (Fig. 7(a)), LASSO-SVM, LASSO-DNNs, CNN, and Dropfeature-DNNs achieve their best AUCs when the number of selected features is nearly 1,000, while GBFS-SVM, GBFS-DNNs, and Dropout obtain their best AUCs when the number of selected features is nearly 2,000. When the number of selected features increases from 2,000 to 10,000 (at which point nearly all features of the “ALL” dataset are used), the AUC scores of all methods stabilize and eventually become almost identical, except for that of LASSO-SVM. With only 100 features, Dropfeature-DNNs still achieves a significantly higher AUC than the baselines other than DNP. This indicates that Dropfeature-DNNs tends to achieve higher AUC scores while using fewer features. Similar results are observed on the “ENCA (stage)” dataset (Fig. 7(b)).
Fig. 7:
AUC comparisons with varied selected features.
4.8. Dynamically Selected Genes and Functional Enrichment Analysis
Fig. 8 presents the genes (highlighted in grey) dynamically selected by Dropfeature-DNNs from the “PRAD” dataset across different replication runs. It is clear that genes such as ATM, FOXA1, MED12, ERG, SPOP, PTEN, PIK3CD and TP53 are selected multiple times as the common genes in PRAD, and those genes have been reported to play significant roles in prostate tumorigenesis [57], [58]. In addition, genes such as CRK, BCL6 and NF1, which have no documented functional evidence in this disease, are discarded by Dropfeature-DNNs. Eventually, using majority voting, we obtain 545 genes selected by Dropfeature-DNNs for the prediction of prostate cancer subtypes.
Fig. 8:
The dynamically selected genes (highlighted in grey) and dropped genes (in white) by Dropfeature-DNNs from the “PRAD” dataset across different replication runs.
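The majority-voting step that consolidates the genes selected across replication runs can be sketched as follows. This is a minimal illustration that assumes a strict majority rule (a gene is kept if it is selected in more than half of the runs); the exact voting threshold is not given in the text. The function name `majority_vote` and the placeholder gene identifiers are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Sequence


def majority_vote(selected_per_run: Sequence[Iterable[str]]) -> List[str]:
    """Keep the genes selected in strictly more than half of the runs.

    Assumption: strict-majority threshold; the paper only says that
    majority voting is used.
    """
    # Count each gene at most once per run.
    counts = Counter(g for run in selected_per_run for g in set(run))
    threshold = len(selected_per_run) / 2
    return sorted(g for g, k in counts.items() if k > threshold)


# Placeholder gene identifiers, not selections reported in the paper.
toy_runs = [
    {"gene_a", "gene_b", "gene_c"},
    {"gene_a", "gene_c", "gene_d", "gene_e"},
    {"gene_a", "gene_b", "gene_f"},
]
toy_consensus = majority_vote(toy_runs)
```

With three runs the threshold is 1.5, so a gene must appear in at least two runs to be retained.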
For these genes, we further perform functional enrichment analysis using the DAVID tool [59]. We find that 42 KEGG pathways are over-represented by those genes, with Benjamini-Hochberg (BH) adjusted enrichment p-values of less than 0.01. The profile of the top 20 significant pathways is depicted in Fig. 9. They include a few cancer pathways that are critical to prostate cancer progression, clinical management and therapy, such as the “TGF-β signaling pathway” [60], [61], the “Wnt signaling pathway” [62], [63], [64] and others.
Fig. 9:
The top 20 KEGG pathways over-represented by the 545 feature genes selected by Dropfeature-DNNs for predicting prostate cancer subtypes.
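For readers unfamiliar with the Benjamini-Hochberg adjustment applied to the enrichment p-values, the sketch below implements the standard step-up procedure. It is an independent illustration, not DAVID's implementation, and the toy p-values are invented for the example; only the 0.01 cutoff comes from the text.

```python
from typing import List, Sequence


def bh_adjust(pvalues: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented raw p-values for five hypothetical pathways.
toy_p = [0.001, 0.008, 0.039, 0.041, 0.2]
toy_adj = bh_adjust(toy_p)
# Pathways passing the 0.01 cutoff used in the text.
toy_significant = [i for i, q in enumerate(toy_adj) if q < 0.01]
```

Each raw p-value is scaled by m/rank, and the running minimum keeps the adjusted values non-decreasing in the raw p-value, so a pathway is never ranked as more significant than one with a smaller raw p-value.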
4.9. AUC Comparisons with Varied Training Proportions
Fig. 10 shows the AUC comparisons as the ratio of training data (i.e., the training proportion) is varied in the range of [0.1, 0.9] on the “ALL” and “ENCA (stage)” datasets. As expected, the AUC scores of all the methods improve as the training proportion increases. Among them, Dropfeature-DNNs almost consistently leads the curves of all competing algorithms, followed by SNMF, DNP, Dropout, CNN, and GBFS-DNNs. This demonstrates that the non-linear feature selection methods for training DNNs are superior to the conventional ones such as LASSO-SVM, GBFS-SVM and LASSO-DNNs. DNNs, designed for large-scale data, yields the worst results since it uses all the features. The inferior performance of LASSO-SVM and GBFS-SVM is primarily due to insufficient training; however, they are competitive with LASSO-DNNs when given more training samples. Lastly, when only 10–30% of the samples are available for training, SNMF leads the curves. In this case, Dropfeature-DNNs still obtains slightly higher AUCs than the other SVM- and DNN-based methods, including CNN, Dropout, and DNP.
Fig. 10:
AUC comparisons with varied training proportions.
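One way to realize a given training proportion on small, imbalanced data is a class-stratified split, sketched below. Stratification is an assumption made for this illustration; the text does not state how the training and test samples are drawn. The function name `stratified_split`, the seed, and the toy labels are hypothetical.

```python
import random
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[Hashable], train_proportion: float, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split sample indices so each class keeps roughly the same proportion.

    Labels must be mutually sortable (e.g., all ints or all strings).
    """
    if not 0.0 < train_proportion < 1.0:
        raise ValueError("train_proportion must lie strictly between 0 and 1")
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        n = len(idxs)
        k = round(train_proportion * n)
        # Keep at least one sample per class on each side when possible;
        # a class with a single sample goes to the training set.
        k = max(1, min(n - 1, k)) if n > 1 else 1
        train.extend(idxs[:k])
        test.extend(idxs[k:])
    return sorted(train), sorted(test)


# Toy labels: 10 samples of class 0 and 5 of class 1 (illustration only).
toy_labels = [0] * 10 + [1] * 5
toy_train, toy_test = stratified_split(toy_labels, train_proportion=0.6)
```

Sweeping `train_proportion` over values such as 0.1, 0.2, ..., 0.9 and evaluating each split would produce curves of the kind compared in Fig. 10.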
4.10. Running Time Comparison
Fig. 11 presents the average elapsed time over one hundred replication runs for all deep learning based methods on the seven datasets. Among all the algorithms, DNNs using all the original features consumes the most time, followed by CNN, which spends much of its time computing convolutions, and GBFS-DNNs with the modified gradient boosted trees. LASSO-DNNs is faster than GBFS-DNNs since the ratio of the numbers of features selected by LASSO and GBFS is approximately 2:3 when training DNNs. Dropfeature-DNNs is the most efficient algorithm on all datasets, and it is slightly faster than Dropout and DNP. The reasons are two-fold: first, the number of its dynamically selected features is dramatically smaller than that of the original ones; and second, the update process of Dropfeature-DNNs is synchronous with the training procedure of DNNs. Quantitatively, the time consumption of Dropfeature-DNNs is around 78.5% and 76.3% of that of Dropout and DNP, respectively.
Fig. 11:
The average running time of one hundred replication runs for all compared methods.
4.11. Parameter Sensitivity Analysis
Since the learning rate of Dropfeature-DNNs is iteratively updated by the ADAM optimizer [56], we conduct the sensitivity analysis on the regularization parameter η and the model parameter c of Dropfeature-DNNs. To investigate how variations of these two parameters affect the performance of Dropfeature-DNNs, we set their values by grid search. That is, η is selected from [10^{-4}, 5 × 10^{-4}, 10^{-3}, 5 × 10^{-3}, 10^{-2}, 5 × 10^{-2}, 10^{-1}], and c is selected from [0.1, 0.5, 1, 5, 10, 20, 50, 100]. As shown in Fig. 12, we compare the classification AUC scores on the “ALL” dataset when varying these two parameters. It is evident that when c is fixed and η is varied, the performance of Dropfeature-DNNs is relatively stable with some small variation. Setting the regularization parameter η = 10^{-3} or 10^{-2} could improve the performance of Dropfeature-DNNs. Moreover, when η is fixed, we observe that a larger c yields a slightly higher AUC. The reason is that a larger c gives Dropfeature-DNNs a higher probability of escaping a local minimum. When the value of c is in the range of [0.5, 20], the performance of Dropfeature-DNNs is fairly stable. Hence, Dropfeature-DNNs is relatively robust to both parameters. Similar results are obtained on the other datasets.
Fig. 12:
The sensitivity analysis of regularization and model parameters of Dropfeature-DNNs on the “ALL” dataset.
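The grid over η and c described above can be enumerated as in the following sketch. The two value lists are taken from the text; everything else is an assumption made for illustration. In particular, `toy_score` is a hypothetical placeholder for training Dropfeature-DNNs and measuring its validation AUC, and its values carry no empirical meaning.

```python
import math
from itertools import product
from typing import Callable, Dict, Sequence, Tuple

# Candidate values listed in the text.
ETA_GRID = [1e-4, 5e-4, 1e-3, 5e-3, 1e-2, 5e-2, 1e-1]
C_GRID = [0.1, 0.5, 1, 5, 10, 20, 50, 100]


def grid_search(
    evaluate: Callable[[float, float], float],
    eta_grid: Sequence[float] = ETA_GRID,
    c_grid: Sequence[float] = C_GRID,
) -> Tuple[Tuple[float, float], Dict[Tuple[float, float], float]]:
    """Score every (eta, c) pair and return the best pair plus all scores."""
    results = {(eta, c): evaluate(eta, c) for eta, c in product(eta_grid, c_grid)}
    best = max(results, key=results.get)
    return best, results


def toy_score(eta: float, c: float) -> float:
    """Hypothetical stand-in for a validation AUC; NOT a result from the paper."""
    return -abs(math.log10(eta) + 3.0) - abs(c - 10.0) / 100.0


best_pair, all_scores = grid_search(toy_score)
```

In practice `evaluate` would train the model with the given (η, c) and return an AUC estimated on held-out data, so that the 7 × 8 = 56 combinations can be compared as in Fig. 12.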
5. Conclusion
Accurate stratification of cancer patients regarding tumor subtypes and/or stages based on genomic profiles is critical for cancer diagnosis, treatment and management. Therefore, there is an urgent need to develop effective machine learning techniques to facilitate this process. In this paper, we propose a DNN-based algorithm driven by a simple yet effective embedded feature selection technique, i.e., Dropfeature-DNNs, to classify cancer subtypes and/or tumor stages using HDSS genomic data. We introduce an element-wise multiplication procedure to heuristically reduce the original feature space and dynamically determine an “optimal” feature subset that maximizes the AUC via iterative DNN training. By doing so, Dropfeature-DNNs is able not only to alleviate overfitting and reduce the variance of gradients due to the limited sample size, but also to capture the non-linear interactions among features via abstract high-level representations, leading to enhanced generalization capacity. Meticulous analysis of feature subset convergence and error bound derivation provide theoretical guarantees for the proposed method. Using multiple HDSS gene expression datasets from various cancer types, large-scale empirical comparisons to existing baseline methods further demonstrate the value of Dropfeature-DNNs in cancer subtype and/or stage prediction. As part of future work, we plan to extend Dropfeature-DNNs to integrate multi-view and/or multi-source HDSS omics data for patient prognosis analysis.
Acknowledgments
The authors thank the editor, three reviewers and Dr. Andrea Edwards for their critical, constructive, and editorial comments and suggestions. This work is partly supported by the funding from the NIH grants 2U54MD007595, 5P20GM103424 and U19AG055373, and a DOD ARO grant W911NF-20-1-0249. The contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIH and DOD.
Biographies
Zhong Chen is a computational scientist in the Department of Computer Science, Xavier University of Louisiana. He received his Ph.D. in Computer Science from Wuhan University of Technology in 2015. His main research interests include deep learning, machine learning, bioinformatics and image processing. He has published 13 peer-reviewed articles in scientific journals such as Pattern Recognition Letters, Neurocomputing, Image and Vision Computing, and Frontiers in Oncology.
Wensheng Zhang is a bioinformatics scientist at the Xavier RCMI Center for Cancer Research. He received his Ph.D. in statistical genetics from the University of Georgia in 2004. His research interests include quantitative genetics, bioinformatics, cancer genomics and genomic epidemiology. Since 2003, he has published 25 first-authored papers in peer-reviewed journals and proceedings such as Bioinformatics, BMC Genomics, and Cancer Epidemiology, Biomarkers & Prevention.
Hongwen Deng is the Edward G. Schlieder Endowed Chair professor and the Director of Center for Bioinformatics and Genomics, Department of Global Biostatistics and Data Science, School of Public Health and Tropical Medicine, Tulane University. He has extensive multi-/inter-disciplinary expertise in biostatistics and bioinformatics methodology research, genomics, genetic epidemiology, complex traits and diseases. His work is published in nearly 600 peer-reviewed publications including journals such as Nature and New England Journal of Medicine.
Kun Zhang is a professor of Computer Science, Xavier University of Louisiana. She received her Ph.D. in computer science with a concentration in machine learning and data mining from Tulane University in 2006. Her research interests include bioinformatics, machine learning and big data analytics. Her collaborative work with others received the IEEE ICDM’06 Best Application Paper Award, IEEE ICDM’08 Contest Crown Award and IEEE ICDM’10 Top-10 Data Mining Case Studies. She has published over 70 peer-reviewed articles in Nucleic Acids Research and Bioinformatics among others.
Contributor Information
Zhong Chen, Department of Computer Science, Bioinformatics Core of Xavier RCMI Center for Cancer Research, Xavier University of Louisiana, New Orleans, LA 70125, USA.
Wensheng Zhang, Department of Computer Science, Bioinformatics Core of Xavier RCMI Center for Cancer Research, Xavier University of Louisiana, New Orleans, LA 70125, USA.
Hongwen Deng, Center for Bioinformatics and Genomics, Tulane University, New Orleans, LA 70112, USA.
Kun Zhang, Department of Computer Science, Bioinformatics Core of Xavier RCMI Center for Cancer Research, Xavier University of Louisiana, New Orleans, LA 70125, USA.
References
- [1] Siegel RL, Miller KD, and Jemal A, “Cancer statistics, 2020,” CA: A Cancer Journal for Clinicians, vol. 70, no. 1, pp. 7–30, 2020.
- [2] Abeshouse A, Ahn J, Akbani R, Ally A, Amin S, Andry CD, Annala M, Aprikian A, Armenia J, Arora A, and Auman JT, “The molecular taxonomy of primary prostate cancer,” Cell, vol. 163, no. 4, pp. 1011–1025, 2015.
- [3] Edge SB and Compton CC, “The American Joint Committee on Cancer: the 7th edition of the AJCC cancer staging manual and the future of TNM,” Annals of Surgical Oncology, vol. 17, no. 6, pp. 1471–1474, 2010.
- [4] Giuliano AE, Connolly JL, Edge SB, Mittendorf EA, Rugo HS, Solin LJ, Weaver DL, Winchester DJ, and Hortobagyi GN, “Breast cancer-major changes in the American Joint Committee on Cancer eighth edition cancer staging manual,” CA: A Cancer Journal for Clinicians, vol. 67, no. 4, pp. 290–303, 2017.
- [5] Buyyounouski MK, Choyke PL, McKenney JK, Sartor O, Sandler HM, Amin MB, Kattan MW, and Lin DW, “Prostate cancer-major changes in the American Joint Committee on Cancer eighth edition cancer staging manual,” CA: A Cancer Journal for Clinicians, vol. 67, no. 3, pp. 245–253, 2017.
- [6] Sinicrope FA, Shi Q, Smyrk TC, Thibodeau SN, Dienstmann R, Guinney J, Bot BM, Tejpar S, Delorenzi M, Goldberg RM, and Mahoney M, “Molecular markers identify subtypes of stage III colon cancer associated with patient outcomes,” Gastroenterology, vol. 148, no. 1, pp. 88–99, 2015.
- [7] Renfro LA, An MW, and Mandrekar SJ, “Precision oncology: A new era of cancer clinical trials,” Cancer Letters, vol. 387, pp. 121–126, 2017.
- [8] Dienstmann R, Vermeulen L, Guinney J, Kopetz S, Tejpar S, and Tabernero J, “Consensus molecular subtypes and the evolution of precision medicine in colorectal cancer,” Nature Reviews Cancer, vol. 17, no. 2, pp. 79–92, 2017.
- [9] Trail PA, Dubowchik GM, and Lowinger TB, “Antibody drug conjugates for treatment of breast cancer: novel targets and diverse approaches in ADC design,” Pharmacology & Therapeutics, vol. 181, pp. 126–142, 2018.
- [10] Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, and Bloomfield CD, “Molecular classification of cancer: class discovery and class prediction by gene expression monitoring,” Science, vol. 286, no. 5439, pp. 531–537, 1999.
- [11] West M, Blanchette C, Dressman H, Huang E, Ishida S, Spang R, Zuzan H, Olson JA, Marks JR, and Nevins JR, “Predicting the clinical status of human breast cancer by using gene expression profiles,” Proceedings of the National Academy of Sciences, vol. 98, no. 20, pp. 11462–11467, 2001.
- [12] Zuo T, Zeng H, Li H, Liu S, Yang L, Xia C, Zheng R, Ma F, Liu L, Wang N, and Xuan L, “The influence of stage at diagnosis and molecular subtype on breast cancer patient survival: a hospital-based multi-center study,” Chinese Journal of Cancer, vol. 36, no. 1, p. 84, 2017.
- [13] Aliferis CF, Tsamardinos I, Massion PP, Statnikov AR, Fananapazir N, and Hardin DP, “Machine learning models for classification of lung cancer and selection of genomic markers using array gene expression data,” International FLAIRS Conference, pp. 67–71, 2003.
- [14] Zhang W, Robbins K, Wang Y, Bertrand K, and Rekaya R, “A jackknife-like method for classification and uncertainty assessment of multi-category tumor samples using gene expression information,” BMC Genomics, vol. 11, no. 1, p. 273, 2010.
- [15] Blaveri E, Simko JP, Korkola JE, Brewer JL, Baehner F, Mehta K, DeVries S, Koppie T, Pejavar S, Carroll P, and Waldman FM, “Bladder cancer outcome and subtype classification by gene expression,” Clinical Cancer Research, vol. 11, no. 11, pp. 4044–4055, 2005.
- [16] West L, Vidwans SJ, Campbell NP, Shrager J, Simon GR, Bueno R, Dennis PA, Otterson GA, and Salgia R, “A novel classification of lung cancer into molecular subtypes,” PLoS One, vol. 7, no. 2, e31906, 2012.
- [17] Leong HS, Galletta L, Etemadmoghadam D, George J, Australian Ovarian Cancer Study, Köbel M, Ramus SJ, and Bowtell D, “Efficient molecular subtype classification of high-grade serous ovarian cancer,” The Journal of Pathology, vol. 236, no. 3, pp. 272–277, 2015.
- [18] Ramos-González J, López-Sánchez D, Castellanos-Garzón JA, de Paz JF, and Corchado JM, “A CBR framework with gradient boosting based feature selection for lung cancer subtype classification,” Computers in Biology and Medicine, vol. 86, pp. 98–106, 2017.
- [19] Paner GP, Stadler WM, Hansel DE, Montironi R, Lin DW, and Amin MB, “Updates in the eighth edition of the tumor-node-metastasis staging classification for urologic cancers,” European Urology, vol. 73, no. 4, pp. 560–569, 2018.
- [20] Niitani Y, Ogawa T, Saito S, and Saito M, “Chainercv: a library for deep learning in computer vision,” Proceedings of the 25th ACM International Conference on Multimedia, pp. 1217–1220, 2017.
- [21] Young T, Hazarika D, Poria S, and Cambria E, “Recent trends in deep learning based natural language processing,” IEEE Computational Intelligence Magazine, vol. 13, no. 3, pp. 55–75, 2018.
- [22] Alipanahi B, Delong A, Weirauch MT, and Frey BJ, “Predicting the sequence specificities of DNA-and RNA-binding proteins by deep learning,” Nature Biotechnology, vol. 33, no. 8, pp. 831–838, 2015.
- [23] Diao JA, Kohane IS, and Manrai AK, “Biomedical informatics and machine learning for clinical genomics,” Human Molecular Genetics, vol. 27, no. R1, pp. R29–R34, 2018.
- [24] Stead WW, “Clinical implications and challenges of artificial intelligence and deep learning,” JAMA, vol. 320, no. 11, pp. 1107–1108, 2018.
- [25] Suárez-Paniagua V, Zavala RM, Segura-Bedmar I, and Martínez P, “A two-stage deep learning approach for extracting entities and relationships from medical texts,” Journal of Biomedical Informatics, vol. 99, p. 103285, 2019.
- [26] Kumar A, Singh SK, Saxena S, Lakshmanan K, Sangaiah AK, Chauhan H, Shrivastava S, and Singh RK, “Deep feature learning for histopathological image classification of canine mammary tumors and human breast cancer,” Information Sciences, vol. 508, pp. 405–421, 2020.
- [27] Jiao W, Atwal G, Polak P, Karlic R, Cuppen E, Danyi A, De Ridder J, van Herpen C, Lolkema MP, Steeghs N, and Getz G, “A deep learning system accurately classifies primary and metastatic cancers using passenger mutation patterns,” Nature Communications, vol. 11, no. 1, pp. 1–12, 2020.
- [28] LeCun Y, Bengio Y, and Hinton G, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
- [29] Li Y, Chen CY, and Wasserman WW, “Deep feature selection: theory and application to identify enhancers and promoters,” Journal of Computational Biology, vol. 23, no. 5, pp. 322–336, 2016.
- [30] Ibrahim R, Yousri NA, Ismail MA, and El-Makky NM, “Multi-level gene/MiRNA feature selection using deep belief nets and active learning,” International Conference of the IEEE Engineering in Medicine and Biology Society, pp. 3957–3960, 2014.
- [31] Guyon I, Weston J, Barnhill S, and Vapnik V, “Gene selection for cancer classification using support vector machines,” Machine Learning, vol. 46, no. 1–3, pp. 389–422, 2002.
- [32] Fakoor R, Ladhak F, Nazi A, and Huber M, “Using deep learning to enhance cancer diagnosis and classification,” International Conference on Machine Learning, 2013.
- [33] Liu B, Wei Y, Zhang Y, and Yang Q, “Deep neural networks for high dimension, low sample size data,” International Joint Conference on Artificial Intelligence, pp. 2287–2293, 2017.
- [34] Kononenko I, “Estimating attributes: analysis and extensions of RELIEF,” European Conference on Machine Learning, pp. 171–182, 1994.
- [35] Gu Q, Li Z, and Han J, “Generalized fisher score for feature selection,” arXiv preprint, arXiv:1202.3725, 2012.
- [36] Yamada M, Jitkrittum W, Sigal L, Xing EP, and Sugiyama M, “High-dimensional feature selection by feature-wise kernelized lasso,” Neural Computation, vol. 26, no. 1, pp. 185–207, 2014.
- [37] Xu Z, Huang G, Weinberger KQ, and Zheng AX, “Gradient boosted feature selection,” ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 522–531, 2014.
- [38] Taherkhani A, Cosma G, and McGinnity TM, “Deep-FS: A feature selection algorithm for Deep Boltzmann Machines,” Neurocomputing, vol. 322, pp. 22–37, 2018.
- [39] Scardapane S, Comminiello D, Hussain A, and Uncini A, “Group sparse regularization for deep neural networks,” Neurocomputing, vol. 241, no. 7, pp. 81–89, 2017.
- [40] Zhang H, Wang J, Sun Z, Zurada JM, and Pal NR, “Feature selection for neural networks using group lasso regularization,” IEEE Transactions on Knowledge and Data Engineering, vol. 32, no. 4, pp. 659–673, 2019.
- [41] Yamada Y, Lindenbaum O, Negahban S, and Kluger Y, “Feature selection using stochastic gates,” arXiv preprint, arXiv:1810.04247, 2018.
- [42] Chen R, Yang L, Goodison S, and Sun Y, “Deep-learning approach to identifying cancer subtypes using high-dimensional genomic data,” Bioinformatics, vol. 36, no. 5, pp. 1476–1483, 2020.
- [43] Balin MF, Abid A, and Zou J, “Concrete autoencoders: Differentiable feature selection and reconstruction,” International Conference on Machine Learning, pp. 444–453, 2019.
- [44] Lu Y, Fan Y, Lv J, and Noble WS, “DeepPINK: reproducible feature selection in deep neural networks,” Advances in Neural Information Processing Systems, pp. 8676–8686, 2018.
- [45] Zhu G and Zhao T, “Deep-gKnock: nonlinear group-feature selection with deep neural network,” arXiv preprint, arXiv:1905.10013, 2019.
- [46] Hinton GE and Salakhutdinov RR, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, 2006.
- [47] Shamir O and Zhang T, “Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes,” International Conference on Machine Learning, pp. 71–79, 2013.
- [48] Codling EA, Plank MJ, and Benhamou S, “Random walk models in biology,” Journal of the Royal Society Interface, vol. 5, no. 25, pp. 813–834, 2008.
- [49] Chen CP and Liu Z, “Broad learning system: An effective and efficient incremental learning system without the need for deep architecture,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 1, pp. 10–24, 2017.
- [50] Wang D, Gao X, and Wang X, “Semi-supervised nonnegative matrix factorization via constraint propagation,” IEEE Transactions on Cybernetics, vol. 46, no. 1, pp. 233–244, 2015.
- [51] Lyu B and Haque A, “Deep learning based tumor type classification using gene expression data,” International Conference on Bioinformatics, Computational Biology, and Health Informatics, pp. 89–96, 2018.
- [52] Srivastava N, Hinton G, Krizhevsky A, Sutskever I, and Salakhutdinov R, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
- [53] Statnikov A, Wang L, and Aliferis CF, “A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification,” BMC Bioinformatics, vol. 9, no. 1, p. 319, 2008.
- [54] Statnikov A, Henaff M, Narendra V, Konganti K, Li Z, Yang L, Pei Z, Blaser MJ, Aliferis CF, and Alekseyenko AV, “A comprehensive evaluation of multicategory classification methods for microbiomic data,” Microbiome, vol. 1, no. 1, p. 11, 2013.
- [55] Wang S and Manning C, “Fast dropout training,” International Conference on Machine Learning, pp. 118–126, 2013.
- [56] Kingma DP and Ba J, “Adam: a method for stochastic optimization,” arXiv preprint, arXiv:1412.6980, 2014.
- [57] Barbieri CE, Baca SC, Lawrence MS, Demichelis F, Blattner M, Theurillat JP, White TA, Stojanov P, Van Allen E, Stransky N, and Nickerson E, “Exome sequencing identifies recurrent SPOP, FOXA1 and MED12 mutations in prostate cancer,” Nature Genetics, vol. 44, no. 6, pp. 685–689, 2012.
- [58] Zhao X, Lei YI, Li GE, Cheng Y, Yang H, Xie L, Long H, and Jiang R, “Integrative analysis of cancer driver genes in prostate adenocarcinoma,” Molecular Medicine Reports, vol. 19, no. 4, pp. 2707–2715, 2019.
- [59] Huang DW, Sherman BT, and Lempicki RA, “Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources,” Nature Protocols, vol. 4, no. 1, pp. 44–57, 2009.
- [60] Barrett CS, Millena AC, and Khan SA, “TGF-β effects on prostate cancer cell migration and invasion require FosB,” The Prostate, vol. 77, no. 1, pp. 72–81, 2017.
- [61] Vo BT, Morton D Jr, Komaragiri S, Millena AC, Leath C, and Khan SA, “TGF-β effects on prostate cancer cell migration and invasion are mediated by PGE2 through activation of PI3K/AKT/mTOR pathway,” Endocrinology, vol. 154, no. 5, pp. 1768–1779, 2013.
- [62] Jung SJ, Oh S, Lee GT, Chung J, Min K, Yoon J, Kim W, Ryu DS, Kim IY, and Kang DI, “Clinical significance of Wnt/β-catenin signalling and androgen receptor expression in prostate cancer,” The World Journal of Men’s Health, vol. 31, no. 1, pp. 36–46, 2013.
- [63] Kypta RM and Waxman J, “Wnt/β-catenin signalling in prostate cancer,” Nature Reviews Urology, vol. 9, no. 8, pp. 418–428, 2012.
- [64] Yardy GW and Brewster SF, “The Wnt signalling pathway is a potential therapeutic target in prostate cancer,” BJU International, vol. 98, no. 4, pp. 719–721, 2006.