Abstract
Motivation
Traditional regression models are limited in outcome prediction due to their parametric nature. Current deep learning methods allow for various effects and interactions and have shown improved performance, but they typically need to be trained on a large amount of data to obtain reliable results. Gene expression studies often have small sample sizes but high dimensional correlated predictors so that traditional deep learning methods are not readily applicable.
Results
In this article, we propose peel learning (PL), a novel neural network that incorporates the prior relationship among genes. In each layer of learning, the overall structure is peeled into multiple local substructures. Within each substructure, dependency among variables is reduced through linear projections. The overall structure is gradually simplified over layers and weight parameters are optimized through a revised backpropagation. We applied PL to a small lung transplantation study to predict recipients’ post-surgery primary graft dysfunction using donors’ gene expressions within several immunology pathways, where PL showed improved prediction accuracy compared to conventional penalized regression, classification trees, a feed-forward neural network and a neural network assuming prior network structure. Through simulation studies, we also demonstrated the advantage of adding specific structure among predictor variables in a neural network, over no or uniform group structure, which is more favorable in smaller studies. The empirical evidence is consistent with our theoretical proof of an improved upper bound of PL’s complexity over ordinary neural networks.
Availability and implementation
The PL algorithm was implemented in Python, and the open-source code and instructions will be available at https://github.com/Likelyt/Peel-Learning.
Supplementary information
Supplementary data are available at Bioinformatics online.
1 Introduction
Genes interact and regulate each other. Some of their complex relationships have been experimentally validated and summarized in hierarchical genetic regulatory networks. Within a network, connected genes are more likely to affect clinical outcomes together. It is challenging to model the joint effects of multiple correlated genes on an outcome. Traditional regression models are often limited by assuming linear or fixed-form effects. As powerful alternatives, kernel-based regression models can accommodate flexible effects, but their applications in genetic studies have focused on hypothesis testing rather than prediction (Liu et al., 2007; Wu et al., 2011).
Recently, deep learning methods (LeCun et al., 2015) have been increasingly developed, evolved and applied to predict health outcomes from complex data (Goodfellow et al., 2016; LeCun et al., 2015; Schmidhuber, 2015). These data-driven methods typically assume little about predictor structures and usually need to be trained on a fairly large amount of data to reach a satisfactory model. In most gene expression studies, the sample sizes range from tens to hundreds but the numbers of interrelated genes are up to hundreds of thousands. It is challenging for conventional deep learning algorithms to make informative predictions with small samples. Fortunately, known genetic relationships can serve as prior knowledge to guide the learning process, so that the model search space shrinks and optimal models can be obtained from small samples. Several classes of machine learning methods, including l1-norm regularization Lasso (Alelyani et al., 2013; Tibshirani, 1996), decision tree (DT) (Friedman, 2001) and XGboost (Chen and Guestrin, 2016), handle high-dimensional predictors. These methods have been extended for grouped or correlated predictors, including group Lasso (GL) (Tolosi and Lengauer, 2011; Zhao et al., 2009), tree structure group Lasso (TSGL) (Kamkar et al., 2015; Liu and Ye, 2010) and non-negative Max-heap (MH) (Liu et al., 2011). GL and TSGL use penalization functions of the group or correlation matrix to encourage the selection of correlated yet sparse predictors (so-called ‘features’). MH uses a heredity principle defined by a tree structure to select children features only in the presence of parent features.
To handle high dimensional features, a class of sparsity-inducing or group-sparsity regularizations was introduced to neural networks to expedite gradient algorithms (Scardapane et al., 2017; Tartaglione et al., 2018). Another computationally intensive class, dropout neural network (DNN), uses a large number of different network architectures and randomly drops out nodes during training (Srivastava et al., 2014) to reduce overfitting and improve generalization. The latter has been shown to be more effective than the former (Goodfellow et al., 2016). Along the line of group-sparsity regularizers, PASNet incorporates prior knowledge of gene pathways into layer construction, allows sparsity between pathway and hidden layers, and assumes all within-pathway genes are connected to their pathway node (Hao et al., 2018). Computationally, it applies sparse coding in randomly selected sub-networks. PASNet has shown better ROCs than DNN, and thus will serve as a proper comparison baseline.
In this article, we developed an easily implementable lightweight neural network called peel learning (PL) for leveraging predictor structures to predict outcomes. Our method considers predictor values and inherent predictor relationship as a combined input and performs prediction over layers as in neural network. In each layer, feature values within a local structure are transformed and summarized as an output feature value, and such local structures also evolve over layers to conserve the feature relationship.
Our proposed PL method offers several advances: (i) it is built upon the relationship among genes (or predictor variables). The number of layers and the local substructures are determined/fixed by the initial variable relationship, and each subsequent structure is inherited and trimmed hierarchically from the structure in the previous layer. This greatly reduces the uncertainty and computation in searching for optimal layers and structures; (ii) PL uses a peeling principle and decomposes features into less dependent components, which sufficiently exploits the between-gene association and minimizes gene redundancy. The decomposition operation is built upon multiple subjects, which increases model robustness against individual variation; (iii) compared with PASNet, which connects within-pathway genes to the same node, PL allows a more hierarchical and deeper structure for the within-pathway genes, which helps improve model complexity and prediction accuracy; (iv) in theory, we have proved that our PL algorithm has a lower upper bound of model complexity than a two-layer ordinary neural network (NN) when the feature structure is relatively sparse, which supports PL’s efficiency.
2 Materials and methods
2.1 Notation
Suppose there are n independent subjects in a study. For subject i (i = 1, …, n), m gene expression predictors and an outcome yi are available. The goal is to learn an optimal model to predict Y using X. We use a column vector Xj to denote the values of gene expression j across all subjects, and another column vector Y to denote the outcome of all subjects.
If a gene Xj affects another gene Xk causally, i.e. the change of Xj leads to the change of Xk, we use a pair of indices (j, k) to denote such a directed link. A genetic pathway is an edge collection E that contains all links or ‘edges’ among X. For convenience, E is stored in an m × m binary adjacency matrix with element (j, k) being 1 if (j, k) is in E and 0 otherwise.
Each variable Xk has a parent index set and a child index set, and Jk and Sk denote the number of parents and the number of children of Xk, respectively. Figure 1 shows a simple tree structure of 3 layers that contains 10 edges. Initially, X2 has a single parent X1 and two children X6 and X7, so J2 = 1 and S2 = 2; X8 has two parents X3 and X4 but no child, so J8 = 2 and S8 = 0.
Fig. 1.
An example of pathway structure evolution in PL. Each dashed ellipse contains a local receptive tree (LRT). Within the LRT, parent and children predictors share some common components, as indicated by the same color. After the projection operation, the redundant components in children are removed and the remaining components, as indicated by a different color, are independent from their parent.
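The notation above can be made concrete with a small sketch. The edge list below contains only the five links of Figure 1 that are named in the text (the figure has 10 edges in total), and the function and variable names are illustrative assumptions rather than the authors' code.

```python
from collections import defaultdict

# Hypothetical subset of the Figure 1 edges: only the links named in the text.
edges = [(1, 2), (2, 6), (2, 7), (3, 8), (4, 8)]


def build_index_sets(edges):
    """Return (parents, children): dicts mapping a node index to a sorted list."""
    parents, children = defaultdict(list), defaultdict(list)
    for j, k in edges:  # directed link j -> k
        parents[k].append(j)
        children[j].append(k)
    return ({k: sorted(v) for k, v in parents.items()},
            {j: sorted(v) for j, v in children.items()})


def adjacency(edges, m):
    """m x m binary matrix; entry [j-1][k-1] is 1 if (j, k) is an edge."""
    E = [[0] * m for _ in range(m)]
    for j, k in edges:
        E[j - 1][k - 1] = 1
    return E


parents, children = build_index_sets(edges)
# Number of parents (J) and number of children (S) for features 1..8.
J = {k: len(parents.get(k, [])) for k in range(1, 9)}
S = {k: len(children.get(k, [])) for k in range(1, 9)}
```

With these edges, feature 2 has one parent and two children, and feature 8 has two parents and no child, matching the description of Figure 1.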
For convenience, we summarize both expression values and pathway edges in a structure denoted as
Given its parents’ values, variable Xk is assumed to follow the distribution, , where is an independent random variable.
The distribution of outcome Y is assumed to follow:
| (1) |
where and can be any functions. Our goal is to estimate them through a deep learning procedure.
2.2 Peel learning (PL) method
PL considers both values of X and Xs’ relationship E, as summarized in a ‘structure’ . At layer 1, is the input and also a structure, is the output. If E is a DAG, the subsequent is also a DAG. At layer l, is the input and is the number of input variables. An output is generated as follows.
Step 1. For each variable/feature, we define a two-layer substructure formed by this feature and its children features and call it a local receptive tree (LRT). The original large structure will be ‘peeled’ into many such small structures, so that summary and tracing can be more computationally efficient. The links between features within each LRT are kept intact and will help recover the original structure after numerical operations. The ‘peeling’ algorithm has demonstrated efficiency in multi-generation pedigree studies where genetic data were summarized per small nuclear families and passed to earlier generations (Elston and Stewart, 1971).
Within each LRT, each feature is projected onto its parents’ space and the remaining component (so-called ‘residual’ in regression models) is extracted. Without loss of generality, we start with a linear projection, i.e.
| (2) |
where is a projection matrix, is a matrix with ordered columns and each column being values of a parent of feature k.
Compared with Xk, the residual ϵk is less correlated with its parent(s) and thus with all of its ancestral features. If the projection function matches the underlying relationship between parents and children (e.g. linear projection for linear relationship), ϵk and the parents are expected to be independent/orthogonal. Furthermore, within a substructure, a child feature is only related to other child features through their parents, thus the residuals from all children are more independent of each other than before. The reduced dependency among these residuals would allow us to back-propagate more efficiently. The hat matrix H borrows other subjects’ information (over n) to infer the strength of parent–child association, which may also reduce bias due to individual variation.
For all children features from the same set of parents, the estimated projection matrix only needs to be calculated once within one layer. Because the projection procedure can be done simultaneously for all feature variables, the additional computation over NN is minimal.
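As a minimal sketch of the linear projection in Step 1, the function below returns the residual of a child feature after projecting it onto the space spanned by its parents. It assumes a projection without an intercept and uses Gram-Schmidt orthogonalisation instead of forming the hat matrix explicitly; the function names and the toy values are illustrative assumptions.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def residual(child, parent_columns, tol=1e-12):
    """Residual of `child` after linear projection onto span(parent_columns).

    This equals (I - H) x for the hat matrix H of the parent columns,
    computed by Gram-Schmidt so that no matrix inverse is needed.
    """
    basis = []
    for p in parent_columns:
        q = list(p)
        for b in basis:  # orthogonalise against the parents already processed
            coef = dot(q, b) / dot(b, b)
            q = [qi - coef * bi for qi, bi in zip(q, b)]
        if dot(q, q) > tol:  # skip parents that are linearly dependent
            basis.append(q)
    r = list(child)
    for b in basis:  # remove the component along each orthogonal direction
        coef = dot(r, b) / dot(b, b)
        r = [ri - coef * bi for ri, bi in zip(r, b)]
    return r


# Toy values for four subjects: X8 is projected onto its parents X3 and X4.
x3 = [1.0, 0.0, 0.0, 1.0]
x4 = [0.0, 1.0, 0.0, 1.0]
x8 = [2.0, 3.0, 1.0, 5.0]
eps8 = residual(x8, [x3, x4])  # orthogonal to both parent columns
```

Because every child of the same parent set reuses the same orthogonal basis, the basis can be computed once per parent set and shared, which mirrors the remark that the projection matrix is only calculated once within a layer.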
Step 2. For each child-bearing feature Xk, a linear combination of its value and its children’s residuals is generated and then transformed by the activation function to become a new ‘feature’, also the output value for this feature and the input for next layer. Specifically, is generated by
| (3) |
where transient variable is linear combination of and all ϵ, s are weight parameters, and is a bias parameter. The update of can be written in a matrix notation as:
| (4) |
where transient vector is weight vector is , intercept vector is and is an extended estimated matrix, defined in the following equation:
| (5) |
where is an identity matrix, is a diagonal matrix formed by blocks . We use Figure 1 to illustrate how steps 1 and 2 are performed. At layer 0, as the children of a single parent X2, X6 and X7 are projected/regressed on X2 to obtain ϵ6 and ϵ7. Then a feature X2 in layer 1 is formed by . As a feature with multiple parents, X8 is projected on both X3 and X4 to obtain ϵ8 and then becomes the new feature for X3 in layer 1. The features with multi-generation parents are more complicated; the value of X1 is formed by
A new edge collection is formed by keeping the indices of all child-bearing features and their links with parents in i.e. . So is a subset of . A childless feature does not form a summary by itself and thus does not appear in the next layer. This procedure is similar to trimming the bottom branches of a tree. A special exception applies to singletons (features without parent or child): they are retained in each layer, i.e. the summary around a singleton is simply a function of itself.
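The trimming rule can be sketched as a small set operation. The function below is an illustrative assumption about how one peeling step might be coded, not the authors' implementation: it keeps child-bearing features and singletons, and keeps an edge only if its child is itself child-bearing. The edges are the five Figure 1 links named in the text.

```python
def peel(nodes, edges):
    """One peeling step: drop childless features, keep singletons.

    nodes: set of feature indices present in the current layer
    edges: set of directed (parent, child) pairs among those nodes
    Returns the node set and edge set of the next layer.
    """
    has_child = {j for j, _ in edges}
    has_parent = {k for _, k in edges}
    singletons = {v for v in nodes if v not in has_child and v not in has_parent}
    kept = has_child | singletons
    # An edge survives only if its child also survives as a child-bearing feature.
    new_edges = {(j, k) for j, k in edges if k in has_child}
    return kept, new_edges


nodes = {1, 2, 3, 4, 6, 7, 8}
edges = {(1, 2), (2, 6), (2, 7), (3, 8), (4, 8)}
layer1 = peel(nodes, edges)   # X6, X7 and X8 are trimmed
layer2 = peel(*layer1)        # X2 is trimmed; X3 and X4 persist as singletons
```

Each call returns a subset of the previous edge collection, so repeated peeling terminates once only singletons remain.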
Step 3. At the end of layer L, a small structure with a small number of features and edges remains. They will be used to generate , i.e. If the final features do not have any child, is simplified to
Step 4. The final objective is to minimize a cost/loss function, e.g. for a continuous outcome. In each iteration, the values of W and b are updated by a revised backpropagation algorithm (see Supplementary Material for derivations):
| (6) |
where η is the step size, , and δk is the derivative of the loss function with respect to zk, i.e. for . The value of is calculated per an updated backpropagation algorithm:
| (7) |
where is the element-wise ‘Hadamard’ product. For a given value of , left multiplying the weight matrix moves the error one step back to layer l. is the derivative function of the activation function with respect to .
Note that ordinary neural network (NN) only uses X as the input of each layer, ignoring E.
3 Algorithm
The PL procedure is summarized as follows:
1: Initialize: hierarchical graph D, training epochs T, learning step η, number of layers L, activation function σ, and convergence threshold γ.
2: Initialize: weights , biases .
3: for epoch t = 1 to T do
4: for layer l = 1 to L do
5: Based on structure project each child feature onto its parents’ space and generate residual values for each child. Then for each feature, combine its values and the remaining residual components of its children, i.e. where is defined in Equation 5.
6: Calculate new feature matrix
7: The new structure is automatically inherited from by removing childless features. The edge collection is in reduced dimensions:
8: Calculate the total cost and check whether the loss difference is below the convergence threshold γ. If yes, the algorithm stops; else, continue to step 4.
9: Update the weight and bias parameters W and b, and continue with the next epoch.
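The control flow of the procedure above (epoch loop, convergence check against γ, parameter update) can be sketched as follows. This is only a skeleton under stated assumptions: `compute_loss` and `update_params` are hypothetical callables standing in for the PL forward pass and the revised backpropagation, and the toy quadratic problem is used purely to exercise the loop.

```python
def train(compute_loss, update_params, max_epochs, gamma):
    """Generic epoch loop with a loss-difference stopping rule.

    compute_loss():  runs the forward pass and returns the total cost
    update_params(): applies one update to the weights and biases
    Both callables are placeholders supplied by the caller.
    """
    history = []
    for _ in range(max_epochs):
        loss = compute_loss()
        # Stop once the change in loss between consecutive epochs is below gamma.
        if history and abs(history[-1] - loss) < gamma:
            history.append(loss)
            break
        history.append(loss)
        update_params()
    return history


# Toy problem: minimise (w - 3)^2 by gradient descent with step size eta.
state = {"w": 0.0}
eta = 0.3
history = train(
    compute_loss=lambda: (state["w"] - 3.0) ** 2,
    update_params=lambda: state.update(w=state["w"] - eta * 2.0 * (state["w"] - 3.0)),
    max_epochs=100,
    gamma=1e-8,
)
```

Passing the forward pass and the update as callables keeps the stopping rule independent of how the layers are implemented.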
For large data, the mini-batch process (Hinton, 2010) can be superimposed and the size of the mini-batches can be optimized in a training set. PL’s projection matrix requires a relatively larger batch size to be consistent across batches, compared with ordinary NNs.
3.1 Model complexity
NN aims to minimize the empirical risk , where is the training set, is the loss function. This can be any loss functions such as l2 loss or cross entropy loss. is a hypothesis with being a vector of all parameters.
With the same training error, simple solutions generalize better than complex solutions, which relies on whether the loss optimizer can converge toward low-complexity solution areas (Wu, 2017). Some theoretical work has focused on the geometry of the loss function around a global minimum (Wu, 2017; Zhang et al., 2017). The key is that the optimal solution needs to lie in relatively flat valleys (Hirsch et al., 1974) of the loss function to be able to generalize well.
We show that for a 2-layer PL, the expected value of the derivative with respect to the input X, which reflects the spatial fluctuation of f (Wu, 2017), is bounded (Supplementary Theorem S1 in Supplementary Material) and that the upper bound is smaller than that in the corresponding 2-layer NN (Supplementary Corollary S1 in Supplementary Material).
3.2 Time complexity
For a 2-layer PL, the time complexity is which is dominated by smn2 since we assume n2 is greater than , and s represents the average number of children for each feature. In contrast, the time complexity of a 2-layer NN is When the time complexity of PL is larger than that of NN; otherwise, the time complexity of PL is less. Finally, PL gains additional efficiency from (i) more independent components, fewer singular matrices and faster convergence; (ii) fewer free parameters and (iii) robustness to the number of batches.
4 Implementation
4.1 Lung transplantation data
The data were collected through the Prospective Registry of Outcomes in Patients Electing Lung Transplantation study at The University of Pennsylvania between October 2011 and December 2017. Institutional review board approval and informed written consent were obtained before enrollment. Before surgery, a 1 cm² lung biopsy was obtained from the periphery of the lung using a mechanical stapler (Anraku et al., 2008). Then, after the transplantation reperfusion at the surgery for each recipient, a lung tissue sample was taken again. Samples were preserved and RNA was extracted using a standard protocol. Gene expression was then obtained using the Affymetrix Human Gene 2.1 ST Array, which has whole-transcript coverage for 40 716 total RefSeq transcripts and 24 838 genes (Gellert et al., 2012). Raw expression values were quantile normalized and then summarized per gene level.
The primary outcome is binary (Yes or No): primary graft dysfunction (PGD) (grade 3) in the first 48–72 h following lung transplantation (Christie et al., 2005, 2010). In this project, we intended to predict recipients’ PGD using the pre-operative donors’ expression within immunity-related genetic pathways. We used PL to derive the prediction and compared it with Lasso, GL, TSGL, MH, DT, XGboost, NN and PASNet.
Genetic regulatory pathways and model setup. We considered six immunology pathways from the KEGG Pathway Database: (i) Chemokine signaling pathway (185 genes); (ii) Toll-like receptor signaling pathway (102 genes); (iii) JAK-STAT signaling pathway (153 genes); (iv) Nod-like receptor signaling pathway (64 genes); (v) Graft versus host disease pathway (28 genes); (vi) Primary immunodeficiency pathway (34 genes). For GL, TSGL, MH, PASNet and PL, we used the genetic regulatory pathways published on GenomeNet Database Resources that have been functionally validated (Kanehisa and Goto, 2000; Kanehisa et al., 2010). See Supplementary Figures S7 and S8 for exemplary pathway structures. Across the six pathways, the number of layers ranged from 4 to 11.
We used cross-entropy as the loss function, and area under the receiver-operating-characteristic curve (AUC) and prediction accuracy (ACC) as the evaluation criteria. All model parameters were optimized through 5-fold cross-validation. The numbers of groups for tree-based and group-based methods were set to be 5, 10, 15, 20 and 25. The maximum iteration for GL, TSGL, MH was 100, 100, 100 and 5 and the maximum number of iterations was 10k, 10k, 10k and 50, respectively. The regularization ratio for the above methods was set to 1. All models’ hyperparameters were selected by grid search. The tree’s max depth for DT and XGboost ranged from 3 to 10. XGboost’s feature sample rate was 0.8 and the sub-sample rate was 0.8. NN had 3 hidden layers with neuron sizes [500, 500, 10]. PASNet had 2 hidden layers: the size of the first hidden layer was the number of separate structures and the neuron size of the second hidden layer was 10. The batch size was 20 for NN and PASNet. PL’s hidden size and number of hidden layers were determined by the feature structure. The default learning rate was 0.3 for PL. The final layer activation function for PL and NN was sigmoid. The clinical and expression data as described are available upon request.
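For readers who want to reproduce the two evaluation criteria, a minimal sketch is given below. It computes AUC through the Mann-Whitney pairwise comparison and ACC at a fixed cut-off; the function names, the default threshold of 0.5 and the toy labels are illustrative assumptions and are not taken from the study.

```python
def auc(y_true, scores):
    """AUC via pairwise comparison; ties in score count as one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def accuracy(y_true, scores, threshold=0.5):
    """Fraction of correct labels when scores >= threshold are called positive."""
    hits = sum((s >= threshold) == bool(y) for y, s in zip(y_true, scores))
    return hits / len(y_true)


# Toy example with four subjects (1 = PGD, 0 = no PGD).
y = [0, 0, 1, 1]
p = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc(y, p)        # 3 of the 4 positive-negative pairs are ordered correctly
toy_acc = accuracy(y, p)   # 3 of the 4 subjects are classified correctly at 0.5
```

The pairwise form is quadratic in the number of subjects, which is acceptable for a study of this size; a rank-based formula would be preferable for large samples.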
4.2 Data analysis result
Among 113 enrolled subjects, 28 (24.8%) developed PGD at post-surgery 48–72 h [their characteristics are described in Cantu et al. (2020)]. Figure 2 shows the AUC and optimal ACC of PGD prediction from all methods using all gene expressions within each pathway. The standard errors are shown in Supplementary Material. Compared with Lasso and DT, PL consistently had 3.52% to 18.55% larger AUCs. GL, TSGL and MH had similar AUCs, slightly larger than those of Lasso and DT but 2.37% to 8.15% smaller than those of PL. PL outperformed PASNet by 3.09% to 15.40% and NN by 0.14% to 5.08%, except in the JAK-STAT pathway, where NN had a slightly better AUC than PL, partially because the genes in the JAK-STAT pathway were less connected. PASNet and XGboost had larger variation and smaller AUCs than PL. PASNet performed worse as the number of features became smaller, but there was no clear trend for XGboost.
Fig. 2.

AUC of nine models for the prediction of lung transplantation outcome PGD using donors' gene expressions within six different pathways
Lasso on average had the worst ACC (5.61–15.08% lower than PL). GL, TSGL, MH, DT and XGboost formed a similar cluster with 3.62% to 6.62% smaller ACC than PL. PL outperformed NN and PASNet by 0.14% to 4.35% and by 3.39% to 17.84%, respectively, except in the JAK-STAT and graft versus host disease pathways. In the JAK-STAT pathway, with more loosely connected genes, NN had a slightly larger AUC (0.94%) than PL; in the graft versus host disease pathway, with a smaller number of genes, NN had a larger ACC (1.10%) than PL. PL had comparable or smaller SD of AUC and ACC (Supplementary Tables S4 and S5).
Overall, PL had either optimal or sub-optimal AUCs or ACCs. As we expected, Lasso-like methods, including Lasso, GL, TSGL and MH, performed much worse than others because they extracted information on global or group levels. DT and XGboost performed slightly better because they incorporated multiple levels of the feature variables. Neural network based methods, NN and PASNet, were the best among the baselines because in their layers, multiple non-linear functions were used to estimate the underlying function. PASNet used the pathway information, but its connections between genes and pathways were uniform and generic. Instead, PL used more specific structures to incorporate specific gene-gene correlation strength and gradually remove the redundancy along the relational directions, so it is more efficient for the neural network to extract valuable information.
Among the six pathways, the Toll-like receptor signaling pathway had the strongest evidence of association with lung transplant outcomes (Cantu et al., 2020), and for this pathway PL had noticeably better predictions than the other methods (Fig. 3).
Fig. 3.

ACC of nine models for the prediction of lung transplantation outcome PGD using donors' gene expressions within six different pathways.
4.3 Simulation
We wanted to further understand how much model prediction can improve with a specific relationship among predictors, over NNs with no or uniform group structure. We conducted a series of simulations to compare PL with NN, which does not use any feature correlation, and with the most closely related method, PASNet, which considers fully connected features within a group/pathway.
We considered three representative tree structures with one founder predictor and a fixed number of children predictors for each parent in Figure 4. The binary tree (s = 2) represents a simple structure with less dependency while the quinary (s = 5) and decimal trees (s = 10) have more dependency among feature variables. The total number of feature variables m was fixed. For a given m, the number of layers was pre-determined by the tree structure.
Fig. 4.
Binary, quinary and decimal tree structures in simulations. Side and bottom branches are skipped due to space
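The tree structures described above can be generated with a few lines of code. The sketch below assumes a tree that is filled level by level with zero-based node indices, which is one plausible reading of the simulation design; the function names are illustrative.

```python
def kary_tree_edges(m, s):
    """Edges (parent, child) of a tree on nodes 0..m-1 filled level by level.

    Node 0 is the founder; node j >= 1 has parent (j - 1) // s.
    """
    return [((j - 1) // s, j) for j in range(1, m)]


def n_levels(m, s):
    """Number of tree levels (founder level counted as 1) needed to hold m nodes."""
    levels, capacity, width = 0, 0, 1
    while capacity < m:
        capacity += width   # add the nodes of the next level
        width *= s          # each node on that level has s children
        levels += 1
    return levels


binary_edges = kary_tree_edges(7, 2)
depths = {s: n_levels(1000, s) for s in (2, 5, 10)}
```

For a fixed m, a larger s gives a shallower tree, which illustrates why the number of layers is pre-determined once the tree structure is chosen.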
The values of the founder variable were simulated from a normal distribution. The values of other variables were generated following a top-down structure. Specifically, the value of each child was generated from a normal distribution with mean equal to a linear or non-linear (sin) function of its parents’ values. The strength of the parent–child relationship was determined by R2, the percentage of variance in a child variable explained by its parents. R2 was set at 0.1, 0.3 and 0.5, respectively.
After all predictors were generated, 20 causal genes were selected and a binary outcome Y was generated from a binomial distribution with the success probability given by a logistic model with intercept α and coefficient β, where β is the logarithm of the odds ratio (OR) of Y for each unit increase of X. OR determined the strength of the X–Y association and was set at 1.1, 1.2 and 1.5 for weak, moderate and strong relations. α was chosen so that the probability of Y = 1 was about 50%.
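A data-generating process of this kind can be sketched as follows. Several details are assumptions made only for illustration: a standard normal founder, a single parent per child in an s-ary tree, signal and noise weights of sqrt(R2) and sqrt(1 - R2), a common β for all causal genes, and a logistic link for the outcome. None of the names below come from the authors' code.

```python
import math
import random


def simulate(n, m, s, r2, beta, alpha=0.0, n_causal=20, nonlinear=False, seed=0):
    """Simulate tree-structured predictors X and a binary outcome y."""
    rng = random.Random(seed)
    f = math.sin if nonlinear else (lambda v: v)
    X = []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0)]          # founder variable
        for j in range(1, m):
            parent = row[(j - 1) // s]       # single parent in an s-ary tree
            # For a unit-variance linear signal, the parent explains a
            # fraction r2 of the child's variance under these weights.
            noise = rng.gauss(0.0, 1.0)
            row.append(math.sqrt(r2) * f(parent) + math.sqrt(1.0 - r2) * noise)
        X.append(row)
    causal = rng.sample(range(m), n_causal)  # indices of the causal genes
    y = []
    for row in X:
        eta = alpha + beta * sum(row[j] for j in causal)
        prob = 1.0 / (1.0 + math.exp(-eta))  # logistic link
        y.append(1 if rng.random() < prob else 0)
    return X, y, causal


X, y, causal = simulate(n=50, m=30, s=2, r2=0.3, beta=math.log(1.2))
```

With centred predictors and α = 0, the outcome is roughly balanced, which is consistent with choosing α so that the probability of Y = 1 is about 50%.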
We also varied the number of predictor variables m from 500, 1000, to 5000, and the sample size n from 200, 500, to 1000. We generated 50 datasets for each set of parameters. We chose tanh as the activation function to provide a stronger gradient than sigmoid functions for NN and PL. NN used 3 layers: [1000, 1000, 10]. PASNet used 3 layers: [1000, 1, 1], as recommended. PL automatically determined the number of layers per graph structure. The batch size was 100 for PL and 10 for NN and PASNet. The optimal learning rate was 0.05, 0.0001 and 0.1 for NN, PASNet and PL, respectively, to favor each algorithm. SGD was the optimizer for all methods. To avoid overfitting, we divided each dataset into training, validation and testing sets with an 8:1:1 ratio. We used MSE as the loss function. Smaller MSE values in the testing set generally indicate better prediction performance. We chose the initial learning rate from 0.01 to 1, 0.0001 to 0.1, 0.1 to 0.1, and then narrowed down to more precise values with an interval size of 0.01, 0.0001 and 0.1 for NN, PASNet and PL, respectively.
We recorded CPU time of all runs for PL and NN, on an Intel Xeon E5-2660 v3 desktop equipped with 2.6 GHz CPU and 128 GB RAM. To understand the convergence of PL and NN, we also tracked the loss function values at each epoch within a random dataset using different step sizes, and the values across 50 different datasets using a fixed learning rate.
Last, we tested the robustness of PL when only partial edge information is available and some ambiguous directions are misspecified. In this setting, we randomly selected a proportion e of the edges and assumed the opposite/wrong directions.
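The mis-specification step can be illustrated with a short sketch that reverses a random fraction of the edges. The function name, the rounding of the number of flipped edges and the seed handling are assumptions for illustration only.

```python
import random


def flip_edges(edges, rate, seed=0):
    """Reverse the direction of a randomly chosen fraction `rate` of the edges."""
    rng = random.Random(seed)
    n_flip = round(rate * len(edges))
    chosen = set(rng.sample(range(len(edges)), n_flip))
    # A chosen edge (j, k) becomes (k, j); all other edges are left unchanged.
    return [(k, j) if i in chosen else (j, k) for i, (j, k) in enumerate(edges)]


chain = [(i, i + 1) for i in range(10)]
flipped = flip_edges(chain, rate=0.3)
n_changed = sum(a != b for a, b in zip(chain, flipped))
```

Only directions change: every pair of linked features stays linked, which matches the setting in which the links are known but some directions are wrong. Reversing edges in a tree can give a feature several parents, so downstream code should not assume a single parent per child.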
4.4 Simulation result
Table 1 shows the mean and SD of MSE of NN, PASNet and PL out of 50 simulations, for 1000 input variables and 200 subjects. PL consistently had the smallest mean MSE among the three methods regardless of the complexity of the input dependency structures. The MSEs in non-linear scenarios were larger than those in scenarios where the X–X and X–Y relationships were linear, because of increased relation complexity.
Table 1.
Mean (SD) MSE of the prediction from NN, PASNet, and PL for 200 subjects (n) and 1000 input variables
| Layers | X-X, X-Y function | NN | PASNet | PL |
|---|---|---|---|---|
| 2 | Linear | 1.096 (0.262) | 0.935 (0.293) | 0.633 (0.276) |
| 2 | Non-linear | 1.948 (0.845) | 1.079 (0.020) | 0.896 (0.358) |
| 5 | Linear | 0.699 (0.203) | 1.012 (0.278) | 0.399 (0.120) |
| 5 | Non-linear | 1.439 (0.538) | 1.228 (0.585) | 1.156 (0.550) |
| 10 | Linear | 0.288 (0.121) | 1.189 (0.441) | 0.209 (0.139) |
| 10 | Non-linear | 2.421 (1.280) | 1.698 (0.532) | 1.379 (0.569) |
Table 2 shows the mean and SD of MSE from NN and PL for different m and n. In general, as m decreased or n increased, the mean MSE decreased, implying a more precise prediction. At n = 200, PL had a mean MSE reduction of 10–21% compared with NN. However, as n increased, PL and NN resulted in similar MSEs. When both m and n were large, NN performed slightly better than PL.
Table 2.
Mean (SD) MSE of the prediction from NN and PL for different numbers of input variables (m) and sample sizes (n)
| m / n | Layers | NN | PL |
|---|---|---|---|
| 200 | 2 | 1.462 (0.374) | 1.321 (0.291) |
| 200 | 5 | 0.796 (0.206) | 1.446 (0.221) |
| 200 | 10 | 0.934 (0.280) | 0.956 (0.132) |
| 1000 | 2 | 1.948 (0.845) | 0.896 (0.358) |
| 1000 | 5 | 1.439 (0.538) | 1.156 (0.545) |
| 1000 | 10 | 2.421 (1.280) | 1.379 (0.569) |
| 5000 | 2 | 2.844 (1.291) | 0.942 (0.457) |
| 5000 | 5 | 1.968 (0.995) | 1.029 (0.550) |
| 5000 | 10 | 2.520 (1.566) | 1.320 (0.873) |
| 100 | 2 | 2.138 (0.864) | 1.012 (0.388) |
| 100 | 5 | 2.801 (1.226) | 1.330 (0.563) |
| 100 | 10 | 1.887 (0.931) | 1.248 (0.536) |
| 500 | 2 | 1.425 (0.571) | 1.149 (0.455) |
| 500 | 5 | 1.237 (0.524) | 0.967 (0.314) |
| 500 | 10 | 1.245 (0.567) | 0.891 (0.524) |
| 2000 | 2 | 0.538 (0.284) | 0.642 (0.449) |
| 2000 | 5 | 0.487 (0.247) | 0.689 (0.210) |
| 2000 | 10 | 0.483 (0.232) | 0.629 (0.237) |
Table 3 shows the mean and SD of MSE in PL when the data structure was mis-specified. With more mis-specified directions (as rate e increased), the prediction of PL became less accurate; however, as long as the links between feature variables were specified, the performance of PL was stable regardless of mis-specified directions, except in the scenario of a simple binary tree with a high mis-specification rate (40% or 50%).
Table 3.
Mean (SD) MSE of the prediction from PL with mis-specified directions, at mis-specification rates (e) of 0–50% with m = 5000 and n = 200
| s | X–X, X–Y relationship | Correct e = 0 | e = 10% | e = 20% | e = 30% | e = 40% | e = 50% |
|---|---|---|---|---|---|---|---|
| 2 | Linear | 0.787 (0.277) | 0.973 (0.355) | 1.040 (0.307) | 1.882 (1.603) | 4.438 (2.379) | 5.551 (2.54) |
| 2 | Non-linear | 0.942 (0.457) | 1.312 (0.478) | 1.337 (0.679) | 1.329 (0.830) | 1.470 (0.978) | 1.967 (1.594) |
| 5 | Linear | 0.614 (0.120) | 0.793 (0.437) | 0.946 (0.625) | 1.035 (0.579) | 1.087 (0.717) | 1.173 (0.924) |
| 5 | Non-linear | 1.029 (0.550) | 1.568 (0.675) | 1.426 (0.761) | 1.676 (0.711) | 1.613 (0.844) | 2.035 (2.253) |
| 10 | Linear | 0.892 (0.495) | 1.084 (0.513) | 1.426 (0.752) | 1.540 (0.699) | 2.150 (2.039) | 2.735 (1.678) |
| 10 | Non-linear | 1.320 (0.873) | 1.653 (0.888) | 1.498 (0.597) | 1.886 (1.074) | 1.802 (0.814) | 2.283 (1.472) |
Supplementary Figure S5(a) and (b) show the learning curves of PL and NN for 50 different datasets with a fixed learning step η and the same initial weight and bias parameters. MSEs of PL always converged after a few iterations, while many MSE curves of NN decreased initially but increased afterwards, which suggested some overfitting. Supplementary Figure S5(c) and (d) show the learning curves for a range of learning steps (0.1–1 for PL and 0.01–0.1 for NN) in a randomly selected dataset. The MSEs from PL were all under 5 and converged consistently toward 0 as the number of iterations increased, but the MSEs from NN varied often and the curves with medium step sizes performed the best. PL seems to be less sensitive to the choice of step size than NN.
Supplementary Figure S6 shows the mean and standard deviation of the CPU time of PL and NN runs. When the sample size, the number of predictors or the number of layers increased, the computation time of both PL and NN increased. The computation time of NN was doubled when n increased from 200 to 500, but the time increase of PL was minor. The difference between the linear and non-linear relationship was unnoticeable.
5 Discussion
As a proof-of-concept study, our study provided a straightforward option for prediction given structured features and was not designed to exhaustively improve all optimization steps. In real studies and simulations, PL outperformed other methods in almost all scenarios, regardless of sample size, number of variables, and linear or non-linear feature relationships. Even when relational directions were mis-specified, PL’s results remained robust. However, when the presence or absence of a relationship is inferred incorrectly, the benefit of PL over NN might be hindered. So the proportion of correctly inferred branches might be the key factor for the superior performance of PL. PL performed similarly to NN or PASNet when features were loosely or densely connected. In the extreme case where the features are independent, PL becomes an NN with the same layers, and the number of layers can be selected.
PL was able to handle high dimensional gene expressions within multiple pathways. Using all genes within the 6 pathways combined in the lung transplantation study, the average AUC and ACC (±SD) of PL models were 0.834 (±0.103) and 0.807 (±0.042), respectively, an improvement over the prediction models using any single pathway alone.
We also proved that the upper bound of the 2-layer PL’s model complexity is less than that of the 2-layer NN if the relational links are relatively sparse. This also means that, for the same model complexity, PL requires smaller sample sizes to achieve the same performance as NN, which was empirically demonstrated in both real data and simulations. Computationally, even though PL has a longer epoch time due to the introduction of the projection, it converges in fewer epochs.
In PL, features within a substructure are decomposed through a linear projection, which can be easily modified to reflect non-linear relationships. So far, the choice of linear projection strikes a balance between easy propagation and over-fitting. Within each LRT, the original values of parents rather than their post-projection residuals were used for summary because using all residuals may amplify the effect of over-projection and lead to sub-optimal performance.
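The idea of removing dependency through a linear projection can be sketched as follows. This is a simplified illustration, assuming a child feature is regressed on a single parent by ordinary least squares with an intercept; the projection used inside PL may differ in direction and form, and the names `project_out` and `covariance` as well as the toy values are hypothetical.

```python
from statistics import fmean


def project_out(child, parent):
    """Residuals of `child` after least-squares projection onto `parent` (with intercept)."""
    if len(child) != len(parent):
        raise ValueError("child and parent must have the same length")
    mean_c, mean_p = fmean(child), fmean(parent)
    sxx = sum((p - mean_p) ** 2 for p in parent)
    sxy = sum((p - mean_p) * (c - mean_c) for p, c in zip(parent, child))
    # A constant parent carries no information, so only the mean is removed.
    slope = sxy / sxx if sxx > 0 else 0.0
    return [(c - mean_c) - slope * (p - mean_p) for p, c in zip(parent, child)]


def covariance(x, y):
    """Sample covariance, used here only to check the projection."""
    mx, my = fmean(x), fmean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


# Hypothetical expression values for one parent gene and one child gene.
parent = [1.0, 2.0, 3.0, 4.0, 5.0]
child = [2.1, 3.9, 6.2, 7.8, 10.1]

residual = project_out(child, parent)
print("cov(child, parent)   :", covariance(child, parent))
print("cov(residual, parent):", covariance(residual, parent))
```

The residual is linearly uncorrelated with the parent up to floating-point error, while the parent keeps its original values, in line with the choice described above of summarizing with the parents' original values rather than their residuals.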
We have focused on unidirectional relationships, but PL can be extended to undirected or even looped structures in the future. When the relationship among predictors is unknown, the structure can be approximated using existing methods.
As in Bayesian methods, the prior feature relationship plays a bigger role in PL’s performance in smaller samples. When the sample size (n) is much larger than the number of predictors (m), simpler methods such as NN may work as well as PL.
A class of methods for high dimensional predictors selects relevant predictors before running NN, such as DNP (Liu et al., 2017), Diet Network (Romero et al., 2016) and Graph-Embedded Deep Feedforward Networks [GEDFN (Kong and Yu, 2018)]. They were all intended for predictors with undirected correlations. As a demonstration, we applied PL to predict the estrogen-positive or estrogen-negative breast cancer subtype in the TCGA BRCA project using the RNA-seq data of 1158 patients. The testing accuracy and AUC over 10 runs using the same substructure selected by GEDFN were 0.911 ± 0.013 and 0.928 ± 0.011, respectively, similar to the results from GEDFN (accuracy 0.902 ± 0.017, AUC 0.939 ± 0.016). We will conduct future studies to investigate the important topic of feature selection.
Strong high order interaction effects may require more layers in machine learning methods (Friedman and Popescu, 2008). The fixed number of layers in PL might limit the potential of detecting higher order interactions.
Funding
The authors' research is supported by NIH grants HL155821 (EC and RF), MH107571 (HR and RF), HL116656 (EC), HL115227 (EC) and AG051981.
Conflict of Interest: none declared.
Supplementary Material
Contributor Information
Yuantong Li, Department of Statistics, Purdue University, West Lafayette, IN 47907, USA.
Fei Wang, Department of Healthcare Policy and Research, Cornell University Weill Medical School, New York, NY 10065, USA.
Mengying Yan, Department of Biostatistics and Bioinformatics, Duke University, Durham, NC 27710, USA.
Edward Cantu III, Department of Surgery, University of Pennsylvania, Philadelphia, PA 19104, USA.
Fan Nils Yang, Department of Neuroscience, Georgetown University, Washington, DC 20057, USA.
Hengyi Rao, Department of Neurology, University of Pennsylvania, Philadelphia, PA 19104, USA.
Rui Feng, Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA.
References
- Alelyani S. et al. (2013) Feature selection for clustering: a review. Data Cluster. Algorithms Appl., 29, 110–121.
- Anraku M. et al. (2008) Impact of human donor lung gene expression profiles on survival after lung transplantation: a case-control study. Am. J. Transplant., 8, 2140–2148.
- Cantu E. et al. (2020) Pre-procurement in situ donor lung tissue gene expression classifies primary graft dysfunction risk. Am. J. Respir. Critical Care Med., 202, 1046–1048.
- Chen T., Guestrin C. (2016) Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, 13–17 August 2016, pp. 785–794.
- Christie J.D. et al.; ISHLT Working Group on Primary Lung Graft Dysfunction. (2005) Report of the ISHLT working group on primary lung graft dysfunction part I: introduction and methods. J. Heart Lung Transplant., 24, 1451–1453.
- Christie J.D. et al. (2010) Construct validity of the definition of primary graft dysfunction after lung transplantation. J. Heart Lung Transplant., 29, 1231–1239.
- Elston R.C., Stewart J. (1971) A general model for the genetic analysis of pedigree data. Hum. Hered., 21, 523–542.
- Friedman J., Popescu B. (2008) Predictive learning via rule ensembles. Ann. Appl. Stat., 2, 916–954.
- Friedman J. (2001) The Elements of Statistical Learning, Vol. 1. Springer Series in Statistics, New York.
- Gellert P. et al. (2012) Gene array analyzer: alternative usage of gene arrays to study alternative splicing events. Nucleic Acids Res., 40, 2414–2425.
- Goodfellow I. et al. (2016) Deep Learning, Vol. 1. MIT Press, Cambridge.
- Hao J. et al. (2018) Pasnet: pathway-associated sparse deep neural network for prognosis prediction from high-throughput data. BMC Bioinformatics, 19, 510.
- Hinton G. (2010) A practical guide to training restricted Boltzmann machines. Momentum, 9, 926.
- Hirsch M.W. et al. (1974) Differential Equations, Dynamical Systems, and Linear Algebra, Vol. 60. Academic Press.
- Kamkar I. et al. (2015) Stable feature selection for clinical prediction: exploiting ICD tree structure using tree-lasso. J. Biomed. Inf., 53, 277–290.
- Kanehisa M., Goto S. (2000) KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res., 28, 27–30.
- Kanehisa M. et al. (2010) KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res., 38, D355–D360.
- Kong Y., Yu T. (2018) A graph-embedded deep feedforward network for disease outcome classification and feature selection using gene expression data. Bioinformatics, 34, 3727–3737.
- LeCun Y. et al. (2015) Deep learning. Nature, 521, 436–444.
- Liu B. et al. (2017) Deep neural networks for high dimension, low sample size data. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 2287–2293.
- Liu D. et al. (2007) Semiparametric regression of multidimensional genetic pathway data: least-squares kernel machines and linear mixed models. Biometrics, 63, 1079–1088.
- Liu J., Ye J. (2010) Moreau-Yosida regularization for grouped tree structure learning. In NIPS'10: Proceedings of the 23rd International Conference on Neural Information Processing Systems, Vol. 2, pp. 1459–1467.
- Liu J. et al. (2011) Projection onto a nonnegative max-heap. In Proceedings of the 24th International Conference on Neural Information Processing Systems, pp. 487–495.
- Romero A. et al. (2016) Diet networks: thin parameters for fat genomics. arXiv preprint arXiv:1611.09340.
- Scardapane S. et al. (2017) Group sparse regularization for deep neural networks. Neurocomputing, 241, 81–89.
- Schmidhuber J. (2015) Deep learning in neural networks: an overview. Neural Netw., 61, 85–117.
- Srivastava N. et al. (2014) Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15, 1929–1958.
- Tartaglione E. et al. (2018) Learning sparse neural networks via sensitivity-driven regularization. CoRR, abs/1810.11764.
- Tibshirani R. (1996) Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodological), 58, 267–288.
- Tolosi L., Lengauer T. (2011) Classification with correlated features: unreliability of feature ranking and solutions. Bioinformatics, 27, 1986–1994.
- Wu L. (2017) Towards understanding generalization of deep learning: perspective of loss landscapes. arXiv preprint arXiv:1706.10239.
- Wu M.C. et al. (2011) Rare variant association testing for sequencing data with the sequence kernel association test (SKAT). Am. J. Hum. Genet., 89, 82–93.
- Zhang C. et al. (2017) Understanding deep learning requires rethinking generalization. In Proceedings of the 5th International Conference on Learning Representations, 45820.
- Zhao P. et al. (2009) The composite absolute penalties family for grouped and hierarchical variable selection. Ann. Stat., 37, 3468–3497.