Summary
Integrative learning of multiple datasets has the potential to mitigate the challenge of small n and large p that is often encountered in analysis of big biomedical data such as genomics data. Detection of weak yet important signals can be enhanced by jointly selecting features for all datasets. However, the set of important features may not always be the same across all datasets. Although some existing integrative learning methods allow heterogeneous sparsity structure where a subset of datasets can have zero coefficients for some selected features, they tend to yield reduced efficiency, reinstating the problem of losing weak important signals. We propose a new integrative learning approach which can not only aggregate important signals well in homogeneous sparsity structure, but also substantially alleviate the problem of losing weak important signals in heterogeneous sparsity structure. Our approach exploits a priori known graphical structure of features and encourages joint selection of features that are connected in the graph. Integrating such prior information over multiple datasets enhances the power, while also accounting for the heterogeneity across datasets. Theoretical properties of the proposed method are investigated. We also demonstrate the limitations of existing approaches and the superiority of our method using a simulation study and analysis of gene expression data from ADNI.
Keywords: integrative learning, horizontally partitioned data, knowledge-guided learning, network-based penalty, high-dimensional data
1 |. INTRODUCTION
Massive amounts of high-throughput -omics data generated in recent studies offer great promise for deepening our understanding of the molecular underpinnings and mechanisms of complex diseases such as Alzheimer’s disease and cancer. At the same time, these data still present significant analytical challenges, as the sample size in a single study is often small to moderate. There is a large body of literature on regularized regression models for the analysis of high-dimensional data in the setting where the number of variables is larger than the sample size. While many of these methods have appealing asymptotic properties, there is a growing recognition that their performance in practice is often unsatisfactory when the sample size is small and the signal-to-noise ratio is low. A number of approaches have been proposed to mitigate this limitation of regularized regression, particularly for the analysis of genomics data.
One popular approach is to incorporate prior knowledge on high-dimensional predictors, such as gene regulatory pathways and co-expression networks, that is represented by graphs and can be obtained from public or commercial databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG, Kanehisa et al. [13]) and Gene Ontology [2]. The knowledge-guided approach for structured data whose variables lie on a graph [15] has been adopted in supervised learning such as regression [14, 23, 26, 7] and in unsupervised learning [17, 18], through carefully designed penalty functions in a frequentist framework or prior specifications in a Bayesian framework. The rationale for incorporating the graphical structure of features into supervised learning is that phenotypic biomarkers are often manifested as a result of interactions among a group of genes (a pathway). It is typically not the case that the important features are unrelated; rather, one or more groups of closely related genes jointly have predictive power. Therefore, the graphical information can be integrated by encouraging group-wise selection of the model coefficients. For example, Li and Li [14] and Pan et al. [23] propose network-based penalties which encourage joint selection of predictors that are connected in the graph. Li and Zhang [15] and Stingo et al. [24] use a Markov random field (MRF) prior combined with a spike-and-slab prior to encourage selection of connected predictors. More recently, Chang et al. [7] propose a structured shrinkage prior which mitigates some issues associated with earlier Bayesian methods. While all the aforementioned methods use the predictor graph in an edge-by-edge manner and encourage selection of adjacent nodes, Yu and Liu [26] propose a method that uses the predictor graph in a node-by-node manner and encourages selection of the neighborhood group of each predictor.
These knowledge-guided statistical learning methods have shown improved prediction accuracy and improved power for detecting weak yet important signals in finite samples. Moreover, they encourage selection of pathways rather than individual features, leading to biologically more meaningful and interpretable results [30].
Another useful approach for mitigating the small sample size problem is integrative learning of multiple datasets that contain the same set of variables, also known as horizontally partitioned data. Multiple datasets are broadly defined as being collected from multiple studies/sites, from multiple sex/racial groups, or from multiple related disease groups. One key advantage of integrative learning in regression is that it improves the power for detecting important predictors that are shared across the datasets [29]. Ma et al. [22] proposed an integrative analysis approach that assumes the same sparsity structure of the regression coefficients across all datasets but allows for different effect sizes. The homogeneous sparsity assumption, however, can be overly restrictive in some applications. This assumption is relaxed in subsequent work by, among others, Li et al. [16], Liu et al. [20, 21], and Huang et al. [11], which allow for heterogeneity in sparsity structure across multiple datasets. However, these existing integrative learning methods do not account for graph information on structured predictors such as genomics data, which has the potential to further improve the power to detect weak yet important signals. Moreover, because the existing heterogeneity models allow the coefficient of a selected feature to become zero in some datasets, they may miss such weak yet important signals, weakening the power of integrative learning. To the best of our knowledge, there has been little work on incorporating graph information into integrative learning except for [19], whose approach relies on the assumption of homogeneous sparsity across all datasets; this assumption may be unrealistic in many applications. For example, when performing integrative learning of datasets from populations at different stages of a disease, the set of important predictors may vary across the datasets.
To address this gap, we propose a novel integrative learning approach, called Structured Integrative Learning (SIL), which enables incorporation of structural information such as graphical knowledge on predictors. The key idea underlying SIL is that if a group of features as defined by pathways/networks are important in one dataset, they are likely to be important for the other datasets as well. Our approach is designed to select ‘groups of features’ jointly for all datasets rather than selecting ‘individual features’ jointly. As such, it is expected to further improve the power of detecting weak, yet important signals. Our proposed method can accommodate both homogeneous and heterogeneous sparsity structure, and in particular our method is theoretically justifiable. We show the oracle inequalities that provide the upper bounds of estimation and prediction errors in a non-asymptotic manner. We also investigate the conditions for the oracle property to hold in the setting where both the number of datasets and the number of predictors diverge. Another contribution of our work is to develop an iterative shrinkage-thresholding algorithm [3, 10] that fits our model, which is much more scalable than the (sub)gradient descent algorithm that has typically been used in prior work for integrative learning. We show that the proximal operators associated with our regularizers have analytic solutions and can be evaluated very efficiently.
We note that Chang et al. [8] presented intermediate results of our research on the proposed method. Compared to that earlier version, this work includes several significant improvements. First, the current work presents a more general penalty formulation, of which the penalty in the earlier version can be viewed as a special case. Second, theoretical properties of the general penalty are rigorously investigated, whereas the earlier version includes no theoretical results. Third, we include new regularizers based on the log-sum penalty, together with efficient algorithms for them; the regularizers in the earlier paper are still included and compared in the simulation and data analysis studies. Fourth, in the simulation study, we compare performance using both homogeneity data and heterogeneity data, while the earlier version uses heterogeneity data only. Fifth, we perform a sensitivity analysis investigating the robustness of our method against inaccurate and/or incomplete graphical information, whereas the earlier work includes no sensitivity analysis. Finally, we include a pathway enrichment analysis in the data analysis, which demonstrates that our method yields more interpretable and biologically meaningful results; the earlier work included no such analysis.
The remainder of this article is organized as follows. We describe the problem of interest and present our proposed method in Section 2, and then present the numerical algorithm in Section 3. In Section 4, we present the theoretical properties of the proposed method. In Section 5, we conduct simulation studies to evaluate the performance of our approach in comparison with several existing methods. In Section 6, we further illustrate the strengths of the proposed method through analysis of real data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). We conclude the paper with a discussion in Section 7.
2 |. METHOD
2.1 |. Background
To fix ideas, consider fitting a linear regression model using data from M datasets. In the m-th dataset, we have an nm × p predictor matrix Xm and an nm × 1 response vector ym, where nm is the sample size of the m-th dataset and p is the number of predictors. Let N = n1 + ⋯ + nM be the total sample size. The model of interest is the linear model
| (1) |
where βm0 is the p × 1 true coefficient vector and εm is the nm × 1 error vector for the m-th dataset. The regularized least-squares loss function is generally given by
where B = [β1 ⋯ βM] is the p × M coefficient matrix, P(B) is a penalty on B, and
The most general estimator would allow all βm to be different and use a separable penalty . This is equivalent to independently minimizing
which is equivalent to analyzing each dataset separately. We call this model the fully heterogeneous model. On the other hand, the least general estimator would assume all coefficients to be the same across all datasets; βm ≡ β. This fully homogeneous model is equivalent to merging all datasets, with each data point weighted by the reciprocal of the size of the dataset it belongs to. The weights prevent a large dataset from dominating the loss function and keep the coefficients from leaning toward the large dataset.
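As a concrete illustration, the weighting scheme of the fully homogeneous model can be sketched as follows. This is a minimal sketch with names of our own choosing (not part of the proposed method): each observation is weighted by 1/nm so that a large dataset cannot dominate the pooled least-squares loss.

```python
import numpy as np

# Minimal sketch (function and variable names are ours): pool all datasets and
# weight each observation by 1/n_m, the reciprocal of its dataset's size.
def pooled_weighted_fit(Xs, ys):
    w = np.concatenate([np.full(len(y), 1.0 / len(y)) for y in ys])
    X, y = np.vstack(Xs), np.concatenate(ys)
    sw = np.sqrt(w)
    # weighted least squares via a rescaled design: min_b sum_i w_i (y_i - x_i' b)^2
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta

# Example: two noiseless datasets of very different sizes sharing one coefficient vector
rng = np.random.default_rng(0)
b_true = np.array([1.0, -2.0, 0.5])
X1, X2 = rng.standard_normal((30, 3)), rng.standard_normal((120, 3))
b_hat = pooled_weighted_fit([X1, X2], [X1 @ b_true, X2 @ b_true])
```

In this noiseless example the pooled weighted fit recovers the shared coefficient vector regardless of the imbalance in dataset sizes.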
Obviously, the full homogeneity assumption can be overly restrictive. Each dataset often has its own characteristics, and the association between the outcome and the predictors can differ. Ignoring these differences can lead to poor or suboptimal performance in estimation and prediction. On the other hand, the fully heterogeneous model fails to borrow information across datasets, and the regression model for each dataset can suffer from the curse of dimensionality. The motivation of integrative learning is to aggregate common information from multiple datasets while accounting for heterogeneity across these datasets.
2.2 |. Structured Integrative Learning
In this work, we focus on the case where the graph information for predictors is the same across the M datasets. Denote by G = ⟨𝑉,E⟩ the graph on predictors X where 𝑉 = {1, ... , p} is the set of features and E is the set of edges between the features. Let A = [𝑎jk] be the adjacency matrix associated with G and let be the neighborhood of the j-th feature including itself. Let e = |E| be the number of edges in G and be the number of members in . The graphical information on features often represents the partial correlation structure of the features. That is, the presence of an edge between features j and k implies the (j, k) entry of the feature precision matrix is nonzero, while an absence of edge means a zero entry. In analysis of genomics data, such graphs often represent gene regulatory pathways or co-expression networks which can be obtained from existing databases such as KEGG [13].
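The neighborhood groups used throughout can be computed directly from the adjacency matrix. The following is a small sketch (function names are ours): for each feature j, the group consists of j itself together with all features adjacent to it in G.

```python
import numpy as np

# Sketch (assumed names): build the neighborhood group of each feature,
# i.e. {j} together with {k : a_jk = 1}, from an adjacency matrix A.
def neighborhoods(A):
    p = A.shape[0]
    return [sorted({j} | {k for k in range(p) if A[j, k] != 0}) for j in range(p)]

# Example: a 4-node ring graph 0 - 1 - 2 - 3 - 0
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]])
groups = neighborhoods(A)  # e.g. the group of feature 0 is [0, 1, 3]
```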
For Model (1), note that we have βm = Ωmcm where Ωm = nmE(Xm𝑇Xm)−1 and cm = nm−1E(Xm𝑇ym). This yields
| (2) |
where is the j-th column of Ωm. Since the absence of an edge implies a zero partial correlation, we have . Following the observation from Yu and Liu [26], either can have nonzero values (if ), or there will be no contribution from the j-th group to the effect size βm (if ). The key premise of our work is that if a group is important for one dataset, it is likely to be important for other datasets as well. To encourage joint selection of the feature groups across all datasets, we propose the following penalty.
| (3) |
where B is an arbitrary p × M matrix.
Note that a standard group lasso penalty such as where and ‖ · ‖F is the Frobenius norm has an undesirable characteristic. Since the groups are overlapping in general, it yields a nonzero coefficient for feature j if and only if all groups including j are selected. In other words, gene j can be selected if and only if all groups (pathways) including j are important, which is not plausible in biology. On the other hand, the proposed penalty (3) introduces latent coefficients Γj and can yield a nonzero coefficient if at least one group that contains j is selected. That is, if a pathway is important, then all genes in the pathway can become important. This property is required for consistency with (2). More details about the latent group lasso penalty can be found in Jacob et al. [12]. It is easy to see that applying the proposed penalty is equivalent to replacing βm with the sum of its latent components in the loss function and enforcing a penalty on Γ = (Γ1, ... , Γp) as follows.
Minimizing L(Γ) + P(Γ) will yield the same solution as minimizing L(B) + P(B) with . The efficient algorithms for selected penalties will be presented in Section 3.
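The latent decomposition can be sketched as follows. This is a hedged illustration (function names and shapes are ours): each latent block lives on the neighborhood group of one feature, the coefficient matrix is recovered by summing the expanded blocks, and a feature therefore becomes nonzero as soon as any group containing it is selected — the latent group lasso behavior.

```python
import numpy as np

# Hedged sketch (names are ours): reconstruct the p x M coefficient matrix B
# from latent blocks Gamma_j supported on the overlapping neighborhood groups.
def assemble_B(gammas, groups, p, M):
    B = np.zeros((p, M))
    for G_j, N_j in zip(gammas, groups):
        B[N_j, :] += G_j          # each block contributes only on its own group
    return B

# Two overlapping groups over p = 3 features and M = 2 datasets
groups = [[0, 1], [1, 2]]
gammas = [np.ones((2, 2)), np.zeros((2, 2))]   # only the first group is "selected"
B = assemble_B(gammas, groups, p=3, M=2)       # features 0 and 1 nonzero, feature 2 zero
```

Note that feature 1 belongs to both groups, yet it is selected even though only one of its groups is active — exactly the behavior the penalty in (3) is designed to produce.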
We propose the following form for the core penalty function ρ(Γj).
| (4) |
The outer penalty determines the sparsity and unbiasedness levels, and the positive half of any popular concave penalty function can be used; for example, lasso [25], SCAD [9], MCP [27], or log-sum [6, 1]. The inner penalty determines how the penalties of the columns are combined and yields different types of models. The choice ρ2(Γ) = ‖Γ‖F selects all components in an all-in-or-all-out fashion and leads to the homogeneity model, while the choice ρ2(Γ) = ‖Γ𝑇‖2,1 (the L2,1 norm) allows each column to become zero and leads to the heterogeneity model. The combination ρ2(Γ) = α‖Γ‖F + (1 − α)‖Γ𝑇‖2,1 can also be considered.
For example, we can choose the log-sum penalty for ρ1.
| (5) |
Then, depending on the choice for ρ2, we can have two types of penalties
| (6) |
In both choices, the entries of each column of Γj become zero or nonzero in an all-in-or-all-out fashion. The weights τj should take into account the size of the group and are thus recommended to have the form for some dj > 0. We can use homogeneous penalty weights (dj = 1), or we can reflect the importance of each group by choosing adaptive weights dj. Noting that, for example, the coefficients in (2) are proportional to the absolute correlation between the response variable and predictor j, dj can be chosen to be the reciprocal of the average of the absolute sample correlations.
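To make the role of the weights concrete, the homogeneity version of a group-weighted log-sum penalty can be sketched as follows. This is an illustration only: the parameterization ρ1(x) = log(1 + x∕η) and the weight form τj = λ·sqrt(|group| × M)·dj are our assumptions, and the exact constants in (5) and in the recommended τj may differ.

```python
import numpy as np

# Illustrative sketch only (assumed parameterization, not the paper's exact (5)):
# log-sum outer penalty applied to the Frobenius norm of each latent block,
# with weights that grow with the group size and with the adaptive factor d_j.
def logsum_group_penalty(gammas, M, lam=1.0, eta=0.5, d=None):
    d = d if d is not None else [1.0] * len(gammas)
    total = 0.0
    for G_j, d_j in zip(gammas, d):
        tau_j = lam * np.sqrt(G_j.shape[0] * M) * d_j   # weight grows with group size
        total += tau_j * np.log(1.0 + np.linalg.norm(G_j) / eta)
    return total

pen_zero = logsum_group_penalty([np.zeros((3, 2))], M=2)  # zero block -> zero penalty
pen_one = logsum_group_penalty([np.ones((3, 2))], M=2)    # nonzero block -> positive penalty
```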
Note that an L1- or L2-norm-based penalty discourages the simultaneous inclusion of highly correlated variables/groups. In other words, true signals can be pushed out of the model by their highly correlated neighbors, and this is the case in our model as well. Due to the construction of the groups, some of them are often similar or can even be identical. As a remedy to this issue, we follow the idea of the elastic net [32], which adds a ridge penalty.
The ridge penalty effectively reduces the correlations between groups and facilitates inclusion of all potentially important signals. This completes the objective function of our model.
or equivalently
where
Many existing penalties can be viewed as a special case of ours. If we have only one dataset and we choose ρ1(x) = x and ρ2(Γ) = ‖Γ‖F, then our method simplifies to the sparse regression incorporating graphical structure among predictors [26]. If no edge exists or no graphical information is available, we have for all j. In this case, if we choose the minimax concave penalty for ρ1, our method reduces to the methods introduced in Liu et al. [21]. If we choose ρ1(x) = x and the Frobenius norm for ρ2, our model simplifies to the integrative analysis model with the group lasso penalty proposed by Ma et al. [22].
3 |. ALGORITHM
In this section, we present an efficient algorithm to fit our model with the log-sum penalties defined in (6), which will enhance its usefulness in analysis of high-dimensional data such as genomics data. We also consider two other penalties in this work. Instead of (5), we can use the MCP penalty [27] for ρ1.
| (7) |
Or, we can have the convex penalty
| (8) |
The algorithms for (7) and (8) can be found in Chang et al. [8].
Let and where is the vector of unconstrained coefficients in . Let be the submatrix of Xm including the columns corresponding to and let . Denoting Δ = (Δ1, ... , Δp), our objective function can be decomposed into a differentiable part L(Δ)+ PR(Δ) and a non-differentiable part or where
We use the accelerated proximal gradient descent algorithm (FISTA, Beck and Teboulle [3]) to fit our models. While the log-sum penalty is not convex, its second derivative is bounded from below and it satisfies the criteria in Gong et al. [10]. Propositions 1 and 2 describe how to evaluate the proximal operators for and , respectively. Let be the proximal operator associated with penalty P(Δ) evaluated at Δ, as defined below.
Proposition 1.
For t < η∕(λ maxj τj), the proximal operator associated with the penalty is given by
| (9) |
where
| (10) |
Proposition 2.
For t < η∕(λ maxj τj), the proximal operator associated with the penalty is given by
| (11) |
where hj satisfies
| (12) |
The proofs for Propositions 1 and 2 are included in Web Appendix A. Note that (12) is a piecewise quadratic equation in hj whose analytic solution can be easily obtained as follows. Let . Equation (12) can be rewritten as
| (13) |
Sort in ascending order and assume, for simplicity, that
for some K ≤ M. First, note that hj = 1 if and only if . Suppose and . From (13), we have the candidate solution hkj as follows.
If for some k ∈ {1, ... , K}, it is indeed the solution for (12). Otherwise, is the solution for (12).
The algorithm uses the standard accelerated proximal gradient descent algorithm with the backtracking line search. Each iteration requires for and for . We have also investigated the non-accelerated proximal gradient descent algorithm and found that the accelerated version has a substantial advantage when the sample size is small and the ridge penalty λR is 0 or close to 0.
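The accelerated scheme used above can be sketched generically as follows. This is a hedged illustration (all names are ours): a least-squares loss and a plain soft-threshold stand in for the loss L and the proximal operators of Propositions 1 and 2, which would replace `prox` in the actual fit; the backtracking line search is also omitted for brevity.

```python
import numpy as np

# Generic FISTA sketch (Beck & Teboulle): minimize f(x) + g(x) with smooth f
# and proximable g, using momentum extrapolation between iterates.
def fista(grad_f, prox_g, x0, step, n_iter=300):
    x, z, t = x0.copy(), x0.copy(), 1.0
    for _ in range(n_iter):
        x_new = prox_g(z - step * grad_f(z), step)          # proximal gradient step
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0    # momentum schedule
        z = x_new + ((t - 1.0) / t_new) * (x_new - x)       # extrapolation point
        x, t = x_new, t_new
    return x

# Stand-in example: lasso on a small noiseless problem
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
beta_true = np.zeros(10); beta_true[:3] = 2.0
y = X @ beta_true
lam = 0.1
grad = lambda b: X.T @ (X @ b - y) / len(y)
prox = lambda v, s: np.sign(v) * np.maximum(np.abs(v) - s * lam, 0.0)  # soft threshold
step = 1.0 / np.linalg.norm(X.T @ X / len(y), 2)  # 1 / Lipschitz constant of grad
beta_hat = fista(grad, prox, np.zeros(10), step)
```

In the actual model, `prox` would be the group-wise operators given in Propositions 1 and 2, applied to each latent block Δj.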
4 |. THEORETICAL PROPERTIES
In this section, we study the theoretical properties of the proposed method. The main goal is to provide conditions under which the oracle inequality and the oracle property hold in the context of integrative analysis. Although the theorem statements and proofs may look similar to those in Yu and Liu [26], the implications extend to the analysis of a large number of datasets. Also, note that the consistency result presented here (Theorem 3) is more general than that of Yu and Liu [26], as the oracle property therein is discussed only for fixed p.
Let the set of important variables of the m-th dataset be given for each m, and let J0 be the union of all important variables. Define s0 = |J0| as the number of all important variables. Let J1 be the set of groups which contain important features only, J2 the set of groups which contain at least one important gene, and J3 the set of groups which contain unimportant features only, and let s1 = |J1|, s2 = |J2|, and s3 = |J3|. Focusing on general penalties with ρ(⋅) ≥ ‖ ⋅ ‖F, we first present the oracle inequalities under homogeneous penalty weights dj = 1, i.e., for j = 1, ... , p. These are non-asymptotic finite-sample properties which account for a diverging number of datasets and predictors. Then, we discuss model selection consistency and asymptotic normality under adaptive penalty weights dj. To this end, define and . Noting that the ridge penalty is not required for the oracle properties to hold and only needs to be small enough, we set λR = 0 in this section.
For simplicity, we assume nm = n for m = 1, ... , M and thus N = Mn, and define
We vectorize Γj in this section, γj = vec(Γj), and the penalty is written as
with a slight abuse of notation, ρ(γj) ≡ ρ(Γj). The objective function is then as follows.
| (14) |
Let be the minimizer of (14) and be an optimal decomposition of .
We present the oracle inequalities for estimation and prediction errors. Let be the largest empirical variance of predictors. Let β0 = vec(B0) be the stacked true regression coefficients and be the stacked nonzero true regression coefficients. Let be the smallest absolute value of nonzero true coefficients across all datasets. For , let , be the set of all optimal decompositions of β, and Kτ(β), be the number of nonzero γj’s in the optimal decomposition of β which has the minimal number of nonzero γj ‘s, i.e., . Denote . We can check J1 = J0, J2 = J0, , and Kτ = s0 if the graph G has no edge. We need the following assumptions.
Assumption 1.
The important feature set J0 is covered by . That is, .
Assumption 2.
The errors εm are independent N(0, σ2Inm) vectors for m = 1, ... , M.
Assumption 3.
There exists a constant κ > 0 such that
where is the set of all optimal decompositions Γ = (γ1, ... , γp) of β such that .
In order to select the correct model, groups that include any unimportant variable must not be selected; only groups consisting solely of important variables may be selected. Assumption 1 ensures that all important variables are covered by the groups with important variables only. Although we assume Gaussian errors in Assumption 2, the asymptotic properties presented in this paper hold for any iid mean-zero sub-Gaussian errors. Assumption 3 is similar to the restricted eigenvalue condition or the compatibility condition [5], which is commonly used for these types of inequalities, but has been tailored to our proposed penalty.
Theorem 1.
(Oracle inequalities) Suppose Assumptions 1, 2, and 3 hold. Assume ρ(γ) is a norm such that ρ(γ) ≥ ‖γ‖2. Let dj = 1, i.e., for j = 1, ... , p. If we choose for some A > 0, then the following inequalities hold with probability at least 1 − 2exp(−A∕2).
Please see Web Appendix B for proofs. Note that the results of Theorem 1 are general and consistent with the results shown in existing literature. For example, if M = 1 and we choose ρ1(x) = x and ρ2(Γ) = ‖Γ‖F, we obtain the same results as in Yu and Liu [26]. If, in addition, there is no edge in the graph, we obtain the results similar to Bickel et al. [5].
We now present the oracle property focusing on the homogeneity model ρ(γ) = ‖γ‖2. The objective function can be written in terms of Γ as follows.
| (15) |
Let be the minimizer of (15) and be the solution. Let represent the set of subsets of which covers the important variables J0. That is, if and only if Define and . This set is not empty due to Assumption 1. Note that we have 𝑆0 = s0 if the graph G has no edge. Let Q2 > 0 be the smallest eigenvalue of and let . In Theorems 2 and 3, we present low level conditions required for model selection consistency and asymptotic normality, respectively. In Corollaries 1 and 2, we list conditions for individual parameters required for the oracle property, which will depend on the adaptivity of the penalty weights dj.
Theorem 2.
(Model selection consistency) Suppose Assumptions 1 and 2 hold. Consider ρ(γ) = ‖γ‖2. If
| (16) |
then we have with probability tending to 1.
Remark 1.
The first two terms in (16) control the deviation of the nonzero coefficients from their ground truth. The last two terms in (16) ensure the penalties are large enough to suppress the coefficients of unimportant predictors.
Our method also possesses the property of asymptotic normality. However, in order to have -consistency, we need a stronger condition compared to the model selection consistency.
Theorem 3.
(Asymptotic normality) Assume the conditions in Theorem 2, and further assume
| (17) |
Let for any sequence of nonzero vectors 𝜶 of length M|J0|. Then, we have
We now investigate conditions for individual factors which guarantee the oracle property.
Assumption 4.
𝑆0 ≍ s0 ≍ nα where 0 ≤ α < 1.
Assumption 5.
𝑄1 ≍ 𝑄2 ≍ ξ ≍ 1.
Assumption 6.
.
Assumption 7.
λ = O(n−(1+α)∕2) and d∗ ≍ 1.
Assumption 4 requires that the number of important variables grow more slowly than the sample size. This is also connected in part to the condition on the smallest eigenvalue 𝑄2 in Assumption 5. The predictors can always be standardized, so we can have 𝑄1 ≍ 1 as well. The assumption ξ ≍ 1 is similar to, but weaker than, the irrepresentable condition [28], since the bound need not be less than 1. We consider the signal-to-noise ratio fixed at a constant level; therefore, Assumption 6 is plausible. Assumption 7 sets a penalty cap which limits the bias for important variables caused by the penalty. The conditions for M, p, and the lower bound of λ depend on the minimum adaptive penalty weight d∗ on the unimportant variables.
Corollary 1.
(Strongly adaptive penalty weights) Suppose Assumptions 4–7 hold. If d∗ ≍ Nγ∕2 with γ ≥ 1, then the conditions (16) and (17) are satisfied if
If the adaptive penalty weights for unimportant variables are chosen at a rate of N1∕2 or higher, the number M of datasets our method can accommodate for the oracle property depends only on the number of important variables, and we can have an exponentially growing number of datasets with respect to n raised to a certain power. However, if the penalty weights are weakly adaptive, meaning that the minimum adaptive penalty weight for unimportant variables grows at a rate lower than N1∕2, our method may only accommodate a polynomially increasing number of datasets with respect to n.
Corollary 2.
(Weakly adaptive penalty weights) Suppose Assumptions 4–7 hold. If d∗ ≍ Nγ∕2 with α < γ < 1, then the conditions (16) and (17) are satisfied if
It is worth noting that while the oracle inequality (Theorem 1) holds with the convex penalty and no adaptation (dj = 1), the oracle property (Theorems 2 and 3) requires an adaptive penalty. This result is consistent with the behavior of the ordinary lasso regression. The L1 penalty can achieve the oracle inequality [4], but cannot achieve the oracle property without further assumptions [28]. The adaptive lasso [31] or non-convex penalties [9] can achieve the oracle property.
5 |. SIMULATION
We conduct a simulation study to evaluate the performance of our method compared to existing integrative learning methods that do not incorporate graph information. We compare fully heterogeneous (FHT; independent estimation and tuning) models, integrative homogeneity (IHM) models, and integrative heterogeneity (IHT) models. IHM and IHT refer to the homogeneity model and the heterogeneity model, respectively, as defined in Zhao et al. [29]. We denote our SIL methods by SIL-Lasso, SIL-MCP, and SIL-LS, which use (8), (7), and (5) for ρ1, respectively. The heterogeneity SIL-Lasso uses (8) for ρ2 while fixing α = 1 for its homogeneity version. The homogeneity versions of SIL-MCP and SIL-LS use ρ2(Γ) = ‖Γ‖F and the heterogeneity versions use ρ2(Γ) = ‖Γ𝑇‖2,1.
The FHT competing models include Lasso [25], Enet [32], and SRIG [26], the IHM competitors include L2 gMCP [21] and gLasso [22], and the IHT competitors include L1 gMCP [21] and sgLasso (sparse gLasso), which uses
We describe how to generate the precision matrix of Xm. For each m = 1, ... , M, we generate a block diagonal matrix
where each sub-matrix is a pB × pB symmetric matrix. We consider three different types of graphical structure for the sub-matrices, depending on the scenario. The detailed procedure is as follows.
- Set the sub-matrix to a pB × pB zero matrix.
- Depending on the scenario, generate the nonzero lower triangular entries specified below:
  - Scenario 1 (ring type): and for k > 1 are nonzero.
  - Scenario 2 (hub type): for k > 1 are nonzero.
  - Scenario 3 (random type): each entry is nonzero with probability 3∕pB.
- Fill in the upper triangular entries by symmetry.
- Normalize the sub-matrix such that the diagonal elements of its inverse become 1.
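The generation procedure can be sketched for the ring scenario as follows. This is a hedged illustration: the magnitude of the nonzero off-diagonal entries is our assumption (the paper's actual values are not reproduced here), but the normalization step — rescaling so the diagonal of the inverse equals 1 — is shown exactly as described.

```python
import numpy as np

# Hedged sketch of one pB x pB precision block (Scenario 1, ring type).
# The off-diagonal magnitude `off` is an assumed value for illustration.
def ring_precision_block(pB, off=0.4):
    Om = np.eye(pB)
    for k in range(pB):
        Om[k, (k + 1) % pB] = off   # ring edges, including the wrap-around edge
        Om[(k + 1) % pB, k] = off   # symmetrize
    # normalize so that the diagonal of the inverse (the covariance) is 1
    d = np.sqrt(np.diag(np.linalg.inv(Om)))
    return Om * np.outer(d, d)

Om = ring_precision_block(10)
Sigma = np.linalg.inv(Om)   # covariance block with unit diagonal
```

The full precision matrix is then the block-diagonal matrix with B such blocks on its diagonal.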
The true regression coefficients βm are given by
for some vector 𝜶. To create heterogeneity, the second block of features is set to have no influence on the outcome variable with probability pht. That is, we have
For each scenario, each row of Xm is independently sampled from the mean-zero multivariate normal distribution whose precision matrix is generated as above. Then, the responses are generated from the linear model as follows.
where . We generate a total of N = nM observations with each dataset assigned n samples.
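The data-generating step for one dataset can be sketched as follows. All concrete values here are stand-ins for the quantities defined above (the covariance, the true coefficients, and the dimensions), not the paper's actual settings.

```python
import numpy as np

# Sketch of generating one dataset: rows of X_m from N(0, Sigma), responses from
# the linear model y_m = X_m beta_m + eps_m with error variance sigma2.
rng = np.random.default_rng(1)
p, n, sigma2 = 20, 200, 1.0
Sigma = np.eye(p)                        # stand-in for the block-diagonal covariance
beta_m = np.zeros(p); beta_m[:5] = 1.0   # stand-in for the true coefficients
Xm = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
ym = Xm @ beta_m + np.sqrt(sigma2) * rng.standard_normal(n)
```

Repeating this for m = 1, ... , M yields the N = nM observations used in the simulation.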
We consider M = 5 datasets with p = 100 features (B = 10, pB = 10). The error variance is σ2 = 1, and we use α = [ 1 1∕3 ⋯ 1∕3]T for Scenarios 1 and 3 and α = [ 1 1∕4 ⋯ 1∕4]T for Scenario 2. This yields a signal-to-noise ratio of roughly 2.5 for all scenarios. We tune our methods by the validation method: the tuning parameters are selected simultaneously via a grid search over the multi-dimensional tuning parameter space, and the tuple that minimizes the validated prediction error is selected and used for predicting the testing data. For example, IHM-SIL-LS for Scenario 1 in Table 1 searches over the 25 × 10 × 6 grid of (λ, η, λR) values. The training sample size is n = 200, the validation sample size is nυ = 200, and the testing sample size is nt = 1000. Every method is fitted and tuned for a total of 100 replicates. In the tables, we report the simulation results evaluated by the mean squared prediction error (MSE), the average L2 distance between the estimated and true coefficients, the false positive rate (FPR), and the false negative rate (FNR).
TABLE 1.
Simulation results for homogeneity data. FHT: fully heterogeneous models; IHM: integrative homogeneity models; IHT: integrative heterogeneity models; Graph: Y indicates that the method incorporates graph information; MSE: mean squared prediction error; L2: average L2 distance between estimated and true coefficients; FPR: false positive rate; FNR: false negative rate.
| Type | Method | Graph | Scenario 1 |  |  |  | Scenario 2 |  |  |  | Scenario 3 |  |  |  |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  |  |  | MSE | L2 | FPR | FNR | MSE | L2 | FPR | FNR | MSE | L2 | FPR | FNR |
| FHT | Lasso |  | 1.274 (.004) | 1.509 (.010) | 0.330 (.005) | 0.238 (.005) | 1.348 (.004) | 1.902 (.014) | 0.402 (.005) | 0.234 (.006) | 1.274 (.004) | 1.539 (.014) | 0.313 (.005) | 0.297 (.008) |
|  | Enet |  | 1.274 (.004) | 1.509 (.010) | 0.330 (.005) | 0.238 (.005) | 1.348 (.004) | 1.902 (.014) | 0.402 (.005) | 0.234 (.006) | 1.274 (.004) | 1.539 (.014) | 0.313 (.005) | 0.297 (.008) |
|  | SRIG | Y | 1.156 (.004) | 1.052 (.008) | 0.125 (.003) | 0.112 (.003) | 1.157 (.003) | 1.269 (.014) | 0.100 (.005) | 0.000 (.000) | 1.194 (.005) | 1.223 (.015) | 0.208 (.006) | 0.089 (.004) |
| IHM | gLasso |  | 1.185 (.004) | 1.262 (.010) | 0.573 (.008) | 0.026 (.003) | 1.255 (.004) | 1.670 (.013) | 0.718 (.008) | 0.010 (.002) | 1.186 (.004) | 1.285 (.014) | 0.576 (.009) | 0.096 (.007) |
|  | L2 gMCP |  | 1.157 (.005) | 1.116 (.019) | 0.156 (.015) | 0.231 (.015) | 1.164 (.004) | 1.097 (.017) | 0.152 (.014) | 0.056 (.007) | 1.144 (.004) | 1.070 (.016) | 0.161 (.016) | 0.254 (.012) |
|  | SIL-Lasso | Y | 1.099 (.004) | 0.838 (.018) | 0.187 (.019) | 0.006 (.002) | 1.130 (.003) | 1.011 (.013) | 0.169 (.016) | 0.000 (.000) | 1.109 (.004) | 0.916 (.014) | 0.229 (.017) | 0.009 (.002) |
|  | SIL-MCP | Y | 1.109 (.004) | 0.911 (.020) | 0.092 (.015) | 0.037 (.005) | 1.120 (.003) | 0.923 (.014) | 0.045 (.009) | 0.000 (.000) | 1.111 (.003) | 0.935 (.014) | 0.122 (.014) | 0.030 (.004) |
|  | SIL-LS | Y | 1.102 (.004) | 0.864 (.018) | 0.119 (.017) | 0.028 (.004) | 1.120 (.003) | 0.921 (.014) | 0.066 (.013) | 0.000 (.000) | 1.107 (.004) | 0.912 (.013) | 0.115 (.015) | 0.033 (.005) |
| IHT | sgLasso |  | 1.194 (.004) | 1.308 (.014) | 0.536 (.016) | 0.034 (.004) | 1.266 (.004) | 1.706 (.017) | 0.689 (.013) | 0.021 (.004) | 1.194 (.004) | 1.318 (.016) | 0.552 (.015) | 0.105 (.008) |
|  | L1 gMCP |  | 1.190 (.006) | 1.260 (.017) | 0.052 (.005) | 0.348 (.009) | 1.190 (.006) | 1.140 (.026) | 0.065 (.005) | 0.136 (.012) | 1.170 (.006) | 1.164 (.020) | 0.068 (.006) | 0.345 (.011) |
|  | SIL-Lasso | Y | 1.105 (.004) | 0.852 (.016) | 0.204 (.019) | 0.011 (.002) | 1.135 (.003) | 1.044 (.014) | 0.159 (.014) | 0.000 (.000) | 1.115 (.004) | 0.936 (.014) | 0.242 (.018) | 0.015 (.003) |
|  | SIL-MCP | Y | 1.127 (.004) | 0.991 (.019) | 0.045 (.007) | 0.072 (.005) | 1.128 (.005) | 0.957 (.028) | 0.031 (.006) | 0.000 (.000) | 1.130 (.004) | 1.022 (.015) | 0.073 (.009) | 0.068 (.006) |
|  | SIL-LS | Y | 1.116 (.004) | 0.923 (.018) | 0.061 (.009) | 0.060 (.004) | 1.125 (.004) | 0.944 (.015) | 0.046 (.010) | 0.000 (.000) | 1.123 (.004) | 0.974 (.014) | 0.084 (.010) | 0.064 (.006) |
In Table 1, we consider the case where all datasets have a homogeneous sparsity structure, i.e., pht = 0. For all scenarios, the integrative approaches (IHM and IHT) tend to outperform the fully heterogeneous methods (FHT), as they take advantage of the common sparsity structure of the coefficients; the exception is SRIG, the fully heterogeneous method that incorporates graph information. Since our methods also use the graphical knowledge, they clearly outperform the other existing integrative learning methods; in particular, our three IHM methods show the best performance overall. Although our IHT versions lose slightly more weak signals than our IHM versions do, the loss is still much less severe than for the other existing IHT methods. This demonstrates the advantages of incorporating network information into integrative learning.
In Table 2, some datasets can have a different sparsity structure, with pht = 0.3. We observe performance patterns similar to those in Table 1. Although all methods show slightly worse FPR and FNR than in Table 1 due to the heterogeneity in the sparsity structure of the coefficients, our methods still achieve substantially better variable selection than the methods without graph incorporation or the non-integrative learning methods. It is particularly worth noting that the existing IHT methods deteriorate more, relative to the homogeneous data (Table 1), than our IHT methods do. This again confirms the advantages of incorporating network information into integrative learning.
TABLE 2.
Simulation results for heterogeneous data. FHT: fully heterogeneous models; IHM: integrative homogeneity models; IHT: integrative heterogeneity models; Graph: Y indicates that the method incorporates graph information; MSE: mean squared prediction error; L2: average L2 distance between the estimated and true coefficients; FPR: false positive rate; FNR: false negative rate.
| Type | Method | Graph | S1 MSE | S1 L2 | S1 FPR | S1 FNR | S2 MSE | S2 L2 | S2 FPR | S2 FNR | S3 MSE | S3 L2 | S3 FPR | S3 FNR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FHT | Lasso | | 1.245 (.005) | 1.427 (.011) | 0.295 (.005) | 0.247 (.005) | 1.310 (.005) | 1.821 (.015) | 0.357 (.006) | 0.257 (.007) | 1.244 (.005) | 1.458 (.015) | 0.278 (.005) | 0.313 (.008) |
| | Enet | | 1.245 (.005) | 1.427 (.011) | 0.295 (.005) | 0.247 (.005) | 1.310 (.005) | 1.821 (.015) | 0.357 (.006) | 0.257 (.007) | 1.244 (.005) | 1.458 (.015) | 0.278 (.005) | 0.313 (.008) |
| | SRIG | Y | 1.135 (.004) | 0.967 (.011) | 0.112 (.003) | 0.115 (.004) | 1.132 (.004) | 1.149 (.016) | 0.094 (.006) | 0.000 (.000) | 1.164 (.005) | 1.120 (.015) | 0.193 (.006) | 0.088 (.005) |
| IHM | gLasso | | 1.178 (.004) | 1.238 (.010) | 0.576 (.008) | 0.030 (.004) | 1.243 (.004) | 1.641 (.013) | 0.706 (.008) | 0.016 (.003) | 1.179 (.004) | 1.262 (.013) | 0.570 (.009) | 0.099 (.007) |
| | L2 gMCP | | 1.159 (.005) | 1.123 (.021) | 0.157 (.015) | 0.265 (.016) | 1.169 (.006) | 1.152 (.024) | 0.209 (.016) | 0.080 (.011) | 1.147 (.004) | 1.090 (.018) | 0.180 (.018) | 0.281 (.015) |
| | SIL-Lasso | Y | 1.105 (.005) | 0.860 (.020) | 0.217 (.019) | 0.013 (.003) | 1.137 (.004) | 1.064 (.022) | 0.215 (.020) | 0.003 (.002) | 1.113 (.004) | 0.938 (.017) | 0.266 (.018) | 0.014 (.003) |
| | SIL-MCP | Y | 1.113 (.004) | 0.913 (.021) | 0.123 (.015) | 0.049 (.006) | 1.127 (.005) | 0.980 (.025) | 0.073 (.010) | 0.003 (.002) | 1.113 (.004) | 0.945 (.018) | 0.140 (.014) | 0.034 (.004) |
| | SIL-LS | Y | 1.107 (.004) | 0.884 (.020) | 0.137 (.016) | 0.039 (.005) | 1.126 (.005) | 0.970 (.025) | 0.085 (.012) | 0.003 (.002) | 1.110 (.004) | 0.926 (.017) | 0.156 (.016) | 0.033 (.005) |
| IHT | sgLasso | | 1.195 (.004) | 1.311 (.016) | 0.499 (.019) | 0.061 (.007) | 1.260 (.005) | 1.706 (.021) | 0.629 (.016) | 0.051 (.008) | 1.196 (.005) | 1.335 (.019) | 0.500 (.019) | 0.145 (.012) |
| | L1 gMCP | | 1.191 (.006) | 1.246 (.020) | 0.073 (.006) | 0.369 (.012) | 1.192 (.006) | 1.197 (.029) | 0.079 (.005) | 0.192 (.013) | 1.166 (.005) | 1.164 (.023) | 0.071 (.005) | 0.381 (.012) |
| | SIL-Lasso | Y | 1.107 (.005) | 0.862 (.022) | 0.214 (.018) | 0.016 (.003) | 1.128 (.004) | 1.033 (.018) | 0.179 (.017) | 0.001 (.001) | 1.119 (.006) | 0.952 (.021) | 0.249 (.018) | 0.025 (.004) |
| | SIL-MCP | Y | 1.129 (.004) | 0.987 (.019) | 0.070 (.007) | 0.091 (.008) | 1.126 (.005) | 0.957 (.026) | 0.067 (.006) | 0.000 (.000) | 1.133 (.005) | 1.031 (.020) | 0.085 (.008) | 0.087 (.009) |
| | SIL-LS | Y | 1.117 (.004) | 0.913 (.020) | 0.082 (.010) | 0.079 (.006) | 1.124 (.004) | 0.970 (.020) | 0.073 (.009) | 0.001 (.001) | 1.122 (.005) | 0.976 (.020) | 0.096 (.010) | 0.080 (.008) |
As the proposed methods rely on graph information, we conduct a sensitivity analysis that accounts for uncertainty in the graphical knowledge and its inconsistency with the regression coefficients. In this analysis, we randomly remove about 20% of the edges from the true graph and use the reduced graph as the working graph. This mimics the intermediate situation where only strong interactions are known, or where edges (partial correlations) are missing due to a screening of predictors. In Table 3, we can see that the performance of the methods using graph information deteriorates, while that of the methods not using it remains similar to Table 1. However, the deterioration of our methods is very small compared to that of SRIG, which we attribute to the effect of integrative learning. This supports the robustness of our methods to misspecified graphical information with missing edges.
TABLE 3.
Sensitivity analysis results. FHT: fully heterogeneous models; IHM: integrative homogeneity models; IHT: integrative heterogeneity models; Graph: Y indicates that the method incorporates graph information; MSE: mean squared prediction error; L2: average L2 distance between the estimated and true coefficients; FPR: false positive rate; FNR: false negative rate.
| Type | Method | Graph | S1 MSE | S1 L2 | S1 FPR | S1 FNR | S2 MSE | S2 L2 | S2 FPR | S2 FNR | S3 MSE | S3 L2 | S3 FPR | S3 FNR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FHT | Lasso | | 1.267 (.004) | 1.508 (.009) | 0.322 (.005) | 0.250 (.005) | 1.349 (.004) | 1.925 (.014) | 0.400 (.006) | 0.226 (.006) | 1.279 (.005) | 1.565 (.014) | 0.329 (.005) | 0.285 (.007) |
| | Enet | | 1.267 (.004) | 1.508 (.009) | 0.322 (.005) | 0.250 (.005) | 1.349 (.004) | 1.925 (.014) | 0.400 (.006) | 0.226 (.006) | 1.279 (.005) | 1.565 (.014) | 0.329 (.005) | 0.285 (.007) |
| | SRIG | Y | 1.184 (.005) | 1.220 (.018) | 0.150 (.006) | 0.166 (.006) | 1.245 (.006) | 1.495 (.024) | 0.275 (.009) | 0.158 (.007) | 1.239 (.006) | 1.469 (.018) | 0.151 (.005) | 0.199 (.007) |
| IHM | gLasso | | 1.178 (.004) | 1.256 (.011) | 0.583 (.008) | 0.024 (.004) | 1.254 (.004) | 1.685 (.012) | 0.717 (.007) | 0.005 (.002) | 1.191 (.004) | 1.310 (.013) | 0.607 (.009) | 0.084 (.006) |
| | L2 gMCP | | 1.148 (.004) | 1.080 (.019) | 0.251 (.020) | 0.166 (.016) | 1.161 (.004) | 1.108 (.017) | 0.168 (.015) | 0.042 (.006) | 1.146 (.005) | 1.080 (.019) | 0.182 (.014) | 0.214 (.010) |
| | SIL-Lasso | Y | 1.108 (.004) | 0.917 (.015) | 0.221 (.020) | 0.006 (.002) | 1.141 (.003) | 1.102 (.012) | 0.257 (.016) | 0.002 (.001) | 1.114 (.004) | 0.969 (.015) | 0.231 (.017) | 0.017 (.004) |
| | SIL-MCP | Y | 1.110 (.004) | 0.931 (.016) | 0.111 (.017) | 0.032 (.005) | 1.124 (.004) | 0.942 (.015) | 0.073 (.010) | 0.006 (.002) | 1.112 (.004) | 0.951 (.014) | 0.099 (.015) | 0.060 (.007) |
| | SIL-LS | Y | 1.106 (.004) | 0.906 (.015) | 0.124 (.018) | 0.030 (.005) | 1.122 (.003) | 0.949 (.011) | 0.082 (.012) | 0.008 (.002) | 1.107 (.004) | 0.927 (.013) | 0.081 (.014) | 0.054 (.006) |
| IHT | sgLasso | | 1.189 (.004) | 1.290 (.014) | 0.576 (.016) | 0.035 (.006) | 1.263 (.004) | 1.719 (.017) | 0.688 (.012) | 0.017 (.003) | 1.204 (.005) | 1.363 (.015) | 0.556 (.016) | 0.104 (.008) |
| | L1 gMCP | | 1.187 (.007) | 1.254 (.020) | 0.070 (.008) | 0.350 (.011) | 1.179 (.006) | 1.124 (.023) | 0.056 (.004) | 0.132 (.012) | 1.164 (.005) | 1.141 (.020) | 0.074 (.005) | 0.321 (.012) |
| | SIL-Lasso | Y | 1.113 (.004) | 0.934 (.016) | 0.217 (.019) | 0.015 (.003) | 1.149 (.004) | 1.131 (.014) | 0.247 (.015) | 0.014 (.002) | 1.115 (.004) | 0.968 (.012) | 0.235 (.017) | 0.020 (.004) |
| | SIL-MCP | Y | 1.130 (.004) | 1.026 (.016) | 0.042 (.006) | 0.105 (.007) | 1.135 (.004) | 0.981 (.021) | 0.029 (.004) | 0.047 (.005) | 1.127 (.004) | 1.015 (.015) | 0.035 (.005) | 0.121 (.009) |
| | SIL-LS | Y | 1.120 (.004) | 0.971 (.015) | 0.058 (.009) | 0.090 (.006) | 1.137 (.003) | 0.997 (.013) | 0.049 (.007) | 0.060 (.004) | 1.122 (.004) | 0.989 (.016) | 0.044 (.007) | 0.110 (.008) |
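The edge-removal scheme used to build the working graph in the sensitivity analysis can be sketched as follows; the function name and edge-list representation are illustrative, not taken from the paper.

```python
import random

def degrade_graph(edges, drop_rate=0.2, seed=0):
    """Drop a fixed fraction of edges from the true graph uniformly at
    random, mimicking a database with missing interactions."""
    rng = random.Random(seed)
    n_drop = round(drop_rate * len(edges))
    dropped = set(rng.sample(range(len(edges)), n_drop))
    return [e for i, e in enumerate(edges) if i not in dropped]
```

The reduced edge list is then supplied to the graph-based penalties in place of the true graph.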
6 |. APPLICATION
Alzheimer’s disease (AD) is a major cause of dementia. The Alzheimer’s Disease Neuroimaging Initiative (ADNI) is a large-scale, multisite longitudinal study in which researchers at 63 sites track the progression of AD in the human brain through normal aging, early mild cognitive impairment (EMCI), and late mild cognitive impairment (LMCI) to dementia or AD. Its goal is to validate diagnostic and prognostic biomarkers that can predict the progression of AD.
In our data analysis, we investigate the association of patients’ gene expression levels with an imaging marker that captures AD progression. Specifically, we treat the fluorodeoxyglucose positron emission tomography (FDG-PET) measure averaged over regions of interest (ROIs), which reflects cell metabolism, as the response variable; cells affected by AD tend to show reduced metabolism. Since the association of FDG with gene expression levels may change at different stages of AD, we divide the total of 675 subjects into three groups depending on their baseline disease status: CN (cognitively normal, n = 229), MCI (EMCI + LMCI, n = 402), and AD (n = 44).
The samples in each group are randomly split into a training set (50%), a validation set (25%), and a testing set (25%). For each split, we fit our method and the existing methods considered in Section 5, plus some fully homogeneous models to check the heterogeneity of the datasets, and report the prediction errors for the testing samples. The regularization parameters of all methods are tuned by the validation method, and the graph information is obtained from KEGG. This procedure is repeated for 200 random splits of the data, and the average squared prediction errors are reported in Table 4.
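The repeated random splitting can be sketched as follows; the function name is illustrative, and the split is applied separately within each diagnostic group (CN, MCI, AD) across the 200 seeds.

```python
import numpy as np

def split_indices(n, seed, fracs=(0.5, 0.25, 0.25)):
    """Randomly partition n sample indices into training, validation,
    and testing sets with the given fractions."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_tr = int(fracs[0] * n)
    n_val = int(fracs[1] * n)
    return perm[:n_tr], perm[n_tr:n_tr + n_val], perm[n_tr + n_val:]
```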
TABLE 4.
Average prediction errors for the ADNI dataset. FHM: fully homogeneous models; FHT: fully heterogeneous models; IHM: integrative homogeneity models; IHT: integrative heterogeneity models; Graph: Y indicates that the method incorporates graph information.
| Type | Method | Graph | MCI | AD | CN |
|---|---|---|---|---|---|
| FHM | Lasso | | 0.926 | 1.094 | 1.020 |
| | Enet | | 0.901 | 1.075 | 0.987 |
| | SRIG | Y | 0.955 | 1.063 | 1.035 |
| FHT | Lasso | | 0.916 | 1.028 | 0.996 |
| | Enet | | 0.881 | 0.983 | 0.961 |
| | SRIG | Y | 0.933 | 1.035 | 1.008 |
| IHM | gLasso | | 0.934 | 1.017 | 0.991 |
| | L2 gMCP | | 0.898 | 1.027 | 1.005 |
| | SIL-Lasso | Y | 0.873 | 0.946 | 0.946 |
| | SIL-MCP | Y | 0.876 | 0.947 | 0.948 |
| | SIL-LS | Y | 0.879 | 0.948 | 0.945 |
| IHT | sgLasso | | 0.939 | 1.022 | 0.996 |
| | L1 gMCP | | 0.914 | 1.045 | 1.001 |
| | SIL-Lasso | Y | 0.862 | 0.950 | 0.940 |
| | SIL-MCP | Y | 0.878 | 0.934 | 0.955 |
| | SIL-LS | Y | 0.881 | 0.941 | 0.949 |
As shown in Table 4, all of the FHM methods tend to underperform the corresponding FHT methods, suggesting that the model of interest likely has different parameters for different groups. Despite such heterogeneity, our methods show the best prediction performance for all groups. The existing integrative learning approaches, which do not incorporate network information, seem to have difficulty integrating information from the different datasets.
Another benefit of incorporating graphical pathway information is the enhanced interpretability of the selected genes. To assess this, we conduct a pathway enrichment analysis based on the 30 most frequently selected genes of each method over the 200 repeats. Table 5 lists 10 enriched pathways that are related to Alzheimer’s disease and the associated p-values. No method that omits the graph information, including the existing integrative learning approaches, yields any enriched pathway. Apart from the SIL methods, only the fully heterogeneous SRIG yields some enriched pathways, and its p-values tend to be larger than those of our methods.
TABLE 5.
Ten enriched pathways and p-values for each method. ‘-’ indicates not enriched in the genes selected by the method. FHM: fully homogeneous models; FHT: fully heterogeneous models; IHM: integrative homogeneity models; IHT: integrative heterogeneity models; Graph: Y indicates that the method incorporates graph information. P1: AGE-RAGE signaling pathway; P2: Angiopoietin receptor Tie2-mediated signaling; P3: Chemokine signaling pathway; P4: CXCR4-mediated signaling events; P5: Glucocorticoid receptor regulatory network; P6: IL2-mediated signaling events; P7: MAPKinase signaling pathway; P8: Prolactin signaling pathway; P9: Signaling by PDGF; P10: Tuberculosis.
| Type | Method | Graph | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FHM | Lasso | | - | - | - | - | - | - | - | - | - | - |
| | Enet | | - | - | - | - | - | - | - | - | - | - |
| | SRIG | Y | - | - | - | - | - | - | - | - | - | - |
| FHT | Lasso | | - | - | - | - | - | - | - | - | - | - |
| | Enet | | - | - | - | - | - | - | - | - | - | - |
| | SRIG | Y | 1.6e-5 | 5.8e-5 | - | - | 2.9e-4 | 7.3e-5 | 2.8e-4 | 1.8e-4 | - | - |
| IHM | gLasso | | - | - | - | - | - | - | - | - | - | - |
| | L2 gMCP | | - | - | - | - | - | - | - | - | - | - |
| | SIL-Lasso | Y | 2.1e-9 | 4.2e-6 | - | 9.7e-8 | 1.1e-6 | 5.8e-6 | 1.1e-6 | - | 1.6e-6 | 2.9e-6 |
| | SIL-MCP | Y | 1.1e-6 | 1.9e-6 | 2.1e-5 | 1.2e-6 | 1.6e-5 | 4.1e-8 | 1.6e-5 | 1.9e-7 | 4.9e-6 | 1.9e-5 |
| | SIL-LS | Y | 1.3e-6 | 1.1e-4 | - | - | 2.0e-5 | 1.5e-4 | 1.1e-4 | 1.0e-5 | 8.1e-5 | - |
| IHT | sgLasso | | - | - | - | - | - | - | - | - | - | - |
| | L1 gMCP | | - | - | - | - | - | - | - | - | - | - |
| | SIL-Lasso | Y | 5.8e-11 | 1.3e-9 | 7.7e-9 | 1.3e-12 | 1.3e-6 | 1.3e-7 | 1.3e-6 | - | - | 1.7e-7 |
| | SIL-MCP | Y | - | 2.1e-4 | 2.6e-4 | - | 1.5e-6 | - | 5.0e-6 | 2.7e-5 | 2.0e-4 | |
| | SIL-LS | Y | 1.3e-6 | 1.1e-4 | - | 4.4e-5 | - | 1.5e-4 | 1.1e-4 | 1.0e-5 | - | - |
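Enrichment p-values like those in Table 5 are commonly obtained from a one-sided hypergeometric test on the overlap between the selected genes and a pathway’s gene set; the paper does not specify its enrichment tool, so the following is only a sketch of that standard test, with illustrative names.

```python
from math import comb

def enrichment_pvalue(k, n_selected, pathway_size, n_genes):
    """P(overlap >= k) under the hypergeometric null: n_selected genes
    drawn without replacement from n_genes, of which pathway_size belong
    to the pathway."""
    total = comb(n_genes, n_selected)
    upper = min(n_selected, pathway_size)
    return sum(
        comb(pathway_size, j) * comb(n_genes - pathway_size, n_selected - j)
        for j in range(k, upper + 1)
    ) / total
```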
7 |. DISCUSSION
We have proposed a novel integrative learning method, called SIL, which can incorporate the graphical structure of features. SIL possesses appealing theoretical properties, is scalable to high-dimensional data, and has been shown to outperform existing integrative learning methods through a simulation study and a real data analysis.
In practice, the ground-truth sparsity structure of β0 may not be consistent with the graphical structure. However, when the discrepancy is moderate, our proposed method will still perform reasonably well by detecting the subset of groups that covers all or most of the nonzero coefficients. Note that the sensitivity analysis (Table 3), conducted partly in consideration of such inconsistency, suggests that the proposed method is quite robust. Even when the graphical information is completely irrelevant to the sparsity structure, our method does not fail: the tuning procedure will discourage group-wise selection, and we can expect performance comparable to that of plain ridge regression.
On the other hand, it is widely acknowledged that the graph information obtained from existing databases could be inaccurate or incomplete. It is potentially of future interest to investigate approaches that are more robust to incomplete graph information. One potential approach is to combine the graph information from existing databases and the estimated graph information using the data being analyzed. Another direction for future research is to incorporate graph information that may vary between datasets.
ACKNOWLEDGMENTS
This work is partly supported by NIH grant RF1AG063481. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The complete ADNI acknowledgement is available online.
Footnotes
Conflict of interest
The authors declare no potential conflict of interests.
Financial disclosure
None reported.
SUPPORTING INFORMATION
The supplementary material available online includes additional algorithms and the proofs of the theorems.
References
- [1] Armagan A, Dunson DB, and Lee J, 2013: Generalized double Pareto shrinkage. Statistica Sinica, 23, no. 1, 119–143.
- [2] Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, and Sherlock G, 2000: Gene Ontology: tool for the unification of biology. Nature Genetics, 25, no. 1, 25–29, doi: 10.1038/75556.
- [3] Beck A and Teboulle M, 2009: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2, no. 1, 183–202.
- [4] Bühlmann P and van de Geer S, 2011: Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Series in Statistics, Berlin: Springer.
- [5] Bickel PJ, Ritov Y, and Tsybakov AB, 2009: Simultaneous analysis of lasso and Dantzig selector. Annals of Statistics, 37, no. 4, 1705–1732, doi: 10.1214/08-AOS620.
- [6] Candès EJ, Wakin MB, and Boyd SP, 2008: Enhancing sparsity by reweighted l1 minimization. Journal of Fourier Analysis and Applications, 14, no. 5, 877–905, doi: 10.1007/s00041-008-9045-x.
- [7] Chang C, Kundu S, and Long Q, 2018: Scalable Bayesian variable selection for structured high-dimensional data. Biometrics, 74, no. 4, 1372–1382.
- [8] Chang C, Oh J, and Long Q, 2020: GRIA: graphical regularization for integrative analysis. Proceedings of the 2020 SIAM International Conference on Data Mining, 604–612.
- [9] Fan J and Li R, 2001: Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, no. 456, 1348–1360, doi: 10.1198/016214501753382273.
- [10] Gong P, Zhang C, Lu Z, Huang JZ, and Ye J, 2013: A general iterative shrinkage and thresholding algorithm for non-convex regularized optimization problems. Proceedings of the 30th International Conference on Machine Learning, ICML’13, II-37–II-45.
- [11] Huang Y, Zhang Q, Zhang S, Huang J, and Ma S, 2017: Promoting similarity of sparsity structures in integrative analysis with penalization. Journal of the American Statistical Association, 112, no. 517, 342–350.
- [12] Jacob L, Obozinski G, and Vert J-P, 2009: Group lasso with overlap and graph lasso. Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, 433–440.
- [13] Kanehisa M, Furumichi M, Tanabe M, Sato Y, and Morishima K, 2017: KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research, 45, no. D1, D353–D361.
- [14] Li C and Li H, 2008: Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics, 24, no. 9, 1175–1182.
- [15] Li F and Zhang NR, 2010: Bayesian variable selection in structured high-dimensional covariate spaces with applications in genomics. Journal of the American Statistical Association, 105, no. 491, 1202–1214.
- [16] Li Q, Wang S, Huang C-C, Yu M, and Shao J, 2014: Meta-analysis based variable selection for gene expression data. Biometrics, 70, no. 4, 872–880.
- [17] Li Z, Chang C, Kundu S, and Long Q, 2020: Bayesian generalized biclustering analysis via adaptive structured shrinkage. Biostatistics, 21, no. 3, 610–624.
- [18] Liu B, Wu C, Shen X, and Pan W, 2017: A novel and efficient algorithm for de novo discovery of mutated driver pathways in cancer. The Annals of Applied Statistics, 11, no. 3, 1481.
- [19] Liu J, Huang J, and Ma S, 2013: Incorporating network structure in integrative analysis of cancer prognosis data. Genetic Epidemiology, 37, no. 2, 173–183.
- [20] Liu J, Huang J, Zhang Y, Lan Q, Rothman N, Zheng T, and Ma S, 2014: Integrative analysis of prognosis data on multiple cancer subtypes. Biometrics, 70, no. 3, 480–488.
- [21] Liu J, Ma S, and Huang J, 2014: Integrative analysis of cancer diagnosis studies with composite penalization. Scandinavian Journal of Statistics, 41, no. 1, 87–103.
- [22] Ma S, Huang J, and Song X, 2011: Integrative analysis and variable selection with multiple high-dimensional data sets. Biostatistics, 12, no. 4, 763–775.
- [23] Pan W, Xie B, and Shen X, 2010: Incorporating predictor network in penalized regression with application to microarray data. Biometrics, 66, no. 2, 474–484.
- [24] Stingo FC, Chen YA, Tadesse MG, and Vannucci M, 2011: Incorporating biological information into linear models: a Bayesian approach to the selection of pathways and genes. Annals of Applied Statistics, 5, no. 3, 1978–2002.
- [25] Tibshirani R, 1996: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58, no. 1, 267–288.
- [26] Yu G and Liu Y, 2016: Sparse regression incorporating graphical structure among predictors. Journal of the American Statistical Association, 111, no. 514, 707–720.
- [27] Zhang C-H, 2010: Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics, 38, no. 2, 894–942, doi: 10.1214/09-AOS729.
- [28] Zhao P and Yu B, 2006: On model selection consistency of lasso. Journal of Machine Learning Research, 7, 2541–2563.
- [29] Zhao Q, Shi X, Huang J, Liu J, Li Y, and Ma S, 2015: Integrative analysis of ‘-omics’ data using penalty functions. Wiley Interdisciplinary Reviews: Computational Statistics, 7, no. 1, 99–108.
- [30] Zhao Y, Chang C, and Long Q, 2019: Knowledge-guided statistical learning methods for analysis of high-dimensional -omics data in precision oncology. JCO Precision Oncology, no. 3, 1–9, doi: 10.1200/PO.19.00018.
- [31] Zou H, 2006: The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, no. 476, 1418–1429.
- [32] Zou H and Hastie T, 2005: Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67, no. 2, 301–320.