Author manuscript; available in PMC: 2024 Apr 1.
Published in final edited form as: Stat Anal Data Min. 2022 Nov 8;16(2):120–134. doi: 10.1002/sam.11601

Integrative Learning of Structured High-Dimensional Data from Multiple Datasets

Changgee Chang 1,*, Zongyu Dai 2, Jihwan Oh 1, Qi Long 1,*
PMCID: PMC10195070  NIHMSID: NIHMS1844882  PMID: 37213790

Summary

Integrative learning of multiple datasets has the potential to mitigate the challenge of small n and large p that is often encountered in analysis of big biomedical data such as genomics data. Detection of weak yet important signals can be enhanced by jointly selecting features for all datasets. However, the set of important features may not always be the same across all datasets. Although some existing integrative learning methods allow heterogeneous sparsity structure where a subset of datasets can have zero coefficients for some selected features, they tend to yield reduced efficiency, reinstating the problem of losing weak important signals. We propose a new integrative learning approach which can not only aggregate important signals well in homogeneous sparsity structure, but also substantially alleviate the problem of losing weak important signals in heterogeneous sparsity structure. Our approach exploits a priori known graphical structure of features and encourages joint selection of features that are connected in the graph. Integrating such prior information over multiple datasets enhances the power, while also accounting for the heterogeneity across datasets. Theoretical properties of the proposed method are investigated. We also demonstrate the limitations of existing approaches and the superiority of our method using a simulation study and analysis of gene expression data from ADNI.

Keywords: integrative learning, horizontally partitioned data, knowledge-guided learning, network-based penalty, high-dimensional data

1 |. INTRODUCTION

Massive amounts of high-throughput -omics data generated in recent studies offer great promise for deepening our understanding of the molecular underpinnings and mechanisms of complex diseases such as Alzheimer’s disease and cancer. At the same time, they still present significant analytical challenges, as the sample size in a single study is often small to moderate. There is a large body of literature on regularized regression models for the analysis of high-dimensional data in the setting where the number of variables is larger than the sample size. While many of these methods have appealing asymptotic properties, there is a growing recognition that their performance in practice is often unsatisfactory when the sample size is small and the signal-to-noise ratio is low. A number of approaches have been proposed to mitigate this limitation of regularized regressions, particularly for the analysis of genomics data.

One popular approach is to incorporate prior knowledge on high-dimensional predictors such as gene regulatory pathways and co-expression networks that are represented by graphs and can be obtained from public or commercial databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG, Kanehisa et al. [13]) and Gene Ontology [2]. The knowledge-guided approach for structured data whose variables lie on a graph [15] has been adopted in supervised learning such as regression [14, 23, 26, 7] and in unsupervised learning [17, 18], through carefully designed penalty functions in a frequentist framework or prior specifications in a Bayesian framework. The rationale behind incorporating the graphical structure of features into supervised learning is the fact that phenotypic biomarkers are often manifested as a result of interaction between a group of genes (a pathway). It is typically not the case that the important features are unrelated. Rather, one or more groups of closely related genes have predictive power jointly. Therefore, the graphical information can be integrated by encouraging group-wise selection of the model coefficients. For example, Li and Li [14] and Pan et al. [23] propose network-based penalties which encourage joint selection of the predictors that are connected in the graph. Li and Zhang [15] and Stingo et al. [24] use a Markov random field (MRF) prior combined with a spike and slab prior to encourage selection of connected predictors. More recently, Chang et al. [7] propose a structured shrinkage prior which mitigates some issues associated with prior Bayesian methods. While all the aforementioned methods use the predictor graph in an edge-by-edge manner and encourage selection of adjacent nodes, [26] proposed a method that uses the predictor graph in a node-by-node manner and encourages selection of the neighborhood group of each predictor.
These knowledge-guided statistical learning methods have shown improved prediction accuracy and improved power for detecting weak yet important signals in finite samples and they encourage selection of pathways rather than individual features, leading to biologically more meaningful and interpretable results [30].

Another useful approach for mitigating the small sample size problem is integrative learning of multiple datasets that contain the same set of variables, also known as horizontally partitioned data. Multiple datasets are broadly defined as being collected from multiple studies/sites, from multiple sex/racial groups, or from multiple related disease groups. One key advantage of integrative learning in regression is that it improves the power for detecting important predictors that are shared across the datasets [29]. Ma et al. [22] proposed an integrative analysis approach that assumes the same sparsity structure of the regression coefficients across all datasets but allows for different effect sizes. The homogeneous sparsity assumption, however, can be overly restrictive in some applications. This assumption is relaxed in subsequent work by, among others, Li et al. [16], Liu et al. [20, 21], and Huang et al. [11], which allow for heterogeneity in the sparsity structure across multiple datasets. However, these existing integrative learning methods do not account for important graph information for structured predictors such as genomics data, which has the potential to further improve the power of detecting weak yet important signals. Moreover, because the existing heterogeneity models allow the coefficient of a selected feature to be zero in some datasets, they may miss such weak yet important signals, weakening the power of integrative learning. To the best of our knowledge, there has been little work on incorporating graph information into integrative learning except for [19], and their approach relies on the assumption of homogeneous sparsity across all datasets, which may be unrealistic in many applications. For example, when performing integrative learning of datasets from populations at different stages of a disease, the set of important predictors may vary across these datasets.

To address this gap, we propose a novel integrative learning approach, called Structured Integrative Learning (SIL), which enables incorporation of structural information such as graphical knowledge on predictors. The key idea underlying SIL is that if a group of features as defined by pathways/networks is important in one dataset, it is likely to be important for the other datasets as well. Our approach is designed to select ‘groups of features’ jointly for all datasets rather than selecting ‘individual features’ jointly. As such, it is expected to further improve the power of detecting weak, yet important signals. Our proposed method can accommodate both homogeneous and heterogeneous sparsity structures, and in particular it is theoretically justified. We establish oracle inequalities that provide non-asymptotic upper bounds on the estimation and prediction errors. We also investigate the conditions for the oracle property to hold in the setting where both the number of datasets and the number of predictors diverge. Another contribution of our work is an iterative shrinkage-thresholding algorithm [3, 10] for fitting our model, which is much more scalable than the (sub)gradient descent algorithms that have typically been used in prior work on integrative learning. We show that the proximal operators associated with our regularizers have analytic solutions and can be evaluated very efficiently.

We note that Chang et al. [8] presented intermediate results of our research on the proposed method. Compared to that earlier version, this work includes several significant improvements. The current work presents a more general penalty formulation of which the penalty in the prior version can be viewed as a special case. Theoretical properties of the general penalty are rigorously investigated, while the earlier version included no theoretical results. We include new regularizers based on the log-sum penalty, along with efficient algorithms for them. The regularizers in the prior paper are still included and compared in the simulation and data analysis studies. In the simulation study, we compare performance using both homogeneous and heterogeneous data, while the earlier version used only heterogeneous data. Moreover, we perform a sensitivity analysis that investigates the robustness of our method against inaccurate and/or incomplete graphical information, whereas no sensitivity analysis was conducted in the earlier work. Finally, we include a pathway enrichment analysis in the data analysis, which demonstrates that our method yields outcomes that are more interpretable and biologically meaningful, while the prior work included no pathway enrichment analysis.

The remainder of this article is organized as follows. We describe the problem of interest and present our proposed method in Section 2, and then present the numerical algorithm in Section 3. In Section 4, we present the theoretical properties of the proposed method. In Section 5, we conduct simulation studies to evaluate the performance of our approach in comparison with several existing methods. In Section 6, we further illustrate the strengths of the proposed method through analysis of real data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). We conclude the paper with some discussion in Section 7.

2 |. METHOD

2.1 |. Background

To fix ideas, consider fitting a linear regression model using data from M datasets. In the m-th dataset, we have an $n_m \times p$ predictor matrix $X^m$ and an $n_m \times 1$ response vector $y^m$, where $n_m$ is the sample size of the m-th dataset and p is the number of predictors. Let $N = \sum_m n_m$ be the total sample size. The model of interest is the linear model

$y^m = X^m \beta^{0m} + e^m, \quad m = 1, \ldots, M,$ (1)

where $\beta^{0m} = (\beta_1^{0m}, \ldots, \beta_p^{0m})^T$ is the $p \times 1$ true coefficient vector and $e^m \sim \mathcal{N}(0, \sigma^2 I)$ is the $n_m \times 1$ error vector for the m-th dataset. The regularized least squares loss function is generally given by

$\ell(B) = \sum_{m=1}^M L_m(\beta^m) + P(B),$

where B = [β1βM] is the p × M coefficient matrix, P(B) is a penalty on B, and

$L_m(\beta^m) = \frac{1}{2n_m}\|y^m - X^m\beta^m\|_2^2.$

The most general estimator would allow all $\beta^m$ to be different and use a separable penalty $P(B) = \sum_{m=1}^M P_m(\beta^m)$. This is equivalent to independently minimizing

$\ell_m(\beta^m) = L_m(\beta^m) + P_m(\beta^m),$

which is equivalent to analyzing each dataset separately. We call this model the fully heterogeneous model. On the other hand, the least general estimator would assume all coefficients to be the same across all datasets, $\beta^m \equiv \beta$. This fully homogeneous model is equivalent to merging all datasets with each data point weighted by the reciprocal of the size of the dataset it belongs to. The weights prevent a large dataset from dominating the loss function and keep the coefficients from leaning favorably only toward the large dataset.
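As a minimal numpy sketch (function and variable names are ours, not from the paper), the per-dataset loss and its pooled version can be written as:

```python
import numpy as np

def dataset_loss(y_m, X_m, beta_m):
    """L_m(beta^m) = ||y^m - X^m beta^m||_2^2 / (2 n_m)."""
    n_m = len(y_m)
    r = y_m - X_m @ beta_m
    return r @ r / (2.0 * n_m)

def pooled_loss(ys, Xs, B):
    """Sum of per-dataset losses for a p x M coefficient matrix B.
    Because each residual is weighted by 1/n_m, every dataset contributes
    equally regardless of its size; passing identical columns in B gives the
    objective of the fully homogeneous (merged, reweighted) model."""
    return sum(dataset_loss(y, X, B[:, m])
               for m, (y, X) in enumerate(zip(ys, Xs)))
```

Minimizing each column of B separately (with its own penalty) corresponds to the fully heterogeneous model described above.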

Obviously, the full homogeneity assumption can be overly restrictive. Each dataset often has its own characteristics, and the association between the outcome and the predictors can be different. Ignoring the difference can lead to poor or suboptimal performance in estimation and prediction. On the other hand, the fully heterogeneous model fails to borrow information across datasets, and the regression model for each dataset can suffer the curse of dimensionality. The motivation of integrative learning is to aggregate common information from multiple datasets while accounting for heterogeneity across these datasets.

2.2 |. Structured Integrative Learning

In this work, we focus on the case where the graph information for predictors is the same across the M datasets. Denote by $G = \langle V, E \rangle$ the graph on predictors X, where $V = \{1, \ldots, p\}$ is the set of features and E is the set of edges between the features. Let $A = [a_{jk}]$ be the adjacency matrix associated with G and let $\mathcal{A}_j = \{k : a_{jk} = 1\} \cup \{j\}$ be the neighborhood of the j-th feature including itself. Let $e = |E|$ be the number of edges in G and $a_j = |\mathcal{A}_j|$ be the number of members in $\mathcal{A}_j$. The graphical information on features often represents the partial correlation structure of the features. That is, the presence of an edge between features j and k implies the (j, k) entry of the feature precision matrix is nonzero, while the absence of an edge means a zero entry. In the analysis of genomics data, such graphs often represent gene regulatory pathways or co-expression networks, which can be obtained from existing databases such as KEGG [13].
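For illustration, the neighborhoods $\mathcal{A}_j$ can be read off the adjacency matrix directly (a sketch with names of our choosing, not the paper's):

```python
import numpy as np

def neighborhoods(A):
    """A_j = {k : a_jk = 1} union {j} for each node j of the feature graph,
    given a symmetric 0/1 adjacency matrix A of shape (p, p)."""
    p = A.shape[0]
    return [set(np.flatnonzero(A[j]).tolist()) | {j} for j in range(p)]
```

The group sizes $a_j = |\mathcal{A}_j|$ used later in the penalty weights are then just the sizes of these sets.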

For Model (1), note that we have $\beta^m = \Omega^m c^m$, where $\Omega^m = n_m\,\mathbb{E}(X^{mT}X^m)^{-1}$ and $c^m = \frac{1}{n_m}\mathbb{E}(X^{mT}y^m)$. This yields

$\beta^m = \sum_{j=1}^p c_j^m \omega_j^m,$ (2)

where $\omega_j^m$ is the j-th column of $\Omega^m$. Since the absence of an edge implies a zero partial correlation, we have $\mathrm{supp}(\omega_j^m) = \mathcal{A}_j$. Following the observation from Yu and Liu [26], either $\beta_{\mathcal{A}_j}^m \equiv (\beta_k^m)_{k \in \mathcal{A}_j}$ can have nonzero values (if $c_j^m \neq 0$), or there will be no contribution from the j-th group to the effect size $\beta^m$ (if $c_j^m = 0$). The key premise of our work is that if a group $\mathcal{A}_j$ is important for one dataset, it is likely to be important for other datasets as well. To encourage joint selection of the feature groups $\mathcal{A}_j$ across all datasets, we propose the following penalty.

$P(B) = \lambda\|B\|_\tau, \qquad \|B\|_\tau \equiv \inf_{B = \sum_j \Gamma_j,\ \mathrm{supp}(\gamma_j^m) = \mathcal{A}_j}\ \sum_{j=1}^p \tau_j \rho(\Gamma_j),$ (3)

where Γj=[γj1γjM] in an arbitrary p × M matrix.

Note that the standard group lasso penalty, such as $P(B) = \lambda\sum_{j=1}^p \tau_j \|B_{\mathcal{A}_j}\|_F$ where $B_{\mathcal{A}_j} = [\beta_{\mathcal{A}_j}^1 \cdots \beta_{\mathcal{A}_j}^M]$ and $\|\cdot\|_F$ is the Frobenius norm, has an undesirable characteristic. Since the $\mathcal{A}_j$ are overlapping in general, it yields $\beta_j^m \neq 0$ if and only if all groups including j are selected. In other words, gene j can be selected if and only if all groups (pathways) including j are important, which is not plausible in biology. On the other hand, the proposed penalty (3) introduces latent coefficients $\Gamma_j$ and can yield $\beta_j^m \neq 0$ if at least one group that contains j is selected. That is, if a pathway is important, then all genes in the pathway can become important. This property is required for consistency with (2). More details about the latent group lasso penalty can be found in Jacob et al. [12]. It is easy to see that applying the proposed penalty is equivalent to replacing $\beta^m$ with $\sum_j \gamma_j^m$ in the loss function and enforcing a penalty on $\Gamma = (\Gamma_1, \ldots, \Gamma_p)$ as follows.

$L(\Gamma) = \sum_{m=1}^M \frac{1}{2n_m}\Big\|y^m - X^m\sum_{j=1}^p \gamma_j^m\Big\|_2^2, \qquad P(\Gamma) = \lambda\sum_{j=1}^p \tau_j \rho(\Gamma_j), \qquad \mathrm{supp}(\gamma_j^m) = \mathcal{A}_j.$

Minimizing $L(\Gamma) + P(\Gamma)$ will yield the same solution as minimizing $L(B) + P(B)$, with $\hat\beta^m = \sum_j \hat\gamma_j^m$. Efficient algorithms for selected penalties are presented in Section 3.

We propose that the core penalty function $\rho(\Gamma_j)$ take the following form.

$\rho(\Gamma_j) = \rho_1 \circ \rho_2(\Gamma_j).$ (4)

The outer penalty $\rho_1 : \mathbb{R}_+ \to \mathbb{R}_+$ determines the sparsity and unbiasedness levels, and the positive half of any popular concave penalty function can be used; for example, lasso [25], SCAD [9], MCP [27], and log-sum [6, 1]. The inner penalty $\rho_2 : \mathbb{R}^{p \times M} \to \mathbb{R}_+$ determines how the penalties on the columns $\gamma_j^m$ are combined and yields different types of models. The choice $\rho_2(\Gamma) = \|\Gamma\|_F$ selects all components in an all-in-or-all-out fashion and leads to the homogeneity model, whereas the choice $\rho_2(\Gamma) = \|\Gamma^T\|_{2,1}$ (the $L_{2,1}$ norm) allows each $\gamma_j^m$ to become zero and leads to the heterogeneity model. The combination $\rho_2(\Gamma) = \alpha\|\Gamma\|_F + (1-\alpha)\|\Gamma^T\|_{2,1}$ can also be considered.

For example, we can choose the log-sum penalty for ρ1.

$\rho_{LS}(x) = \eta\log(1 + x/\eta).$ (5)

Then, depending on the choice for ρ2, we can have two types of penalties

$P_{LS1}(\Gamma) = \lambda\sum_{j=1}^p \tau_j\,\rho_{LS}(\|\Gamma_j\|_F), \qquad P_{LS2}(\Gamma) = \lambda\sum_{j=1}^p \tau_j\,\rho_{LS}(\|\Gamma_j^T\|_{2,1}).$ (6)

In both choices, the entries of $\gamma_j^m$ become zero or nonzero in an all-in-or-all-out fashion. The weights $\tau_j$ should take into account the size of the group $\mathcal{A}_j$ and are thus recommended to have the form $\tau_j = \sqrt{a_j}\,d_j$ for some $d_j > 0$. We can use homogeneous penalty weights ($d_j = 1$), or we can reflect the importance of the group $\mathcal{A}_j$ by choosing adaptive weights $d_j$. Noting that, for example, the coefficients are proportional to the absolute correlation between the response variable and predictor j as in (2), $d_j$ can be chosen to be reciprocal to the average of the absolute sample correlations, $d_j^{-1} = M^{-1}\sum_m |x_j^{mT}y^m/n_m|$.
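The two penalties in (6) can be transcribed directly (a sketch; `Gammas` is assumed to be a list of the p latent matrices $\Gamma_j$, and the names are ours):

```python
import numpy as np

def rho_ls(x, eta):
    """Log-sum penalty rho_LS(x) = eta * log(1 + x/eta)."""
    return eta * np.log1p(x / eta)

def P_LS1(Gammas, tau, lam, eta):
    """P_LS1: log-sum of the Frobenius norm of each Gamma_j
    (homogeneity model: a group enters or leaves for all datasets at once)."""
    return lam * sum(t * rho_ls(np.linalg.norm(G), eta)
                     for t, G in zip(tau, Gammas))

def P_LS2(Gammas, tau, lam, eta):
    """P_LS2: log-sum of the L_{2,1} norm of Gamma_j^T, i.e. the sum of the
    column norms ||gamma_j^m||_2 (heterogeneity model: individual dataset
    columns may be zero)."""
    return lam * sum(t * rho_ls(np.linalg.norm(G, axis=0).sum(), eta)
                     for t, G in zip(tau, Gammas))
```

For large $\eta$ the log-sum is nearly linear, so the two functions approach group lasso penalties on the Frobenius and $L_{2,1}$ norms, respectively.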

Note that $L_1$ or $L_2$ norm based penalties discourage including highly correlated variables/groups simultaneously. In other words, true signals can be pushed out of the model by their highly correlated neighbors, and this is the case in our model as well. Due to the construction of $\mathcal{A}_j$, some $\mathcal{A}_j$'s are often similar or even identical. As a remedy to this issue, we follow the idea of the elastic net [32], which adds a ridge penalty.

$P_R(B) = \frac{\lambda_R}{2}\|B\|_F^2.$

The ridge penalty effectively reduces the correlations between groups and facilitates inclusion of all potentially important signals. This completes the objective function of our model.

$\ell(B) = L(B) + P_R(B) + P(B),$

or equivalently

$\ell(\Gamma) = L(\Gamma) + P_R(\Gamma) + P(\Gamma), \qquad \mathrm{supp}(\gamma_j^m) = \mathcal{A}_j,$

where

$P_R(\Gamma) = \sum_{j=1}^p \frac{\lambda_R a_j}{2}\|\Gamma_j\|_F^2.$

Many existing penalties can be viewed as special cases of ours. If we have only one dataset and we choose $\rho_1(x) = x$ and $\rho_2(\Gamma) = \|\Gamma\|_F$, then our method simplifies to the sparse regression incorporating graphical structure among predictors [26]. If no edge exists or no graphical information is available, we have $\mathcal{A}_j = \{j\}$ for all j. In this case, if we choose the minimax concave penalty for $\rho_1$, our method reduces to the methods introduced in Liu et al. [21]. If we choose $\rho_1(x) = x$ and the Frobenius norm for $\rho_2$, our model simplifies to the integrative analysis model with the group lasso penalty proposed by Ma et al. [22].

3 |. ALGORITHM

In this section, we present an efficient algorithm to fit our model with the log-sum penalties defined in (6), which enhances its usefulness in the analysis of high-dimensional data such as genomics data. We also consider two other penalties in this work. Instead of (5), we can use the MCP penalty [27] for $\rho_1$:

$\rho_{MCP}(x) = \int_0^x \big(1 - u/(\lambda\eta)\big)_+\,du.$ (7)

Or, we can have the convex penalty

$\rho_1(x) = x, \qquad \rho_2(\Gamma) = \alpha\|\Gamma\|_F + (1-\alpha)\|\Gamma^T\|_{2,1}.$ (8)

The algorithms for (7) and (8) can be found in Chang et al. [8].

Let $\delta^m = (\delta_1^{mT}, \ldots, \delta_p^{mT})^T$ and $\Delta_j = [\delta_j^1 \cdots \delta_j^M]$, where $\delta_j^m = \gamma_{\mathcal{A}_j,j}^m$ is the vector of unconstrained coefficients in $\gamma_j^m$. Let $Z_j^m$ be the submatrix of $X^m$ comprising the columns corresponding to $\mathcal{A}_j$, and let $Z^m = [Z_1^m \cdots Z_p^m]$. Denoting $\Delta = (\Delta_1, \ldots, \Delta_p)$, our objective function can be decomposed into a differentiable part $L(\Delta) + P_R(\Delta)$ and a non-differentiable part $P_{LS1}(\Delta)$ or $P_{LS2}(\Delta)$, where

$L(\Delta) = \sum_{m=1}^M \frac{1}{2n_m}\|y^m - Z^m\delta^m\|_2^2, \qquad P_R(\Delta) = \sum_{j=1}^p \frac{\lambda_R a_j}{2}\|\Delta_j\|_F^2,$
$P_{LS1}(\Delta) = \sum_{j=1}^p \lambda\eta\tau_j\log(1 + \|\Delta_j\|_F/\eta), \qquad P_{LS2}(\Delta) = \sum_{j=1}^p \lambda\eta\tau_j\log(1 + \|\Delta_j^T\|_{2,1}/\eta).$
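The construction of $Z^m$ amounts to duplicating each column of $X^m$ once per group that contains it; a sketch (names ours, not the paper's):

```python
import numpy as np

def expand_design(X_m, neighborhoods):
    """Build Z^m = [Z_1^m ... Z_p^m], where Z_j^m holds the columns of X^m
    indexed by the neighborhood A_j. Because the groups overlap, a column of
    X^m may appear in several blocks; delta^m stacks the corresponding free
    coefficients block by block."""
    idx = [np.array(sorted(A_j)) for A_j in neighborhoods]
    Z_m = np.hstack([X_m[:, ix] for ix in idx])
    sizes = [len(ix) for ix in idx]   # the a_j, used to slice delta^m
    return Z_m, sizes
```

After fitting, $\hat\beta^m$ is recovered by scattering each block of $\hat\delta^m$ back onto its neighborhood and summing, per $\hat\beta^m = \sum_j \hat\gamma_j^m$.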

We use the accelerated proximal gradient descent algorithm (FISTA, Beck and Teboulle [3]) to fit our models. While the log-sum penalty is not convex, its second derivative is bounded from below and it satisfies the criteria in Gong et al. [10]. Propositions 1 and 2 describe how to evaluate the proximal operators for PLS1 and PLS2, respectively. Let Δ˜ be the proximal operator associated with penalty P(Δ) evaluated at Δ, as defined below.

$\tilde\Delta \equiv \mathrm{prox}_t(\Delta) \equiv \operatorname*{argmin}_{W = (W_1, \ldots, W_p)}\left(\frac{1}{2t}\sum_{j=1}^p\|W_j - \Delta_j\|_F^2 + P(W)\right).$

Proposition 1.

For $t < \eta/(\lambda\max_j \tau_j)$, the proximal operator associated with the penalty $P_{LS1}(\Delta)$ is given by

$\tilde\Delta_j = \big(1 - \lambda t\tau_j h_j/\|\Delta_j\|_F\big)_+\,\Delta_j, \quad j = 1, \ldots, p,$ (9)

where

$h_j = \dfrac{1 + \|\Delta_j\|_F/\eta - \sqrt{\big(1 + \|\Delta_j\|_F/\eta\big)^2 - 4\lambda t\tau_j/\eta}}{2\lambda t\tau_j/\eta}.$ (10)
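Proposition 1 translates directly into code. The following sketch (function name ours) applies (9)-(10) to a single group; our handling of a negative discriminant, where no interior stationary point exists and the whole group is thresholded to zero, is an assumption on our part:

```python
import numpy as np

def prox_ls1_group(Delta_j, t, lam, tau_j, eta):
    """Proximal step (9)-(10) for one group under P_LS1.
    Assumes the step size satisfies t < eta / (lam * tau_j)."""
    a = np.linalg.norm(Delta_j)          # ||Delta_j||_F
    if a == 0.0:
        return np.zeros_like(Delta_j)
    c = lam * t * tau_j                  # effective shrinkage scale
    b = 1.0 + a / eta
    disc = b * b - 4.0 * c / eta
    if disc < 0.0:                       # no interior stationary point:
        return np.zeros_like(Delta_j)    # group thresholded to zero (assumed)
    h = (b - np.sqrt(disc)) / (2.0 * c / eta)
    return max(0.0, 1.0 - c * h / a) * Delta_j
```

At a nonzero output with norm r, the stationarity condition $(r - a)/t + \lambda\tau_j/(1 + r/\eta) = 0$ of the underlying scalar problem holds, which is a convenient correctness check.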

Proposition 2.

For $t < \eta/(\lambda\max_j \tau_j)$, the proximal operator associated with the penalty $P_{LS2}(\Delta)$ is given by

$\tilde\delta_j^m = \big(1 - \lambda t\tau_j h_j/\|\delta_j^m\|_2\big)_+\,\delta_j^m,$ (11)

where hj satisfies

$h_j = \dfrac{1}{1 + \sum_{l=1}^M\big(\|\delta_j^l\|_2 - \lambda t\tau_j h_j\big)_+/\eta}.$ (12)

The proofs for Propositions 1 and 2 are included in Web Appendix A. Note that (12) is a piecewise quadratic equation in $h_j$ whose analytic solution can be easily obtained as follows. Let $\xi_j^m = \|\delta_j^m\|_2/(\lambda t\tau_j)$. Equation (12) can be rewritten as

$h_j = \dfrac{1}{1 + \lambda t\tau_j\sum_{l=1}^M\big(\xi_j^l - h_j\big)_+/\eta}.$ (13)

Sort $\xi_j^1, \ldots, \xi_j^M$ in ascending order ($\mathcal{O}(M\log M)$) and assume, for simplicity, that

$0 = \xi_j^0 \le \xi_j^1 \le \cdots \le \xi_j^K < 1 \le \cdots \le \xi_j^M,$

for some $K \le M$. First, note that $h_j = 1$ if and only if $\xi_j^M \le 1$. Suppose $\xi_j^M > 1$ and $h_j \in [\xi_j^{k-1}, \xi_j^k)$. From (13), we have the candidate solution $h_j^k$ as follows.

$h_j^k = \dfrac{1 + \lambda t\tau_j\sum_{l=k}^M \xi_j^l/\eta - \sqrt{\Big(1 + \lambda t\tau_j\sum_{l=k}^M \xi_j^l/\eta\Big)^2 - 4\lambda t\tau_j(M-k+1)/\eta}}{2\lambda t\tau_j(M-k+1)/\eta}.$

If $h_j^k \in [\xi_j^{k-1}, \xi_j^k)$ for some $k \in \{1, \ldots, K\}$, it is indeed the solution of (12). Otherwise, $h_j^{K+1} \in [\xi_j^K, 1)$ is the solution of (12).
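The bracketing search for $h_j$ and the resulting column-wise update (11) can be sketched as follows (names ours; `c` plays the role of $\lambda t\tau_j/\eta$ from (13)):

```python
import numpy as np

def solve_h(xi, c):
    """Solve h * (1 + c * sum_l max(xi_l - h, 0)) = 1 for h in (0, 1], i.e.
    the piecewise-quadratic fixed-point equation (13) with c = lam*t*tau_j/eta."""
    xi = np.sort(np.asarray(xi, dtype=float))
    M = xi.size
    if xi[-1] <= 1.0:
        return 1.0
    edges = np.concatenate(([0.0], xi[xi < 1.0], [1.0]))
    h = 1.0
    for k in range(1, edges.size):       # bracket [edges[k-1], edges[k])
        m = M - (k - 1)                  # number of xi_l above the bracket
        S = xi[k - 1:].sum()
        b = 1.0 + c * S
        disc = b * b - 4.0 * c * m
        if disc < 0.0:
            continue
        h = (b - np.sqrt(disc)) / (2.0 * c * m)
        if edges[k - 1] <= h < edges[k]:
            break                        # candidate lies in its own bracket
    return h                             # else: the last (K+1-th) candidate

def prox_ls2_group(Delta_j, t, lam, tau_j, eta):
    """Column-wise proximal update (11) for one group under P_LS2."""
    norms = np.linalg.norm(Delta_j, axis=0)   # ||delta_j^m||_2, m = 1..M
    s = lam * t * tau_j
    h = solve_h(norms / s, s / eta)
    scale = np.maximum(0.0, 1.0 - s * h / np.maximum(norms, 1e-300))
    return Delta_j * scale
```

Only the bracket that actually contains the fixed point returns its candidate, matching the case analysis above.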

The algorithm uses the standard accelerated proximal gradient descent with backtracking line search. Each iteration requires $\mathcal{O}(pN + Me)$ operations for $P_{LS1}$ and $\mathcal{O}(pN + Me + pM\log M)$ for $P_{LS2}$. We have also investigated the non-accelerated proximal gradient descent algorithm and found that the accelerated version has a substantial advantage when the sample size is small and the ridge penalty $\lambda_R$ is 0 or close to 0.

4 |. THEORETICAL PROPERTIES

In this section, we study the theoretical properties of the proposed method. The main goal is to provide the conditions under which the oracle inequality and the oracle property hold in the context of integrative analysis. Although the theorem statements and the proofs may look similar to those in Yu and Liu [26], the implications apply to the analysis of a large number of datasets. Also, note that the result of $\sqrt{n}$-consistency presented here (Theorem 3) is more general than that of Yu and Liu [26], as the oracle property therein is discussed with fixed p only.

Let $J_0^m = \{j : \beta_j^{0m} \neq 0\}$ be the set of important variables of the m-th dataset and $J_0 = \bigcup_{m=1}^M J_0^m$ be the union of all important variables. Define $s_0 = |J_0|$ as the number of all important variables. Let $J_1 = \{j \in J_0 : \mathcal{A}_j \subseteq J_0\}$ be the set of groups which contain important features only, $J_2 = \{j : \mathcal{A}_j \cap J_0 \neq \emptyset\}$ be the set of groups which contain at least one important gene, and $J_3 = \{j \in J_0^c : \mathcal{A}_j \subseteq J_0^c\}$ be the set of groups which contain unimportant features only, and let $s_1 = |J_1|$, $s_2 = |J_2|$, and $s_3 = |J_3|$. Focusing on general penalties with $\rho(\cdot) \ge \|\cdot\|_F$, we first present the oracle inequalities under homogeneous penalty weights $d_j = 1$, i.e., $\tau_j = \sqrt{a_j}$ for $j = 1, \ldots, p$. These are non-asymptotic finite sample properties which account for a diverging number of datasets and predictors. Then, we discuss the model selection consistency and the asymptotic normality under adaptive penalty weights $d_j$. To this end, define $d^* = \max_{j \in J_1} d_j$ and $d_* = \min_{j \in J_1^c} d_j$. Noting that the ridge penalty is not required for the oracle properties to hold and only needs to be small enough, we set $\lambda_R = 0$ in this section.

For simplicity, we assume nm = n for m = 1, ... , M and thus N = Mn, and define

y=1n[y1yM],X=1ndiag(X1,...,XM),β=vec(B),e=1n[e1eM].

We vectorize Γj in this section, γj = vec(Γj), and the penalty is written as

$\|\beta\|_\tau = \min_{\beta = \sum_{j=1}^p \gamma_j}\ \sum_{j=1}^p \tau_j\rho(\gamma_j), \qquad \mathrm{supp}(\gamma_j^m) = \mathcal{A}_j,$

with a slight abuse of notation, $\rho(\gamma_j) \equiv \rho(\Gamma_j)$. Then, the objective function is as follows.

$\frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_\tau.$ (14)

Let $\hat\beta$ be the minimizer of (14) and $\hat\gamma_1, \ldots, \hat\gamma_p$ be an optimal decomposition of $\hat\beta$.

We present the oracle inequalities for estimation and prediction errors. Let $Q_1 = \max_{m,j}\|x_j^m\|_2^2/n$ be the largest empirical variance of the predictors. Let $\beta^0 = \mathrm{vec}(B^0)$ be the stacked true regression coefficients and $\beta_{J_0}^0$ be the stacked nonzero true regression coefficients. Let $\beta_*^0$ be the smallest absolute value of the nonzero true coefficients across all datasets. For $\beta \in \mathbb{R}^{pM}$, let $U(\beta) = \big\{(\gamma_1, \ldots, \gamma_p) : \sum_j \gamma_j = \beta,\ \|\beta\|_\tau = \sum_j \tau_j\rho(\gamma_j),\ \mathrm{supp}(\gamma_j^m) = \mathcal{A}_j\big\}$ be the set of all optimal decompositions of $\beta$, and let $K_\tau(\beta)$ be the minimal number of nonzero $\gamma_j$'s over the optimal decompositions of $\beta$, i.e., $K_\tau(\beta) = \min_{\Gamma \in U(\beta)}|\{j : \gamma_j \neq 0\}|$. Denote $K_\tau = \sup_{\mathrm{supp}(\beta) \subseteq \bigcup_{j \in J_2}\mathcal{A}_j} K_\tau(\beta)$. We can check $J_1 = J_0$, $J_2 = J_0$, $J_3 = J_0^c$, and $K_\tau = s_0$ if the graph G has no edge. We need the following assumptions.

Assumption 1.

The important feature set $J_0$ is covered by $\{\mathcal{A}_j : j \in J_1\}$. That is, $\bigcup_{j \in J_1}\mathcal{A}_j = J_0$.

Assumption 2.

The errors $e^m \overset{iid}{\sim} \mathcal{N}(0, \sigma^2 I)$ for $m = 1, \ldots, M$.

Assumption 3.

There exists a constant κ > 0 such that

$\inf_{|J| \le s_2,\ \beta \in \mathbb{R}^{pM}\setminus\{0\}}\ \inf_{\Gamma \in \mathcal{T}_\tau(\beta, J)}\ \frac{\|X\beta\|_2}{\sqrt{\sum_{j \in J}\tau_j^2\rho(\gamma_j)^2}} \ge \kappa,$

where $\mathcal{T}_\tau(\beta, J)$ is the set of all optimal decompositions $\Gamma = (\gamma_1, \ldots, \gamma_p)$ of $\beta$ such that $\sum_{j \in J^c}\tau_j\rho(\gamma_j) \le 3\sum_{j \in J}\tau_j\rho(\gamma_j)$.

In order to select the correct model, the groups that include any unimportant variable must not be selected and only the groups that have important variables only may be selected. Assumption 1 ensures that all important variables are covered by the groups with important variables only. Although we assume Gaussian errors in Assumption 2, the asymptotic properties presented in this paper hold for any iid mean zero sub-Gaussian errors. Assumption 3 is similar to the restricted eigenvalue condition or the compatibility condition [5] which is commonly used for these types of inequalities but has been tailored to our proposed penalty.

Theorem 1.

(Oracle inequalities) Suppose Assumptions 1, 2, and 3 hold. Assume $\rho(\gamma)$ is a norm such that $\rho(\gamma) \ge \|\gamma\|_2$. Let $d_j = 1$, i.e., $\tau_j = \sqrt{a_j}$ for $j = 1, \ldots, p$. If we choose $\lambda \ge 4\sigma\sqrt{\frac{MQ_1(A + 2\log(Mp))}{n}}$ for some $A > 0$, then the following inequalities hold with probability at least $1 - 2\exp(-A/2)$.

$\|X(\hat\beta - \beta^0)\|_2 \le \frac{4\lambda K_\tau^{1/2}}{\kappa}, \qquad \|\hat\beta - \beta^0\|_\tau \le \frac{16\lambda K_\tau}{\kappa^2}, \qquad \|\hat\beta - \beta^0\|_2 \le \frac{16\lambda K_\tau}{\kappa^2}.$

Please see Web Appendix B for proofs. Note that the results of Theorem 1 are general and consistent with the results shown in existing literature. For example, if M = 1 and we choose ρ1(x) = x and ρ2(Γ) = ‖Γ‖F, we obtain the same results as in Yu and Liu [26]. If, in addition, there is no edge in the graph, we obtain the results similar to Bickel et al. [5].

We now present the oracle property focusing on the homogeneity model ρ(γ) = ‖γ2. The objective function can be written in terms of Γ as follows.

$\frac{1}{2}\Big\|y - X\sum_{j=1}^p\gamma_j\Big\|_2^2 + \lambda\sum_{j=1}^p\tau_j\|\gamma_j\|_2, \qquad \mathrm{supp}(\gamma_j^m) = \mathcal{A}_j.$ (15)

Let $\hat\gamma_1, \ldots, \hat\gamma_p$ be the minimizer of (15) and $\hat\beta = \sum_{j=1}^p\hat\gamma_j$ be the solution. Let $\mathcal{R} \subseteq 2^{J_1}$ represent the collection of subsets of $\{\mathcal{A}_j : j \in J_1\}$ which cover the important variables $J_0$. That is, $R \in \mathcal{R}$ if and only if $\bigcup_{j \in R}\mathcal{A}_j = J_0$. Define $R_0 = \operatorname{argmin}_{R \in \mathcal{R}}\sum_{j \in R}\tau_j^2$ and $S_0 = \sum_{j \in R_0}a_j$. The collection $\mathcal{R}$ is not empty due to Assumption 1. Note that we have $S_0 = s_0$ if the graph G has no edge. Let $Q_2 > 0$ be the smallest eigenvalue of $X_{J_0}^TX_{J_0}$ and let $\xi = \|X_{J_0^c}^TX_{J_0}(X_{J_0}^TX_{J_0})^{-1}\|$. In Theorems 2 and 3, we present low-level conditions required for model selection consistency and asymptotic normality, respectively. In Corollaries 1 and 2, we list conditions on individual parameters required for the oracle property, which depend on the adaptivity of the penalty weights $d_j$.

Theorem 2.

(Model selection consistency) Suppose Assumptions 1 and 2 hold. Consider ρ(γ) = ‖γ2. If

$\dfrac{\sqrt{\log(Ms_0)}}{\beta_*^0\sqrt{n}\,Q_2} + \dfrac{\lambda\sqrt{S_0}\,d^*}{Q_2\beta_*^0} + \dfrac{\sqrt{MQ_1\log(M(p - s_0))}}{\lambda d_* n} + \dfrac{\max(\xi, 1)\,d^*\sqrt{MS_0}}{d_*} \to 0,$ (16)

then we have $\mathrm{sign}(\hat\beta) = \mathrm{sign}(\beta^0)$ with probability tending to 1.

Remark 1.

The first two terms in (16) control the deviation of the nonzero coefficients from their ground truth. The last two terms in (16) ensure the penalties are large enough to suppress the coefficients of unimportant predictors.

Our method also possesses the property of asymptotic normality. However, in order to have $\sqrt{n}$-consistency, we need a stronger condition compared to the model selection consistency.

Theorem 3.

(Asymptotic normality) Assume the conditions in Theorem 2, and further assume

$\dfrac{\lambda d^*\sqrt{nS_0}}{Q_2} \to 0.$ (17)

Let $v = \alpha^T(X_{J_0}^TX_{J_0})^{-1}\alpha$ for any sequence of nonzero vectors $\alpha$ of length $M|J_0|$. Then, we have

$\sqrt{n}\,\alpha^T(\hat\beta_{J_0} - \beta_{J_0}^0)/\sqrt{v} \overset{d}{\to} \mathcal{N}(0, \sigma^2).$

We now investigate conditions for individual factors which guarantee the oracle property.

Assumption 4.

$S_0 \asymp s_0 \asymp n^\alpha$ where $0 \le \alpha < 1$.

Assumption 5.

$Q_1 \asymp Q_2 \asymp \xi \asymp 1$.

Assumption 6.

$\beta_*^0 \asymp s_0^{-1/2} \asymp n^{-\alpha/2}$.

Assumption 7.

$\lambda = o(n^{-(1+\alpha)/2})$ and $d^* \asymp 1$.

The number of important variables must typically be less than the sample size. This is also connected in part to the condition on the smallest eigenvalue $Q_2$ of $X_{J_0}^TX_{J_0}$ in Assumption 5. The predictors can always be standardized, so we can have $Q_1 \asymp 1$ as well. The assumption $\xi \asymp 1$ is similar to, but weaker than, the irrepresentable condition [28], since the bound need not be less than 1. We consider the signal-to-noise ratio fixed at a constant level; therefore, $\|\beta_{J_0}^{0m}\|_2 \asymp 1$ and Assumption 6 are plausible. Assumption 7 sets a penalty cap which limits the bias for important variables caused by the penalty. The conditions for M, p and the lower bound of $\lambda$ depend on the minimum adaptive penalty weight $d_*$ on the unimportant variables.

Corollary 1.

(Strongly adaptive penalty weights) Suppose Assumptions 4–7 hold. If $d_* \asymp N^{\gamma/2}$ with $\gamma \ge 1$, then conditions (16) and (17) are satisfied if

$\log M = o(n^{1-\alpha}), \qquad \log p = o(n^{1-\alpha}), \qquad \lambda^{-1} = o\big(n(\log(Mp))^{-1/2}\big).$

If the adaptive penalty weights for unimportant variables are chosen at a rate of $\sqrt{N}$ or higher, the number M of datasets our method can accommodate for the oracle property depends only on the number of important variables, and we can have an exponentially growing number of datasets with respect to n raised to a certain power. However, if the penalty weights are weakly adaptive, meaning that the minimum adaptive penalty weight for unimportant variables grows at a rate lower than $\sqrt{N}$, our method may only accommodate a polynomially increasing number of datasets with respect to n.

Corollary 2.

(Weakly adaptive penalty weights) Suppose Assumptions 4–7 hold. If $d_* \asymp N^{\gamma/2}$ with $\alpha < \gamma < 1$, then conditions (16) and (17) are satisfied if

$M = o\big(n^{\frac{\gamma-\alpha}{1-\gamma}}\big), \qquad \log p = o\big(M^{1-\gamma}n^{\gamma-\alpha}\big), \qquad \lambda^{-1} = o\big(M^{\frac{1-\gamma}{2}}n^{\frac{1+\gamma}{2}}(\log(Mp))^{-\frac{1}{2}}\big).$

It is worth noting that while the oracle inequality (Theorem 1) holds with the convex penalty and no adaptation (dj = 1), the oracle property (Theorems 2 and 3) requires an adaptive penalty. This result is consistent with the behavior of the ordinary lasso regression. The L1 penalty can achieve the oracle inequality [4], but cannot achieve the oracle property without further assumptions [28]. The adaptive lasso [31] or non-convex penalties [9] can achieve the oracle property.

5 |. SIMULATION

We conduct a simulation study to evaluate the performance of our method compared to existing integrative learning methods that do not incorporate graph information. We compare fully heterogeneous (FHT; independent estimation and tuning) models, integrative homogeneity (IHM) models, and integrative heterogeneity (IHT) models. IHM and IHT refer to the homogeneity model and the heterogeneity model, respectively, as defined in Zhao et al. [29]. We denote our SIL methods by SIL-Lasso, SIL-MCP, and SIL-LS, which use (8), (7), and (5) for ρ1, respectively. The heterogeneity SIL-Lasso uses (8) for ρ2 while fixing α = 1 for its homogeneity version. The homogeneity versions of SIL-MCP and SIL-LS use ρ2(Γ) = ‖Γ‖F and the heterogeneity versions use ρ2(Γ) = ‖Γ𝑇2,1.

The FHT competing models include Lasso [25], Enet [32], and SRIG [26], the IHM competitors include L2 gMCP [21] and gLasso [22], and the IHT competitors include L1 gMCP [21] and sgLasso (sparse gLasso), which uses

$P(B) = \lambda\alpha\|B^T\|_{2,1} + \lambda(1-\alpha)\|B\|_{1,1}.$

We describe how to generate the precision matrix of Xm. For each m = 1, ... , M, we generate a block diagonal matrix

$\Omega^m = \mathrm{diag}(\Omega_1^m, \ldots, \Omega_B^m),$

where each sub-matrix $\Omega_b^m$ is a $p_B \times p_B$ symmetric matrix. We consider three different types of graphical structure for $\Omega_b^m$ depending on the scenario. The detailed procedure goes as follows.

  1. Set $\Omega_b^m$ to a $p_B \times p_B$ zero matrix.

  2. Depending on the scenario, generate the nonzero lower triangular entries specified below as 𝒱(1.5, 0.5).

    • Scenario 1 (ring type): $[\Omega_b^m]_{p_B,1}$ and $[\Omega_b^m]_{k,k-1}$ for $k > 1$ are nonzero.

    • Scenario 2 (hub type): $[\Omega_b^m]_{k,1}$ for $k > 1$ are nonzero.

    • Scenario 3 (random type): each $[\Omega_b^m]_{j,k}$ $(j > k)$ is nonzero with probability $3/p_B$.

  3. Fill in the upper triangular entries: $\Omega_b^m \leftarrow \Omega_b^m + \Omega_b^{mT}$.

  4. Set the diagonal entries: $[\Omega_b^m]_{jj} \leftarrow 0.5 + \sum_{k=1}^{p_B}\big|[\Omega_b^m]_{jk}\big|$.

  5. Normalize $\Omega_b^m$ such that the diagonal elements of its inverse matrix become 1.
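The block-generation steps above can be sketched as follows. The magnitude distribution for the nonzero entries (we use Uniform(0.5, 1.5)) and the exact diagonal rule (0.5 plus the absolute row sum, which guarantees positive definiteness) are assumptions on our part, and the function names are ours:

```python
import numpy as np

def make_block_precision(p_B, scenario, rng):
    """Generate one p_B x p_B precision block following steps 1-5."""
    Om = np.zeros((p_B, p_B))                     # step 1
    draw = lambda: rng.uniform(0.5, 1.5)          # assumed magnitude law
    if scenario == 1:                             # ring: k ~ k-1, plus p_B ~ 1
        for k in range(1, p_B):
            Om[k, k - 1] = draw()
        Om[p_B - 1, 0] = draw()
    elif scenario == 2:                           # hub: every node tied to node 1
        for k in range(1, p_B):
            Om[k, 0] = draw()
    else:                                         # random: lower entries w.p. 3/p_B
        for j in range(1, p_B):
            for k in range(j):
                if rng.random() < 3.0 / p_B:
                    Om[j, k] = draw()
    Om = Om + Om.T                                # step 3: symmetrize
    # step 4 (assumed rule): diagonal dominance ensures positive definiteness
    np.fill_diagonal(Om, 0.5 + np.abs(Om).sum(axis=1))
    # step 5: rescale so that diag(Om^{-1}) = 1
    Sigma = np.linalg.inv(Om)
    d = np.sqrt(np.diag(Sigma))
    return Om * np.outer(d, d)
```

The final rescaling is a congruence transform, so it preserves positive definiteness while normalizing the implied covariance to unit variances.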

The true regression coefficient vector $\beta^m$ is given by

βm=[αTΩ1mαTΩ2m00]T,

for some vector 𝜶. To create heterogeneity, the second block of features is set to have no influence on the outcome variable with probability pht. That is, we have

βm=[αTΩ1m000]Tw.p.pht.

For each scenario, each row of X^m is independently sampled from 𝒩(0, (Ω^m)^{−1}). The responses are then generated from the linear model

y^m = X^m β^m + e^m,

where e^m ∼ 𝒩(0, σ²I). We generate a total of N = nM observations, with n samples assigned to each dataset.
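Putting the pieces together, one dataset can be generated as in the following sketch. The function `simulate_dataset` and its arguments are our own naming; it assumes the block structure described above, with the first two blocks carrying signal and the second block dropped with probability p_ht.

```python
import numpy as np

def simulate_dataset(Omega, alpha, n, p_ht, rng, sigma=1.0):
    """Simulate (X, y, beta) for one dataset given its precision matrix.

    Omega is block diagonal with blocks of size len(alpha); beta has
    blocks alpha^T Omega_1 and alpha^T Omega_2, the latter zeroed out
    with probability p_ht to induce heterogeneity.
    """
    p, pB = Omega.shape[0], len(alpha)
    beta = np.zeros(p)
    beta[:pB] = Omega[:pB, :pB] @ alpha             # first signal block
    if rng.random() >= p_ht:                        # keep block 2 w.p. 1 - p_ht
        beta[pB:2 * pB] = Omega[pB:2 * pB, pB:2 * pB] @ alpha
    Sigma = np.linalg.inv(Omega)                    # rows of X ~ N(0, Omega^{-1})
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    y = X @ beta + sigma * rng.standard_normal(n)   # y = X beta + e
    return X, y, beta
```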

We consider M = 5 datasets with p = 100 features (B = 10, p_B = 10). The error variance is σ² = 1, and we use α = [1, 1∕3, ⋯, 1∕3]ᵀ for scenarios 1 and 3 and α = [1, 1∕4, ⋯, 1∕4]ᵀ for scenario 2, which yields a signal-to-noise ratio of roughly 2.5 in all scenarios. Our methods are tuned by the validation method: the tuning parameters are selected simultaneously via a grid search over the multi-dimensional tuning parameter space. For example, IHM-SIL-LS for Scenario 1 in Table 1 searches over 25 × 10 × 6 grid points of (λ, η, λ_R); the tuple that minimizes the validated prediction error is selected and used for predicting the testing data. The training sample size is n = 200, the validation sample size is n_υ = 200, and the testing sample size is n_t = 1000. Every method is fitted and tuned in this way for a total of 100 replicates. In the tables, we report the simulation results evaluated by the mean squared prediction error (MSE), the average L2 distance between the estimated and true coefficients, the false positive rate (FPR), and the false negative rate (FNR).
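The four reported metrics can be computed per dataset as in the following sketch. The function and its signature are ours; we assume the usual definitions, with FPR taken over the truly zero coefficients and FNR over the truly nonzero ones.

```python
import numpy as np

def evaluate(beta_hat, beta_true, X_test, y_test):
    """Return (MSE, L2, FPR, FNR) for one fitted coefficient vector."""
    mse = np.mean((y_test - X_test @ beta_hat) ** 2)      # prediction error
    l2 = np.linalg.norm(beta_hat - beta_true)             # estimation error
    sel, true = beta_hat != 0, beta_true != 0
    fpr = np.mean(sel[~true]) if (~true).any() else 0.0   # zero coefs selected
    fnr = np.mean(~sel[true]) if true.any() else 0.0      # nonzero coefs missed
    return mse, l2, fpr, fnr
```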

TABLE 1.

Simulation results for homogeneity data. FHT; fully heterogeneous models, IHM; integrative homogeneity models, IHT; integrative heterogeneity models, 𝒢; Y indicates the method incorporates graph information, MSE; mean squared prediction error, L2; average L2 distance between estimated coefficients and true coefficients, FPR; false positive rates, FNR; false negative rates.

Type Method 𝒢 Scenario 1 Scenario 2 Scenario 3
MSE L 2 FPR FNR MSE L 2 FPR FNR MSE L 2 FPR FNR

FHT Lasso 1.274 (.004) 1.509 (.010) 0.330 (.005) 0.238 (.005) 1.348 (.004) 1.902 (.014) 0.402 (.005) 0.234 (.006) 1.274 (.004) 1.539 (.014) 0.313 (.005) 0.297 (.008)
Enet 1.274 (.004) 1.509 (.010) 0.330 (.005) 0.238 (.005) 1.348 (.004) 1.902 (.014) 0.402 (.005) 0.234 (.006) 1.274 (.004) 1.539 (.014) 0.313 (.005) 0.297 (.008)
SRIG Y 1.156 (.004) 1.052 (.008) 0.125 (.003) 0.112 (.003) 1.157 (.003) 1.269 (.014) 0.100 (.005) 0.000 (.000) 1.194 (.005) 1.223 (.015) 0.208 (.006) 0.089 (.004)

IHM gLasso 1.185 (.004) 1.262 (.010) 0.573 (.008) 0.026 (.003) 1.255 (.004) 1.670 (.013) 0.718 (.008) 0.010 (.002) 1.186 (.004) 1.285 (.014) 0.576 (.009) 0.096 (.007)
L2 gMCP 1.157 (.005) 1.116 (.019) 0.156 (.015) 0.231 (.015) 1.164 (.004) 1.097 (.017) 0.152 (.014) 0.056 (.007) 1.144 (.004) 1.070 (.016) 0.161 (.016) 0.254 (.012)
SIL-Lasso Y 1.099 (.004) 0.838 (.018) 0.187 (.019) 0.006 (.002) 1.130 (.003) 1.011 (.013) 0.169 (.016) 0.000 (.000) 1.109 (.004) 0.916 (.014) 0.229 (.017) 0.009 (.002)
SIL-MCP Y 1.109 (.004) 0.911 (.020) 0.092 (.015) 0.037 (.005) 1.120 (.003) 0.923 (.014) 0.045 (.009) 0.000 (.000) 1.111 (.003) 0.935 (.014) 0.122 (.014) 0.030 (.004)
SIL-LS Y 1.102 (.004) 0.864 (.018) 0.119 (.017) 0.028 (.004) 1.120 (.003) 0.921 (.014) 0.066 (.013) 0.000 (.000) 1.107 (.004) 0.912 (.013) 0.115 (.015) 0.033 (.005)

IHT sgLasso 1.194 (.004) 1.308 (.014) 0.536 (.016) 0.034 (.004) 1.266 (.004) 1.706 (.017) 0.689 (.013) 0.021 (.004) 1.194 (.004) 1.318 (.016) 0.552 (.015) 0.105 (.008)
L1 gMCP 1.190 (.006) 1.260 (.017) 0.052 (.005) 0.348 (.009) 1.190 (.006) 1.140 (.026) 0.065 (.005) 0.136 (.012) 1.170 (.006) 1.164 (.020) 0.068 (.006) 0.345 (.011)
SIL-Lasso Y 1.105 (.004) 0.852 (.016) 0.204 (.019) 0.011 (.002) 1.135 (.003) 1.044 (.014) 0.159 (.014) 0.000 (.000) 1.115 (.004) 0.936 (.014) 0.242 (.018) 0.015 (.003)
SIL-MCP Y 1.127 (.004) 0.991 (.019) 0.045 (.007) 0.072 (.005) 1.128 (.005) 0.957 (.028) 0.031 (.006) 0.000 (.000) 1.130 (.004) 1.022 (.015) 0.073 (.009) 0.068 (.006)
SIL-LS Y 1.116 (.004) 0.923 (.018) 0.061 (.009) 0.060 (.004) 1.125 (.004) 0.944 (.015) 0.046 (.010) 0.000 (.000) 1.123 (.004) 0.974 (.014) 0.084 (.010) 0.064 (.006)

In Table 1, we consider the case where all datasets have a homogeneous sparsity structure, i.e., p_ht = 0. For all scenarios, the integrative approaches (IHM and IHT) tend to outperform the fully heterogeneous methods (FHT), as they can take advantage of the common sparsity structure of the coefficients; the exception is SRIG, the one FHT method that incorporates graph information. Since our methods also use the graphical knowledge, they clearly outperform the other existing integrative learning methods. In particular, our three IHM methods show the best performance overall. Although our IHT versions lose slightly more weak signals than our IHM versions do, the loss is still much less severe than for the other existing IHT methods. This demonstrates the advantages of incorporating network information into integrative learning.

In Table 2, datasets can have different sparsity structures, with p_ht = 0.3. We observe performance patterns similar to those in Table 1. Although all methods exhibit slightly worse FPR and FNR than in Table 1 due to the heterogeneity in the sparsity structure of the coefficients, our methods still deliver substantially better variable selection performance than the methods without graph incorporation or the non-integrative learning methods. It is particularly worth noting that the existing IHT methods degrade more, relative to the homogeneity data (Table 1), than our IHT methods do. This again confirms the advantages of incorporating network information into integrative learning.

TABLE 2.

Simulation results for heterogeneity data. FHT; fully heterogeneous models, IHM; integrative homogeneity models, IHT; integrative heterogeneity models, 𝒢; Y indicates the method incorporates graph information, MSE; mean squared prediction error, L2; average L2 distance between estimated coefficients and true coefficients, FPR; false positive rates, FNR; false negative rates.

Type Method 𝒢 Scenario 1 Scenario 2 Scenario 3
MSE L 2 FPR FNR MSE L 2 FPR FNR MSE L 2 FPR FNR

FHT Lasso 1.245 (.005) 1.427 (.011) 0.295 (.005) 0.247 (.005) 1.310 (.005) 1.821 (.015) 0.357 (.006) 0.257 (.007) 1.244 (.005) 1.458 (.015) 0.278 (.005) 0.313 (.008)
Enet 1.245 (.005) 1.427 (.011) 0.295 (.005) 0.247 (.005) 1.310 (.005) 1.821 (.015) 0.357 (.006) 0.257 (.007) 1.244 (.005) 1.458 (.015) 0.278 (.005) 0.313 (.008)
SRIG Y 1.135 (.004) 0.967 (.011) 0.112 (.003) 0.115 (.004) 1.132 (.004) 1.149 (.016) 0.094 (.006) 0.000 (.000) 1.164 (.005) 1.120 (.015) 0.193 (.006) 0.088 (.005)

IHM gLasso 1.178 (.004) 1.238 (.010) 0.576 (.008) 0.030 (.004) 1.243 (.004) 1.641 (.013) 0.706 (.008) 0.016 (.003) 1.179 (.004) 1.262 (.013) 0.570 (.009) 0.099 (.007)
L2 gMCP 1.159 (.005) 1.123 (.021) 0.157 (.015) 0.265 (.016) 1.169 (.006) 1.152 (.024) 0.209 (.016) 0.080 (.011) 1.147 (.004) 1.090 (.018) 0.180 (.018) 0.281 (.015)
SIL-Lasso Y 1.105 (.005) 0.860 (.020) 0.217 (.019) 0.013 (.003) 1.137 (.004) 1.064 (.022) 0.215 (.020) 0.003 (.002) 1.113 (.004) 0.938 (.017) 0.266 (.018) 0.014 (.003)
SIL-MCP Y 1.113 (.004) 0.913 (.021) 0.123 (.015) 0.049 (.006) 1.127 (.005) 0.980 (.025) 0.073 (.010) 0.003 (.002) 1.113 (.004) 0.945 (.018) 0.140 (.014) 0.034 (.004)
SIL-LS Y 1.107 (.004) 0.884 (.020) 0.137 (.016) 0.039 (.005) 1.126 (.005) 0.970 (.025) 0.085 (.012) 0.003 (.002) 1.110 (.004) 0.926 (.017) 0.156 (.016) 0.033 (.005)

IHT sgLasso 1.195 (.004) 1.311 (.016) 0.499 (.019) 0.061 (.007) 1.260 (.005) 1.706 (.021) 0.629 (.016) 0.051 (.008) 1.196 (.005) 1.335 (.019) 0.500 (.019) 0.145 (.012)
L1 gMCP 1.191 (.006) 1.246 (.020) 0.073 (.006) 0.369 (.012) 1.192 (.006) 1.197 (.029) 0.079 (.005) 0.192 (.013) 1.166 (.005) 1.164 (.023) 0.071 (.005) 0.381 (.012)
SIL-Lasso Y 1.107 (.005) 0.862 (.022) 0.214 (.018) 0.016 (.003) 1.128 (.004) 1.033 (.018) 0.179 (.017) 0.001 (.001) 1.119 (.006) 0.952 (.021) 0.249 (.018) 0.025 (.004)
SIL-MCP Y 1.129 (.004) 0.987 (.019) 0.070 (.007) 0.091 (.008) 1.126 (.005) 0.957 (.026) 0.067 (.006) 0.000 (.000) 1.133 (.005) 1.031 (.020) 0.085 (.008) 0.087 (.009)
SIL-LS Y 1.117 (.004) 0.913 (.020) 0.082 (.010) 0.079 (.006) 1.124 (.004) 0.970 (.020) 0.073 (.009) 0.001 (.001) 1.122 (.005) 0.976 (.020) 0.096 (.010) 0.080 (.008)

Since the proposed methods rely on graph information, we conduct a sensitivity analysis that accounts for uncertainty in the graphical knowledge and its inconsistency with the regression coefficients. In the sensitivity analysis, we randomly remove about 20% of the edges from the true graph and use the reduced graph as the working graph. This mimics the intermediate situation where only strong interactions are known, or the case where edges (partial correlations) are missing due to a screening of predictors. In Table 3, we can see that the performance of the methods that use graph information deteriorates, while that of the methods that do not use graph information remains similar to Table 1. However, the deterioration of our methods is very small compared to that of SRIG, which appears attributable to the effect of integrative learning. This lends support to the robustness of our methods to misspecified graphical information with missing edges.

TABLE 3.

Sensitivity analysis results. FHT; fully heterogeneous models, IHM; integrative homogeneity models, IHT; integrative heterogeneity models, 𝒢; Y indicates the method incorporates graph information, MSE; mean squared prediction error, L2; average L2 distance between estimated coefficients and true coefficients, FPR; false positive rates, FNR; false negative rates.

Type Method 𝒢 Scenario 1 Scenario 2 Scenario 3
MSE L 2 FPR FNR MSE L 2 FPR FNR MSE L 2 FPR FNR

FHT Lasso 1.267 (.004) 1.508 (.009) 0.322 (.005) 0.250 (.005) 1.349 (.004) 1.925 (.014) 0.400 (.006) 0.226 (.006) 1.279 (.005) 1.565 (.014) 0.329 (.005) 0.285 (.007)
Enet 1.267 (.004) 1.508 (.009) 0.322 (.005) 0.250 (.005) 1.349 (.004) 1.925 (.014) 0.400 (.006) 0.226 (.006) 1.279 (.005) 1.565 (.014) 0.329 (.005) 0.285 (.007)
SRIG Y 1.184 (.005) 1.220 (.018) 0.150 (.006) 0.166 (.006) 1.245 (.006) 1.495 (.024) 0.275 (.009) 0.158 (.007) 1.239 (.006) 1.469 (.018) 0.151 (.005) 0.199 (.007)

IHM gLasso 1.178 (.004) 1.256 (.011) 0.583 (.008) 0.024 (.004) 1.254 (.004) 1.685 (.012) 0.717 (.007) 0.005 (.002) 1.191 (.004) 1.310 (.013) 0.607 (.009) 0.084 (.006)
L2 gMCP 1.148 (.004) 1.080 (.019) 0.251 (.020) 0.166 (.016) 1.161 (.004) 1.108 (.017) 0.168 (.015) 0.042 (.006) 1.146 (.005) 1.080 (.019) 0.182 (.014) 0.214 (.010)
SIL-Lasso Y 1.108 (.004) 0.917 (.015) 0.221 (.020) 0.006 (.002) 1.141 (.003) 1.102 (.012) 0.257 (.016) 0.002 (.001) 1.114 (.004) 0.969 (.015) 0.231 (.017) 0.017 (.004)
SIL-MCP Y 1.110 (.004) 0.931 (.016) 0.111 (.017) 0.032 (.005) 1.124 (.004) 0.942 (.015) 0.073 (.010) 0.006 (.002) 1.112 (.004) 0.951 (.014) 0.099 (.015) 0.060 (.007)
SIL-LS Y 1.106 (.004) 0.906 (.015) 0.124 (.018) 0.030 (.005) 1.122 (.003) 0.949 (.011) 0.082 (.012) 0.008 (.002) 1.107 (.004) 0.927 (.013) 0.081 (.014) 0.054 (.006)

IHT sgLasso 1.189 (.004) 1.290 (.014) 0.576 (.016) 0.035 (.006) 1.263 (.004) 1.719 (.017) 0.688 (.012) 0.017 (.003) 1.204 (.005) 1.363 (.015) 0.556 (.016) 0.104 (.008)
L1 gMCP 1.187 (.007) 1.254 (.020) 0.070 (.008) 0.350 (.011) 1.179 (.006) 1.124 (.023) 0.056 (.004) 0.132 (.012) 1.164 (.005) 1.141 (.020) 0.074 (.005) 0.321 (.012)
SIL-Lasso Y 1.113 (.004) 0.934 (.016) 0.217 (.019) 0.015 (.003) 1.149 (.004) 1.131 (.014) 0.247 (.015) 0.014 (.002) 1.115 (.004) 0.968 (.012) 0.235 (.017) 0.020 (.004)
SIL-MCP Y 1.130 (.004) 1.026 (.016) 0.042 (.006) 0.105 (.007) 1.135 (.004) 0.981 (.021) 0.029 (.004) 0.047 (.005) 1.127 (.004) 1.015 (.015) 0.035 (.005) 0.121 (.009)
SIL-LS Y 1.120 (.004) 0.971 (.015) 0.058 (.009) 0.090 (.006) 1.137 (.003) 0.997 (.013) 0.049 (.007) 0.060 (.004) 1.122 (.004) 0.989 (.016) 0.044 (.007) 0.110 (.008)

6 |. APPLICATION

Alzheimer’s disease (AD) is a major cause of dementia. The Alzheimer’s Disease Neuroimaging Initiative (ADNI) is a large-scale, multisite longitudinal study in which researchers at 63 sites track the progression of AD in the human brain through normal aging, early mild cognitive impairment (EMCI), and late mild cognitive impairment (LMCI) to dementia or AD. Its goal is to validate diagnostic and prognostic biomarkers that can predict the progression of AD.

In our data analysis, we investigate the association between patients’ gene expression levels and an imaging marker that captures AD progression. Specifically, we treat fluorodeoxyglucose positron emission tomography (FDG-PET), averaged over the regions of interest (ROI), as the response variable; FDG-PET measures cell metabolism, and cells affected by AD tend to show reduced metabolism. Since the association of FDG with gene expression levels may change at different stages of AD, we divide the total of 675 subjects into three groups according to their baseline disease status: CN (cognitively normal, n = 229), MCI (EMCI+LMCI, n = 402), and AD (n = 44).

The samples in each group are randomly split into a training set (50%), a validation set (25%), and a testing set (25%). For each split, we fit our method and the existing methods considered in Section 5, plus some fully homogeneous models to check the heterogeneity of the datasets, and report the prediction errors on the testing samples. The regularization parameters of all methods are tuned by the validation method, and the graph information is obtained from KEGG. This procedure is repeated for 200 random splits of the data, and the average squared prediction errors are reported in Table 4.
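The repeated random-split procedure can be sketched with a helper like the one below (a hypothetical utility, not the authors' code); sizes round down when the group size is not divisible by four.

```python
import numpy as np

def split_indices(n, rng):
    """Randomly split n sample indices 50/25/25 into train/validation/test."""
    idx = rng.permutation(n)
    n_tr, n_v = n // 2, n // 4
    return idx[:n_tr], idx[n_tr:n_tr + n_v], idx[n_tr + n_v:]
```

Repeating the split 200 times with fresh `rng` draws and averaging the test-set squared prediction errors reproduces the reporting scheme described above.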

TABLE 4.

Average prediction errors for ADNI dataset. FHM; fully homogeneous models, FHT; fully heterogeneous models, IHM; integrative homogeneity models, IHT; integrative heterogeneity models, 𝒢; Y indicates the method incorporates graph information.

Type Method 𝒢 MCI AD CN

FHM Lasso 0.926 1.094 1.020
Enet 0.901 1.075 0.987
SRIG Y 0.955 1.063 1.035

FHT Lasso 0.916 1.028 0.996
Enet 0.881 0.983 0.961
SRIG Y 0.933 1.035 1.008

IHM gLasso 0.934 1.017 0.991
L2 gMCP 0.898 1.027 1.005
SIL-Lasso Y 0.873 0.946 0.946
SIL-MCP Y 0.876 0.947 0.948
SIL-LS Y 0.879 0.948 0.945

IHT sgLasso 0.939 1.022 0.996
L1 gMCP 0.914 1.045 1.001
SIL-Lasso Y 0.862 0.950 0.940
SIL-MCP Y 0.878 0.934 0.955
SIL-LS Y 0.881 0.941 0.949

As shown in Table 4, all of the FHM methods tend to underperform the corresponding FHT methods, suggesting that the model of interest likely has different parameters for different groups. Despite such heterogeneity, our methods show the best prediction performance for all groups. The existing integrative learning approaches, which do not incorporate network information, appear to have difficulty integrating information from the different datasets.

Another benefit of incorporating graphical pathway information is the enhanced interpretability of the selected genes. To assess this, we conduct a pathway enrichment analysis based on the 30 genes most frequently selected by each method across the 200 repeats. Table 5 lists 10 enriched pathways related to Alzheimer’s disease and the associated p-values. No method that omits graph information, including the existing integrative learning approaches, yields any enriched pathway. Apart from the SIL methods, only the fully heterogeneous SRIG yields some enriched pathways, and its p-values tend to be larger than those of our methods.

TABLE 5.

Ten enriched pathways and p-values for each method. ‘-’ indicates not enriched in the genes selected by the method. FHM; fully homogeneous models, FHT; fully heterogeneous models, IHM; integrative homogeneity models, IHT; integrative heterogeneity models, 𝒢; Y indicates the method incorporates graph information, P1; AGE-RAGE signaling pathway, P2; Angiopoietin receptor Tie2-mediated signaling, P3; Chemokine signaling pathway, P4; CXCR4-mediated signaling events, P5; Glucocorticoid receptor regulatory network, P6; IL2-mediated signaling events, P7; MAPKinase Signaling Pathway, P8; Prolactin signaling pathway, P9; Signaling by PDGF, P10; Tuberculosis.

Type Method 𝒢 P1 P2 P3 P4 P5 P6 P7 P8 P9 P10

FHM Lasso - - - - - - - - - -
Enet - - - - - - - - - -
SRIG Y - - - - - - - - - -

FHT Lasso - - - - - - - - - -
Enet - - - - - - - - - -
SRIG Y 1.6e-5 5.8e-5 - - 2.9e-4 7.3e-5 2.8e-4 1.8e-4 - -

IHM gLasso - - - - - - - - - -
L2 gMCP - - - - - - - - - -
SIL-Lasso Y 2.1e-9 4.2e-6 - 9.7e-8 1.1e-6 5.8e-6 1.1e-6 - 1.6e-6 2.9e-6
SIL-MCP Y 1.1e-6 1.9e-6 2.1e-5 1.2e-6 1.6e-5 4.1e-8 1.6e-5 1.9e-7 4.9e-6 1.9e-5
SIL-LS Y 1.3e-6 1.1e-4 - - 2.0e-5 1.5e-4 1.1e-4 1.0e-5 8.1e-5 -

IHT sgLasso - - - - - - - - - -
L1 gMCP - - - - - - - - - -
SIL-Lasso Y 5.8e-11 1.3e-9 7.7e-9 1.3e-12 1.3e-6 1.3e-7 1.3e-6 - - 1.7e-7
SIL-MCP Y - 2.1e-4 2.6e-4 - 1.5e-6 - 5.0e-6 2.7e-5 2.0e-4
SIL-LS Y 1.3e-6 1.1e-4 - 4.4e-5 - 1.5e-4 1.1e-4 1.0e-5 - -

7 |. DISCUSSION

We have proposed a novel integrative learning method, called SIL, which can incorporate the graphical structure of features. SIL possesses appealing theoretical properties, is scalable to high-dimensional data, and has been shown to outperform existing integrative learning methods through a simulation study and a real data analysis.

In practice, the ground-truth sparsity structure of β0 may not be consistent with the graphical structure. However, when the discrepancy is moderate, our proposed method will still perform reasonably well by detecting the subset of groups that covers all or most of the nonzero coefficients. Note that the sensitivity analysis (Table 3), which was conducted partly in consideration of such inconsistency, suggests the proposed method is quite robust. Even when the graphical information is completely irrelevant to the sparsity structure, our method will not fail: the tuning procedure will discourage group-wise selection, and we can expect performance comparable to that of plain ridge regression.

On the other hand, it is widely acknowledged that the graph information obtained from existing databases could be inaccurate or incomplete. It is potentially of future interest to investigate approaches that are more robust to incomplete graph information. One potential approach is to combine the graph information from existing databases and the estimated graph information using the data being analyzed. Another direction for future research is to incorporate graph information that may vary between datasets.


ACKNOWLEDGMENTS

This work is partly supported by NIH grant RF1AG063481. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The complete ADNI Acknowledgement is available online.

Footnotes

Conflict of interest

The authors declare no potential conflict of interests.

Financial disclosure

None reported.

SUPPORTING INFORMATION

The supplementary material available online includes additional algorithms and the proof of theorems.

References

  • [1] Armagan A, Dunson DB, and Lee J, 2013: Generalized double pareto shrinkage. Statistica Sinica, 23, no. 1, 119–143.
  • [2] Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, and Sherlock G, 2000: Gene ontology: tool for the unification of biology. Nature Genetics, 25, no. 1, 25–29, doi: 10.1038/75556.
  • [3] Beck A and Teboulle M, 2009: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2, no. 1, 183–202.
  • [4] Bühlmann P and van de Geer S, 2011: Statistics for high-dimensional data: Methods, theory and applications. Springer Series in Statistics, Berlin: Springer.
  • [5] Bickel PJ, Ritov Y, and Tsybakov AB, 2009: Simultaneous analysis of lasso and dantzig selector. Ann. Statist., 37, no. 4, 1705–1732, doi: 10.1214/08-AOS620.
  • [6] Candès EJ, Wakin MB, and Boyd SP, 2008: Enhancing sparsity by reweighted l1 minimization. Journal of Fourier Analysis and Applications, 14, no. 5, 877–905, doi: 10.1007/s00041-008-9045-x.
  • [7] Chang C, Kundu S, and Long Q, 2018: Scalable bayesian variable selection for structured high-dimensional data. Biometrics, 74, no. 4, 1372–1382.
  • [8] Chang C, Oh J, and Long Q, 2020: Gria: Graphical regularization for integrative analysis. Proceedings of the 2020 SIAM International Conference on Data Mining, 604–612.
  • [9] Fan J and Li R, 2001: Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, no. 456, 1348–1360, doi: 10.1198/016214501753382273.
  • [10] Gong P, Zhang C, Lu Z, Huang JZ, and Ye J, 2013: A general iterative shrinkage and thresholding algorithm for non-convex regularized optimization problems. Proceedings of the 30th International Conference on Machine Learning, JMLR.org, ICML’13, II-37–II-45. URL http://dl.acm.org/citation.cfm?id=3042817.3042898
  • [11] Huang Y, Zhang Q, Zhang S, Huang J, and Ma S, 2017: Promoting similarity of sparsity structures in integrative analysis with penalization. Journal of the American Statistical Association, 112, no. 517, 342–350.
  • [12] Jacob L, Obozinski G, and Vert J-P, 2009: Group lasso with overlap and graph lasso. Proceedings of the 26th Annual International Conference on Machine Learning, ACM, New York, NY, USA, ICML ’09, 433–440.
  • [13] Kanehisa M, Furumichi M, Tanabe M, Sato Y, and Morishima K, 2017: KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research, 45, no. D1, D353–D361.
  • [14] Li C and Li H, 2008: Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics, 24, no. 9, 1175–1182.
  • [15] Li F and Zhang NR, 2010: Bayesian variable selection in structured high-dimensional covariate spaces with applications in genomics. Journal of the American Statistical Association, 105, no. 491, 1202–1214.
  • [16] Li Q, Wang S, Huang C-C, Yu M, and Shao J, 2014: Meta-analysis based variable selection for gene expression data. Biometrics, 70, no. 4, 872–880.
  • [17] Li Z, Chang C, Kundu S, and Long Q, 2020: Bayesian generalized biclustering analysis via adaptive structured shrinkage. Biostatistics, 21, no. 3, 610–624.
  • [18] Liu B, Wu C, Shen X, and Pan W, 2017: A novel and efficient algorithm for de novo discovery of mutated driver pathways in cancer. The Annals of Applied Statistics, 11, no. 3, 1481.
  • [19] Liu J, Huang J, and Ma S, 2013: Incorporating network structure in integrative analysis of cancer prognosis data. Genetic Epidemiology, 37, no. 2, 173–183.
  • [20] Liu J, Huang J, Zhang Y, Lan Q, Rothman N, Zheng T, and Ma S, 2014: Integrative analysis of prognosis data on multiple cancer subtypes. Biometrics, 70, no. 3, 480–488.
  • [21] Liu J, Ma S, and Huang J, 2014: Integrative analysis of cancer diagnosis studies with composite penalization. Scandinavian Journal of Statistics, 41, no. 1, 87–103.
  • [22] Ma S, Huang J, and Song X, 2011: Integrative analysis and variable selection with multiple high-dimensional data sets. Biostatistics, 12, no. 4, 763–775.
  • [23] Pan W, Xie B, and Shen X, 2010: Incorporating predictor network in penalized regression with application to microarray data. Biometrics, 66, no. 2, 474–484.
  • [24] Stingo FC, Chen YA, Tadesse MG, and Vannucci M, 2011: Incorporating biological information into linear models: a Bayesian approach to the selection of pathways and genes. Annals of Applied Statistics, 5, no. 3, 1978–2002.
  • [25] Tibshirani R, 1996: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58, no. 1, 267–288.
  • [26] Yu G and Liu Y, 2016: Sparse regression incorporating graphical structure among predictors. Journal of the American Statistical Association, 111, no. 514, 707–720.
  • [27] Zhang C-H, 2010: Nearly unbiased variable selection under minimax concave penalty. Ann. Statist., 38, no. 2, 894–942, doi: 10.1214/09-AOS729.
  • [28] Zhao P and Yu B, 2006: On model selection consistency of lasso. J. Mach. Learn. Res., 7, 2541–2563.
  • [29] Zhao Q, Shi X, Huang J, Liu J, Li Y, and Ma S, 2015: Integrative analysis of ‘-omics’ data using penalty functions. Wiley Interdisciplinary Reviews: Computational Statistics, 7, no. 1, 99–108.
  • [30] Zhao Y, Chang C, and Long Q, 2019: Knowledge-guided statistical learning methods for analysis of high-dimensional -omics data in precision oncology. JCO Precision Oncology, no. 3, 1–9, doi: 10.1200/PO.19.00018.
  • [31] Zou H, 2006: The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, no. 476, 1418–1429.
  • [32] Zou H and Hastie T, 2005: Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67, no. 2, 301–320.
