Author manuscript; available in PMC: 2024 Apr 1.
Published in final edited form as: Stat Anal Data Min. 2022 Nov 8;16(2):120–134. doi: 10.1002/sam.11601

Integrative Learning of Structured High-Dimensional Data from Multiple Datasets

Changgee Chang 1,*, Zongyu Dai 2, Jihwan Oh 1, Qi Long 1,*
PMCID: PMC10195070  NIHMSID: NIHMS1844882  PMID: 37213790

Summary

Integrative learning of multiple datasets has the potential to mitigate the challenge of small n and large p that is often encountered in analysis of big biomedical data such as genomics data. Detection of weak yet important signals can be enhanced by jointly selecting features for all datasets. However, the set of important features may not always be the same across all datasets. Although some existing integrative learning methods allow heterogeneous sparsity structure where a subset of datasets can have zero coefficients for some selected features, they tend to yield reduced efficiency, reinstating the problem of losing weak important signals. We propose a new integrative learning approach which can not only aggregate important signals well in homogeneous sparsity structure, but also substantially alleviate the problem of losing weak important signals in heterogeneous sparsity structure. Our approach exploits a priori known graphical structure of features and encourages joint selection of features that are connected in the graph. Integrating such prior information over multiple datasets enhances the power, while also accounting for the heterogeneity across datasets. Theoretical properties of the proposed method are investigated. We also demonstrate the limitations of existing approaches and the superiority of our method using a simulation study and analysis of gene expression data from ADNI.

Keywords: integrative learning, horizontally partitioned data, knowledge-guided learning, network-based penalty, high-dimensional data

1 |. INTRODUCTION

Massive amounts of high-throughput -omics data generated in recent studies offer great promise for deepening our understanding of the molecular underpinnings and mechanisms of complex diseases such as Alzheimer’s disease and cancer. At the same time, they still present significant analytical challenges, as the sample size in a single study is often small to moderate. There is a large body of literature on regularized regression models for the analysis of high-dimensional data in the setting where the number of variables is larger than the sample size. While many of these methods have appealing asymptotic properties, there is a growing recognition that their performance in practice is often unsatisfactory when the sample size is small and the signal-to-noise ratio is low. A number of approaches have been proposed to mitigate this limitation of regularized regressions, particularly for the analysis of genomics data.

One popular approach is to incorporate prior knowledge on high-dimensional predictors such as gene regulatory pathways and co-expression networks that are represented by graphs and can be obtained from public or commercial databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG, Kanehisa et al. [13]) and Gene Ontology [2]. The knowledge-guided approach for structured data whose variables lie on a graph [15] has been adopted in supervised learning such as regression [14, 23, 26, 7] and in unsupervised learning [17, 18], through carefully designed penalty functions in a frequentist framework or prior specifications in a Bayesian framework. The rationale behind incorporating the graphical structure of features into supervised learning is the fact that phenotypic biomarkers are often manifested as a result of interaction between a group of genes (a pathway). It is typically not the case that the important features are unrelated. Rather, one or more groups of closely related genes have predictive power jointly. Therefore, the graphical information can be integrated by encouraging group-wise selection of the model coefficients. For example, Li and Li [14] and Pan et al. [23] propose network-based penalties which encourage joint selection of the predictors that are connected in the graph. Li and Zhang [15] and Stingo et al. [24] use a Markov random field (MRF) prior combined with a spike and slab prior to encourage selection of connected predictors. More recently, Chang et al. [7] propose a structured shrinkage prior which mitigates some issues associated with prior Bayesian methods. While all the aforementioned methods use the predictor graph in an edge-by-edge manner and encourage selection of adjacent nodes, [26] proposed a method that uses the predictor graph in a node-by-node manner and encourages selection of the neighborhood group of each predictor.
These knowledge-guided statistical learning methods have shown improved prediction accuracy and improved power for detecting weak yet important signals in finite samples and they encourage selection of pathways rather than individual features, leading to biologically more meaningful and interpretable results [30].

Another useful approach for mitigating the small sample size problem is integrative learning of multiple datasets that contain the same set of variables, also known as horizontally partitioned data. Multiple datasets are broadly defined as being collected from multiple studies/sites, from multiple sex/racial groups, or from multiple related disease groups. One key advantage of integrative learning in regression is that it improves the power for detecting important predictors that are shared across the datasets [29]. Ma et al. [22] proposed an integrative analysis approach that assumes the same sparsity structure of the regression coefficients across all datasets but allows for different effect sizes. The homogeneous sparsity assumption, however, can be overly restrictive in some applications. This assumption is relaxed in subsequent work by, among others, Li et al. [16], Liu et al. [20, 21], and Huang et al. [11], which allow for heterogeneity in the sparsity structure across multiple datasets. However, these existing integrative learning methods do not account for important graph information for structured predictors such as genomics data, which has the potential to further improve the power of detecting weak yet important signals. Moreover, because the existing heterogeneity models allow the coefficient of a selected feature to be zero in some datasets, they may miss such weak yet important signals, weakening the power of integrative learning. To the best of our knowledge, there has been little work on incorporating graph information into integrative learning except for [19], and their approach relies on the assumption of homogeneous sparsity across all datasets, which may be unrealistic in many applications. For example, when performing integrative learning of datasets from populations at different stages of a disease, the set of important predictors may vary across these datasets.

To address this gap, we propose a novel integrative learning approach, called Structured Integrative Learning (SIL), which enables incorporation of structural information such as graphical knowledge on predictors. The key idea underlying SIL is that if a group of features as defined by pathways/networks is important in one dataset, it is likely to be important for the other datasets as well. Our approach is designed to select ‘groups of features’ jointly for all datasets rather than selecting ‘individual features’ jointly. As such, it is expected to further improve the power of detecting weak, yet important signals. Our proposed method can accommodate both homogeneous and heterogeneous sparsity structures, and in particular it is theoretically justified. We establish oracle inequalities that provide non-asymptotic upper bounds on the estimation and prediction errors. We also investigate the conditions for the oracle property to hold in the setting where both the number of datasets and the number of predictors diverge. Another contribution of our work is an iterative shrinkage-thresholding algorithm [3, 10] for fitting our model, which is much more scalable than the (sub)gradient descent algorithms that have typically been used in prior work on integrative learning. We show that the proximal operators associated with our regularizers have analytic solutions and can be evaluated very efficiently.

We note that Chang et al. [8] presented intermediate results of our research on the proposed method. Compared to that earlier version, this work includes several significant improvements. The current work presents a more general penalty formulation of which the penalty in the prior version can be viewed as a special case. Theoretical properties of the general penalty are rigorously investigated, while the earlier version included no theoretical results. We include new regularizers based on the log-sum penalty, along with efficient algorithms for them. The regularizers in the prior paper are still included and compared in the simulation and data analysis studies. In the simulation study, we compare performance using both homogeneous and heterogeneous data, while the earlier version used only heterogeneous data. Moreover, we perform a sensitivity analysis that investigates the robustness of our method against inaccurate and/or incomplete graphical information, whereas no sensitivity analysis was conducted in the earlier work. Finally, we include a pathway enrichment analysis in the data analysis, which demonstrates that our method yields outcomes that are more interpretable and biologically meaningful, while the prior work included no pathway enrichment analysis.

The remainder of this article is organized as follows. We describe the problem of interest and present our proposed method in Section 2, and then present the numerical algorithm in Section 3. In Section 4, we present the theoretical properties of the proposed method. In Section 5, we conduct simulation studies to evaluate the performance of our approach in comparison with several existing methods. In Section 6, we further illustrate the strengths of the proposed method through analysis of real data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). We conclude the paper with some discussion in Section 7.

2 |. METHOD

2.1 |. Background

To fix ideas, consider fitting a linear regression model using data from M datasets. In the m-th dataset, we have an $n_m \times p$ predictor matrix $X^m$ and an $n_m \times 1$ response vector $y^m$, where $n_m$ is the sample size of the m-th dataset and p is the number of predictors. Let $N = \sum_m n_m$ be the total sample size. The model of interest is the linear model

$y^m = X^m \beta^{0m} + e^m, \quad m = 1, \ldots, M,$ (1)

where $\beta^{0m} = (\beta_1^{0m}, \ldots, \beta_p^{0m})^T$ is the $p \times 1$ true coefficient vector and $e^m \sim \mathcal{N}(0, \sigma^2 I)$ is the $n_m \times 1$ error vector for the m-th dataset. The regularized least squares loss function is generally given by

$\ell(B) = \sum_{m=1}^M L_m(\beta^m) + P(B),$

where B = [β1βM] is the p × M coefficient matrix, P(B) is a penalty on B, and

$L_m(\beta^m) = \frac{1}{2n_m}\|y^m - X^m\beta^m\|_2^2.$

The most general estimator would allow all $\beta^m$ to be different and use a separable penalty $P(B) = \sum_{m=1}^M P_m(\beta^m)$. This is equivalent to independently minimizing

$\ell_m(\beta^m) = L_m(\beta^m) + P_m(\beta^m),$

which is equivalent to analyzing each dataset separately. We call this model the fully heterogeneous model. On the other hand, the least general estimator would assume all coefficients to be the same across all datasets, $\beta^m \equiv \beta$. This fully homogeneous model is equivalent to merging all datasets with each data point weighted by the reciprocal of the size of the dataset it belongs to. The weights prevent a large dataset from dominating the loss function and keep the coefficients from leaning favorably only toward the large dataset.
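As a minimal numpy sketch (function and variable names are ours, not from the paper), the per-dataset loss and its pooled version can be written as:

```python
import numpy as np

def dataset_loss(y_m, X_m, beta_m):
    """L_m(beta^m) = ||y^m - X^m beta^m||_2^2 / (2 n_m)."""
    n_m = len(y_m)
    r = y_m - X_m @ beta_m
    return r @ r / (2.0 * n_m)

def pooled_loss(ys, Xs, B):
    """Sum of per-dataset losses for a p x M coefficient matrix B.
    Because each residual is weighted by 1/n_m, every dataset contributes
    equally regardless of its size; passing identical columns in B gives the
    objective of the fully homogeneous (merged, reweighted) model."""
    return sum(dataset_loss(y, X, B[:, m])
               for m, (y, X) in enumerate(zip(ys, Xs)))
```

Minimizing each column of B separately (with its own penalty) corresponds to the fully heterogeneous model described above.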

Obviously, the full homogeneity assumption can be overly restrictive. Each dataset often has its own characteristics, and the association between the outcome and the predictors can be different. Ignoring the difference can lead to poor or suboptimal performance in estimation and prediction. On the other hand, the fully heterogeneous model fails to borrow information across datasets, and the regression model for each dataset can suffer the curse of dimensionality. The motivation of integrative learning is to aggregate common information from multiple datasets while accounting for heterogeneity across these datasets.

2.2 |. Structured Integrative Learning

In this work, we focus on the case where the graph information for predictors is the same across the M datasets. Denote by $G = \langle V, E \rangle$ the graph on predictors X, where $V = \{1, \ldots, p\}$ is the set of features and E is the set of edges between the features. Let $A = [a_{jk}]$ be the adjacency matrix associated with G and let $\mathcal{A}_j = \{k : a_{jk} = 1\} \cup \{j\}$ be the neighborhood of the j-th feature including itself. Let $e = |E|$ be the number of edges in G and $a_j = |\mathcal{A}_j|$ be the number of members in $\mathcal{A}_j$. The graphical information on features often represents the partial correlation structure of the features. That is, the presence of an edge between features j and k implies the (j, k) entry of the feature precision matrix is nonzero, while the absence of an edge means a zero entry. In the analysis of genomics data, such graphs often represent gene regulatory pathways or co-expression networks, which can be obtained from existing databases such as KEGG [13].
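For illustration, the neighborhoods $\mathcal{A}_j$ can be read off the adjacency matrix directly (a sketch with names of our choosing, not the paper's):

```python
import numpy as np

def neighborhoods(A):
    """A_j = {k : a_jk = 1} union {j} for each node j of the feature graph,
    given a symmetric 0/1 adjacency matrix A of shape (p, p)."""
    p = A.shape[0]
    return [set(np.flatnonzero(A[j]).tolist()) | {j} for j in range(p)]
```

The group sizes $a_j = |\mathcal{A}_j|$ used later in the penalty weights are then just the sizes of these sets.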

For Model (1), note that we have $\beta^m = \Omega^m c^m$, where $\Omega^m = n_m\,\mathbb{E}(X^{mT}X^m)^{-1}$ and $c^m = \frac{1}{n_m}\mathbb{E}(X^{mT}y^m)$. This yields

$\beta^m = \sum_{j=1}^p c_j^m \omega_j^m,$ (2)

where $\omega_j^m$ is the j-th column of $\Omega^m$. Since the absence of an edge implies a zero partial correlation, we have $\mathrm{supp}(\omega_j^m) = \mathcal{A}_j$. Following the observation from Yu and Liu [26], either $\beta_{\mathcal{A}_j}^m \equiv (\beta_k^m)_{k \in \mathcal{A}_j}$ can have nonzero values (if $c_j^m \neq 0$), or there will be no contribution from the j-th group to the effect size $\beta^m$ (if $c_j^m = 0$). The key premise of our work is that if a group $\mathcal{A}_j$ is important for one dataset, it is likely to be important for other datasets as well. To encourage joint selection of the feature groups $\mathcal{A}_j$ across all datasets, we propose the following penalty.

$P(B) = \lambda\|B\|_\tau, \qquad \|B\|_\tau \equiv \inf_{B = \sum_j \Gamma_j,\ \mathrm{supp}(\gamma_j^m) = \mathcal{A}_j}\ \sum_{j=1}^p \tau_j \rho(\Gamma_j),$ (3)

where Γj=[γj1γjM] in an arbitrary p × M matrix.

Note that the standard group lasso penalty, such as $P(B) = \lambda\sum_{j=1}^p \tau_j \|B_{\mathcal{A}_j}\|_F$ where $B_{\mathcal{A}_j} = [\beta_{\mathcal{A}_j}^1 \cdots \beta_{\mathcal{A}_j}^M]$ and $\|\cdot\|_F$ is the Frobenius norm, has an undesirable characteristic. Since the $\mathcal{A}_j$ are overlapping in general, it yields $\beta_j^m \neq 0$ if and only if all groups including j are selected. In other words, gene j can be selected if and only if all groups (pathways) including j are important, which is not plausible in biology. On the other hand, the proposed penalty (3) introduces latent coefficients $\Gamma_j$ and can yield $\beta_j^m \neq 0$ if at least one group that contains j is selected. That is, if a pathway is important, then all genes in the pathway can become important. This property is required for consistency with (2). More details about the latent group lasso penalty can be found in Jacob et al. [12]. It is easy to see that applying the proposed penalty is equivalent to replacing $\beta^m$ with $\sum_j \gamma_j^m$ in the loss function and enforcing a penalty on $\Gamma = (\Gamma_1, \ldots, \Gamma_p)$ as follows.

$L(\Gamma) = \sum_{m=1}^M \frac{1}{2n_m}\Big\|y^m - X^m\sum_{j=1}^p \gamma_j^m\Big\|_2^2, \qquad P(\Gamma) = \lambda\sum_{j=1}^p \tau_j \rho(\Gamma_j), \qquad \mathrm{supp}(\gamma_j^m) = \mathcal{A}_j.$

Minimizing $L(\Gamma) + P(\Gamma)$ will yield the same solution as minimizing $L(B) + P(B)$, with $\hat\beta^m = \sum_j \hat\gamma_j^m$. Efficient algorithms for selected penalties are presented in Section 3.

We propose that the core penalty function $\rho(\Gamma_j)$ take the following form.

$\rho(\Gamma_j) = \rho_1 \circ \rho_2(\Gamma_j).$ (4)

The outer penalty $\rho_1 : \mathbb{R}_+ \to \mathbb{R}_+$ determines the sparsity and unbiasedness levels, and the positive half of any popular concave penalty function can be used; for example, lasso [25], SCAD [9], MCP [27], and log-sum [6, 1]. The inner penalty $\rho_2 : \mathbb{R}^{p \times M} \to \mathbb{R}_+$ determines how the penalties on the columns $\gamma_j^m$ are combined and yields different types of models. The choice $\rho_2(\Gamma) = \|\Gamma\|_F$ selects all components in an all-in-or-all-out fashion and leads to the homogeneity model, whereas the choice $\rho_2(\Gamma) = \|\Gamma^T\|_{2,1}$ (the $L_{2,1}$ norm) allows each $\gamma_j^m$ to become zero and leads to the heterogeneity model. The combination $\rho_2(\Gamma) = \alpha\|\Gamma\|_F + (1-\alpha)\|\Gamma^T\|_{2,1}$ can also be considered.

For example, we can choose the log-sum penalty for ρ1.

$\rho_{LS}(x) = \eta\log(1 + x/\eta).$ (5)

Then, depending on the choice for ρ2, we can have two types of penalties

$P_{LS1}(\Gamma) = \lambda\sum_{j=1}^p \tau_j\,\rho_{LS}(\|\Gamma_j\|_F), \qquad P_{LS2}(\Gamma) = \lambda\sum_{j=1}^p \tau_j\,\rho_{LS}(\|\Gamma_j^T\|_{2,1}).$ (6)

In both choices, the entries of $\gamma_j^m$ become zero or nonzero in an all-in-or-all-out fashion. The weights $\tau_j$ should take into account the size of the group $\mathcal{A}_j$ and are thus recommended to have the form $\tau_j = \sqrt{a_j}\,d_j$ for some $d_j > 0$. We can use homogeneous penalty weights ($d_j = 1$), or we can reflect the importance of the group $\mathcal{A}_j$ by choosing adaptive weights $d_j$. Noting that, for example, the coefficients are proportional to the absolute correlation between the response variable and predictor j as in (2), $d_j$ can be chosen to be reciprocal to the average of the absolute sample correlations, $d_j^{-1} = M^{-1}\sum_m |x_j^{mT}y^m/n_m|$.
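The two penalties in (6) can be transcribed directly (a sketch; `Gammas` is assumed to be a list of the p latent matrices $\Gamma_j$, and the names are ours):

```python
import numpy as np

def rho_ls(x, eta):
    """Log-sum penalty rho_LS(x) = eta * log(1 + x/eta)."""
    return eta * np.log1p(x / eta)

def P_LS1(Gammas, tau, lam, eta):
    """P_LS1: log-sum of the Frobenius norm of each Gamma_j
    (homogeneity model: a group enters or leaves for all datasets at once)."""
    return lam * sum(t * rho_ls(np.linalg.norm(G), eta)
                     for t, G in zip(tau, Gammas))

def P_LS2(Gammas, tau, lam, eta):
    """P_LS2: log-sum of the L_{2,1} norm of Gamma_j^T, i.e. the sum of the
    column norms ||gamma_j^m||_2 (heterogeneity model: individual dataset
    columns may be zero)."""
    return lam * sum(t * rho_ls(np.linalg.norm(G, axis=0).sum(), eta)
                     for t, G in zip(tau, Gammas))
```

For large $\eta$ the log-sum is nearly linear, so the two functions approach group lasso penalties on the Frobenius and $L_{2,1}$ norms, respectively.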

Note that $L_1$ or $L_2$ norm based penalties discourage including highly correlated variables/groups simultaneously. In other words, true signals can be pushed out of the model by their highly correlated neighbors, and this is the case in our model as well. Due to the construction of $\mathcal{A}_j$, some $\mathcal{A}_j$'s are often similar or even identical. As a remedy to this issue, we follow the idea of the elastic net [32], which adds a ridge penalty.

$P_R(B) = \frac{\lambda_R}{2}\|B\|_F^2.$

The ridge penalty effectively reduces the correlations between groups and facilitates inclusion of all potentially important signals. This completes the objective function of our model.

$\ell(B) = L(B) + P_R(B) + P(B),$

or equivalently

$\ell(\Gamma) = L(\Gamma) + P_R(\Gamma) + P(\Gamma), \qquad \mathrm{supp}(\gamma_j^m) = \mathcal{A}_j,$

where

$P_R(\Gamma) = \sum_{j=1}^p \frac{\lambda_R a_j}{2}\|\Gamma_j\|_F^2.$

Many existing penalties can be viewed as special cases of ours. If we have only one dataset and we choose $\rho_1(x) = x$ and $\rho_2(\Gamma) = \|\Gamma\|_F$, then our method simplifies to the sparse regression incorporating graphical structure among predictors [26]. If no edge exists or no graphical information is available, we have $\mathcal{A}_j = \{j\}$ for all j. In this case, if we choose the minimax concave penalty for $\rho_1$, our method reduces to the methods introduced in Liu et al. [21]. If we choose $\rho_1(x) = x$ and the Frobenius norm for $\rho_2$, our model simplifies to the integrative analysis model with the group lasso penalty proposed by Ma et al. [22].

3 |. ALGORITHM

In this section, we present an efficient algorithm to fit our model with the log-sum penalties defined in (6), which enhances its usefulness in the analysis of high-dimensional data such as genomics data. We also consider two other penalties in this work. Instead of (5), we can use the MCP penalty [27] for $\rho_1$:

$\rho_{MCP}(x) = \int_0^x \big(1 - u/(\lambda\eta)\big)_+\,du.$ (7)

Or, we can have the convex penalty

$\rho_1(x) = x, \qquad \rho_2(\Gamma) = \alpha\|\Gamma\|_F + (1-\alpha)\|\Gamma^T\|_{2,1}.$ (8)

The algorithms for (7) and (8) can be found in Chang et al. [8].

Let $\delta^m = (\delta_1^{mT}, \ldots, \delta_p^{mT})^T$ and $\Delta_j = [\delta_j^1 \cdots \delta_j^M]$, where $\delta_j^m = \gamma_{\mathcal{A}_j,j}^m$ is the vector of unconstrained coefficients in $\gamma_j^m$. Let $Z_j^m$ be the submatrix of $X^m$ comprising the columns corresponding to $\mathcal{A}_j$, and let $Z^m = [Z_1^m \cdots Z_p^m]$. Denoting $\Delta = (\Delta_1, \ldots, \Delta_p)$, our objective function can be decomposed into a differentiable part $L(\Delta) + P_R(\Delta)$ and a non-differentiable part $P_{LS1}(\Delta)$ or $P_{LS2}(\Delta)$, where

$L(\Delta) = \sum_{m=1}^M \frac{1}{2n_m}\|y^m - Z^m\delta^m\|_2^2, \qquad P_R(\Delta) = \sum_{j=1}^p \frac{\lambda_R a_j}{2}\|\Delta_j\|_F^2,$
$P_{LS1}(\Delta) = \sum_{j=1}^p \lambda\eta\tau_j\log(1 + \|\Delta_j\|_F/\eta), \qquad P_{LS2}(\Delta) = \sum_{j=1}^p \lambda\eta\tau_j\log(1 + \|\Delta_j^T\|_{2,1}/\eta).$
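The construction of $Z^m$ amounts to duplicating each column of $X^m$ once per group that contains it; a sketch (names ours, not the paper's):

```python
import numpy as np

def expand_design(X_m, neighborhoods):
    """Build Z^m = [Z_1^m ... Z_p^m], where Z_j^m holds the columns of X^m
    indexed by the neighborhood A_j. Because the groups overlap, a column of
    X^m may appear in several blocks; delta^m stacks the corresponding free
    coefficients block by block."""
    idx = [np.array(sorted(A_j)) for A_j in neighborhoods]
    Z_m = np.hstack([X_m[:, ix] for ix in idx])
    sizes = [len(ix) for ix in idx]   # the a_j, used to slice delta^m
    return Z_m, sizes
```

After fitting, $\hat\beta^m$ is recovered by scattering each block of $\hat\delta^m$ back onto its neighborhood and summing, per $\hat\beta^m = \sum_j \hat\gamma_j^m$.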

We use the accelerated proximal gradient descent algorithm (FISTA, Beck and Teboulle [3]) to fit our models. While the log-sum penalty is not convex, its second derivative is bounded from below and it satisfies the criteria in Gong et al. [10]. Propositions 1 and 2 describe how to evaluate the proximal operators for PLS1 and PLS2, respectively. Let Δ˜ be the proximal operator associated with penalty P(Δ) evaluated at Δ, as defined below.

$\tilde\Delta \equiv \mathrm{prox}_t(\Delta) \equiv \operatorname*{argmin}_{W = (W_1, \ldots, W_p)}\left(\frac{1}{2t}\sum_{j=1}^p\|W_j - \Delta_j\|_F^2 + P(W)\right).$

Proposition 1.

For $t < \eta/(\lambda\max_j \tau_j)$, the proximal operator associated with the penalty $P_{LS1}(\Delta)$ is given by

$\tilde\Delta_j = \big(1 - \lambda t\tau_j h_j/\|\Delta_j\|_F\big)_+\,\Delta_j, \quad j = 1, \ldots, p,$ (9)

where

$h_j = \dfrac{1 + \|\Delta_j\|_F/\eta - \sqrt{\big(1 + \|\Delta_j\|_F/\eta\big)^2 - 4\lambda t\tau_j/\eta}}{2\lambda t\tau_j/\eta}.$ (10)
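Proposition 1 translates directly into code. The following sketch (function name ours) applies (9)-(10) to a single group; our handling of a negative discriminant, where no interior stationary point exists and the whole group is thresholded to zero, is an assumption on our part:

```python
import numpy as np

def prox_ls1_group(Delta_j, t, lam, tau_j, eta):
    """Proximal step (9)-(10) for one group under P_LS1.
    Assumes the step size satisfies t < eta / (lam * tau_j)."""
    a = np.linalg.norm(Delta_j)          # ||Delta_j||_F
    if a == 0.0:
        return np.zeros_like(Delta_j)
    c = lam * t * tau_j                  # effective shrinkage scale
    b = 1.0 + a / eta
    disc = b * b - 4.0 * c / eta
    if disc < 0.0:                       # no interior stationary point:
        return np.zeros_like(Delta_j)    # group thresholded to zero (assumed)
    h = (b - np.sqrt(disc)) / (2.0 * c / eta)
    return max(0.0, 1.0 - c * h / a) * Delta_j
```

At a nonzero output with norm r, the stationarity condition $(r - a)/t + \lambda\tau_j/(1 + r/\eta) = 0$ of the underlying scalar problem holds, which is a convenient correctness check.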

Proposition 2.

For $t < \eta/(\lambda\max_j \tau_j)$, the proximal operator associated with the penalty $P_{LS2}(\Delta)$ is given by

$\tilde\delta_j^m = \big(1 - \lambda t\tau_j h_j/\|\delta_j^m\|_2\big)_+\,\delta_j^m,$ (11)

where hj satisfies

$h_j = \dfrac{1}{1 + \sum_{l=1}^M\big(\|\delta_j^l\|_2 - \lambda t\tau_j h_j\big)_+/\eta}.$ (12)

The proofs for Propositions 1 and 2 are included in Web Appendix A. Note that (12) is a piecewise quadratic equation in $h_j$ whose analytic solution can be easily obtained as follows. Let $\xi_j^m = \|\delta_j^m\|_2/(\lambda t\tau_j)$. Equation (12) can be rewritten as

$h_j = \dfrac{1}{1 + \lambda t\tau_j\sum_{l=1}^M\big(\xi_j^l - h_j\big)_+/\eta}.$ (13)

Sort $\xi_j^1, \ldots, \xi_j^M$ in ascending order ($\mathcal{O}(M\log M)$) and assume, for simplicity, that

$0 = \xi_j^0 \le \xi_j^1 \le \cdots \le \xi_j^K < 1 \le \cdots \le \xi_j^M,$

for some $K \le M$. First, note that $h_j = 1$ if and only if $\xi_j^M \le 1$. Suppose $\xi_j^M > 1$ and $h_j \in [\xi_j^{k-1}, \xi_j^k)$. From (13), we have the candidate solution $h_j^k$ as follows.

$h_j^k = \dfrac{1 + \lambda t\tau_j\sum_{l=k}^M \xi_j^l/\eta - \sqrt{\Big(1 + \lambda t\tau_j\sum_{l=k}^M \xi_j^l/\eta\Big)^2 - 4\lambda t\tau_j(M-k+1)/\eta}}{2\lambda t\tau_j(M-k+1)/\eta}.$

If $h_j^k \in [\xi_j^{k-1}, \xi_j^k)$ for some $k \in \{1, \ldots, K\}$, it is indeed the solution of (12). Otherwise, $h_j^{K+1} \in [\xi_j^K, 1)$ is the solution of (12).
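The bracketing search for $h_j$ and the resulting column-wise update (11) can be sketched as follows (names ours; `c` plays the role of $\lambda t\tau_j/\eta$ from (13)):

```python
import numpy as np

def solve_h(xi, c):
    """Solve h * (1 + c * sum_l max(xi_l - h, 0)) = 1 for h in (0, 1], i.e.
    the piecewise-quadratic fixed-point equation (13) with c = lam*t*tau_j/eta."""
    xi = np.sort(np.asarray(xi, dtype=float))
    M = xi.size
    if xi[-1] <= 1.0:
        return 1.0
    edges = np.concatenate(([0.0], xi[xi < 1.0], [1.0]))
    h = 1.0
    for k in range(1, edges.size):       # bracket [edges[k-1], edges[k])
        m = M - (k - 1)                  # number of xi_l above the bracket
        S = xi[k - 1:].sum()
        b = 1.0 + c * S
        disc = b * b - 4.0 * c * m
        if disc < 0.0:
            continue
        h = (b - np.sqrt(disc)) / (2.0 * c * m)
        if edges[k - 1] <= h < edges[k]:
            break                        # candidate lies in its own bracket
    return h                             # else: the last (K+1-th) candidate

def prox_ls2_group(Delta_j, t, lam, tau_j, eta):
    """Column-wise proximal update (11) for one group under P_LS2."""
    norms = np.linalg.norm(Delta_j, axis=0)   # ||delta_j^m||_2, m = 1..M
    s = lam * t * tau_j
    h = solve_h(norms / s, s / eta)
    scale = np.maximum(0.0, 1.0 - s * h / np.maximum(norms, 1e-300))
    return Delta_j * scale
```

Only the bracket that actually contains the fixed point returns its candidate, matching the case analysis above.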

The algorithm uses the standard accelerated proximal gradient descent with backtracking line search. Each iteration requires $\mathcal{O}(pN + Me)$ operations for $P_{LS1}$ and $\mathcal{O}(pN + Me + pM\log M)$ for $P_{LS2}$. We have also investigated the non-accelerated proximal gradient descent algorithm and found that the accelerated version has a substantial advantage when the sample size is small and the ridge penalty $\lambda_R$ is 0 or close to 0.

4 |. THEORETICAL PROPERTIES

In this section, we study the theoretical properties of the proposed method. The main goal is to provide the conditions under which the oracle inequality and the oracle property hold in the context of integrative analysis. Although the theorem statements and the proofs may look similar to those in Yu and Liu [26], the implications apply to the analysis of a large number of datasets. Also, note that the result of $\sqrt{n}$-consistency presented here (Theorem 3) is more general than that of Yu and Liu [26], as the oracle property therein is discussed with fixed p only.

Let $J_0^m = \{j : \beta_j^{0m} \neq 0\}$ be the set of important variables of the m-th dataset and $J_0 = \bigcup_{m=1}^M J_0^m$ be the union of all important variables. Define $s_0 = |J_0|$ as the number of all important variables. Let $J_1 = \{j \in J_0 : \mathcal{A}_j \subseteq J_0\}$ be the set of groups which contain important features only, $J_2 = \{j : \mathcal{A}_j \cap J_0 \neq \emptyset\}$ be the set of groups which contain at least one important gene, and $J_3 = \{j \in J_0^c : \mathcal{A}_j \subseteq J_0^c\}$ be the set of groups which contain unimportant features only, and let $s_1 = |J_1|$, $s_2 = |J_2|$, and $s_3 = |J_3|$. Focusing on general penalties with $\rho(\cdot) \ge \|\cdot\|_F$, we first present the oracle inequalities under homogeneous penalty weights $d_j = 1$, i.e., $\tau_j = \sqrt{a_j}$ for $j = 1, \ldots, p$. These are non-asymptotic finite sample properties which account for a diverging number of datasets and predictors. Then, we discuss the model selection consistency and the asymptotic normality under adaptive penalty weights $d_j$. To this end, define $d^* = \max_{j \in J_1} d_j$ and $d_* = \min_{j \in J_1^c} d_j$. Noting that the ridge penalty is not required for the oracle properties to hold and only needs to be small enough, we set $\lambda_R = 0$ in this section.

For simplicity, we assume nm = n for m = 1, ... , M and thus N = Mn, and define

y=1n[y1yM],X=1ndiag(X1,...,XM),β=vec(B),e=1n[e1eM].

We vectorize Γj in this section, γj = vec(Γj), and the penalty is written as

$\|\beta\|_\tau = \min_{\beta = \sum_{j=1}^p \gamma_j}\ \sum_{j=1}^p \tau_j\rho(\gamma_j), \qquad \mathrm{supp}(\gamma_j^m) = \mathcal{A}_j,$

with a slight abuse of notation, $\rho(\gamma_j) \equiv \rho(\Gamma_j)$. Then, the objective function is as follows.

$\frac{1}{2}\|y - X\beta\|_2^2 + \lambda\|\beta\|_\tau.$ (14)

Let $\hat\beta$ be the minimizer of (14) and $\hat\gamma_1, \ldots, \hat\gamma_p$ be an optimal decomposition of $\hat\beta$.

We present the oracle inequalities for estimation and prediction errors. Let $Q_1 = \max_{m,j}\|x_j^m\|_2^2/n$ be the largest empirical variance of the predictors. Let $\beta^0 = \mathrm{vec}(B^0)$ be the stacked true regression coefficients and $\beta_{J_0}^0$ be the stacked nonzero true regression coefficients. Let $\beta_*^0$ be the smallest absolute value of the nonzero true coefficients across all datasets. For $\beta \in \mathbb{R}^{pM}$, let $U(\beta) = \big\{(\gamma_1, \ldots, \gamma_p) : \sum_j \gamma_j = \beta,\ \|\beta\|_\tau = \sum_j \tau_j\rho(\gamma_j),\ \mathrm{supp}(\gamma_j^m) = \mathcal{A}_j\big\}$ be the set of all optimal decompositions of $\beta$, and let $K_\tau(\beta)$ be the minimal number of nonzero $\gamma_j$'s over the optimal decompositions of $\beta$, i.e., $K_\tau(\beta) = \min_{\Gamma \in U(\beta)}|\{j : \gamma_j \neq 0\}|$. Denote $K_\tau = \sup_{\mathrm{supp}(\beta) \subseteq \bigcup_{j \in J_2}\mathcal{A}_j} K_\tau(\beta)$. We can check $J_1 = J_0$, $J_2 = J_0$, $J_3 = J_0^c$, and $K_\tau = s_0$ if the graph G has no edge. We need the following assumptions.

Assumption 1.

The important feature set $J_0$ is covered by $\{\mathcal{A}_j : j \in J_1\}$. That is, $\bigcup_{j \in J_1}\mathcal{A}_j = J_0$.

Assumption 2.

The errors $e^m \overset{iid}{\sim} \mathcal{N}(0, \sigma^2 I)$ for $m = 1, \ldots, M$.

Assumption 3.

There exists a constant κ > 0 such that

$\inf_{|J| \le s_2,\ \beta \in \mathbb{R}^{pM}\setminus\{0\}}\ \inf_{\Gamma \in \mathcal{T}_\tau(\beta, J)}\ \frac{\|X\beta\|_2}{\sqrt{\sum_{j \in J}\tau_j^2\rho(\gamma_j)^2}} \ge \kappa,$

where $\mathcal{T}_\tau(\beta, J)$ is the set of all optimal decompositions $\Gamma = (\gamma_1, \ldots, \gamma_p)$ of $\beta$ such that $\sum_{j \in J^c}\tau_j\rho(\gamma_j) \le 3\sum_{j \in J}\tau_j\rho(\gamma_j)$.

In order to select the correct model, the groups that include any unimportant variable must not be selected and only the groups that have important variables only may be selected. Assumption 1 ensures that all important variables are covered by the groups with important variables only. Although we assume Gaussian errors in Assumption 2, the asymptotic properties presented in this paper hold for any iid mean zero sub-Gaussian errors. Assumption 3 is similar to the restricted eigenvalue condition or the compatibility condition [5] which is commonly used for these types of inequalities but has been tailored to our proposed penalty.

Theorem 1.

(Oracle inequalities) Suppose Assumptions 1, 2, and 3 hold. Assume $\rho(\gamma)$ is a norm such that $\rho(\gamma) \ge \|\gamma\|_2$. Let $d_j = 1$, i.e., $\tau_j = \sqrt{a_j}$ for $j = 1, \ldots, p$. If we choose $\lambda \ge 4\sigma\sqrt{\frac{MQ_1(A + 2\log(Mp))}{n}}$ for some $A > 0$, then the following inequalities hold with probability at least $1 - 2\exp(-A/2)$.

$\|X(\hat\beta - \beta^0)\|_2 \le \frac{4\lambda K_\tau^{1/2}}{\kappa}, \qquad \|\hat\beta - \beta^0\|_\tau \le \frac{16\lambda K_\tau}{\kappa^2}, \qquad \|\hat\beta - \beta^0\|_2 \le \frac{16\lambda K_\tau}{\kappa^2}.$

Please see Web Appendix B for proofs. Note that the results of Theorem 1 are general and consistent with the results shown in existing literature. For example, if M = 1 and we choose ρ1(x) = x and ρ2(Γ) = ‖Γ‖F, we obtain the same results as in Yu and Liu [26]. If, in addition, there is no edge in the graph, we obtain the results similar to Bickel et al. [5].

We now present the oracle property focusing on the homogeneity model ρ(γ) = ‖γ2. The objective function can be written in terms of Γ as follows.

$\frac{1}{2}\Big\|y - X\sum_{j=1}^p\gamma_j\Big\|_2^2 + \lambda\sum_{j=1}^p\tau_j\|\gamma_j\|_2, \qquad \mathrm{supp}(\gamma_j^m) = \mathcal{A}_j.$ (15)

Let $\hat\gamma_1, \ldots, \hat\gamma_p$ be the minimizer of (15) and $\hat\beta = \sum_{j=1}^p\hat\gamma_j$ be the solution. Let $\mathcal{R} \subseteq 2^{J_1}$ represent the collection of subsets of $\{\mathcal{A}_j : j \in J_1\}$ which cover the important variables $J_0$. That is, $R \in \mathcal{R}$ if and only if $\bigcup_{j \in R}\mathcal{A}_j = J_0$. Define $R_0 = \operatorname{argmin}_{R \in \mathcal{R}}\sum_{j \in R}\tau_j^2$ and $S_0 = \sum_{j \in R_0}a_j$. The collection $\mathcal{R}$ is not empty due to Assumption 1. Note that we have $S_0 = s_0$ if the graph G has no edge. Let $Q_2 > 0$ be the smallest eigenvalue of $X_{J_0}^TX_{J_0}$ and let $\xi = \|X_{J_0^c}^TX_{J_0}(X_{J_0}^TX_{J_0})^{-1}\|$. In Theorems 2 and 3, we present low-level conditions required for model selection consistency and asymptotic normality, respectively. In Corollaries 1 and 2, we list conditions on individual parameters required for the oracle property, which depend on the adaptivity of the penalty weights $d_j$.

Theorem 2.

(Model selection consistency) Suppose Assumptions 1 and 2 hold. Consider ρ(γ) = ‖γ2. If

$\dfrac{\sqrt{\log(Ms_0)}}{\beta_*^0\sqrt{n}\,Q_2} + \dfrac{\lambda\sqrt{S_0}\,d^*}{Q_2\beta_*^0} + \dfrac{\sqrt{MQ_1\log(M(p - s_0))}}{\lambda d_* n} + \dfrac{\max(\xi, 1)\,d^*\sqrt{MS_0}}{d_*} \to 0,$ (16)

then we have $\mathrm{sign}(\hat\beta) = \mathrm{sign}(\beta^0)$ with probability tending to 1.

Remark 1.

The first two terms in (16) control the deviation of the nonzero coefficients from their ground truth. The last two terms in (16) ensure the penalties are large enough to suppress the coefficients of unimportant predictors.

Our method also possesses the property of asymptotic normality. However, in order to have $\sqrt{n}$-consistency, we need a stronger condition compared to the model selection consistency.

Theorem 3.

(Asymptotic normality) Assume the conditions in Theorem 2, and further assume

$\dfrac{\lambda d^*\sqrt{nS_0}}{Q_2} \to 0.$ (17)

Let $v = \alpha^T(X_{J_0}^TX_{J_0})^{-1}\alpha$ for any sequence of nonzero vectors $\alpha$ of length $M|J_0|$. Then, we have

$\sqrt{n}\,\alpha^T(\hat\beta_{J_0} - \beta_{J_0}^0)/\sqrt{v} \overset{d}{\to} \mathcal{N}(0, \sigma^2).$

We now investigate conditions for individual factors which guarantee the oracle property.

Assumption 4.

$S_0 \asymp s_0 \asymp n^\alpha$ where $0 \le \alpha < 1$.

Assumption 5.

$Q_1 \asymp Q_2 \asymp \xi \asymp 1$.

Assumption 6.

$\beta_*^0 \asymp s_0^{-1/2} \asymp n^{-\alpha/2}$.

Assumption 7.

$\lambda = o(n^{-(1+\alpha)/2})$ and $d^* \asymp 1$.

The number of important variables must typically be less than the sample size. This is also connected in part to the condition on the smallest eigenvalue $Q_2$ of $X_{J_0}^TX_{J_0}$ in Assumption 5. The predictors can always be standardized, so we can have $Q_1 \asymp 1$ as well. The assumption $\xi \asymp 1$ is similar to, but weaker than, the irrepresentable condition [28], since the bound need not be less than 1. We consider the signal-to-noise ratio fixed at a constant level; therefore, $\|\beta_{J_0}^{0m}\|_2 \asymp 1$ and Assumption 6 are plausible. Assumption 7 sets a penalty cap which limits the bias for important variables caused by the penalty. The conditions for M, p and the lower bound of $\lambda$ depend on the minimum adaptive penalty weight $d_*$ on the unimportant variables.

Corollary 1.

(Strongly adaptive penalty weights) Suppose Assumptions 4–7 hold. If $d_* \asymp N^{\gamma/2}$ with $\gamma \ge 1$, then conditions (16) and (17) are satisfied if

$\log M = o(n^{1-\alpha}), \qquad \log p = o(n^{1-\alpha}), \qquad \lambda^{-1} = o\big(n(\log(Mp))^{-1/2}\big).$

If the adaptive penalty weights for unimportant variables are chosen at a rate of $\sqrt{N}$ or higher, the number M of datasets our method can accommodate for the oracle property depends only on the number of important variables, and we can have an exponentially growing number of datasets with respect to n raised to a certain power. However, if the penalty weights are weakly adaptive, meaning that the minimum adaptive penalty weight for unimportant variables grows at a rate lower than $\sqrt{N}$, our method may only accommodate a polynomially increasing number of datasets with respect to n.

Corollary 2.

(Weakly adaptive penalty weights) Suppose Assumptions 4–7 hold. If $d_* \asymp N^{\gamma/2}$ with $\alpha < \gamma < 1$, then conditions (16) and (17) are satisfied if

$M = o\big(n^{\frac{\gamma-\alpha}{1-\gamma}}\big), \qquad \log p = o\big(M^{1-\gamma}n^{\gamma-\alpha}\big), \qquad \lambda^{-1} = o\big(M^{\frac{1-\gamma}{2}}n^{\frac{1+\gamma}{2}}(\log(Mp))^{-\frac{1}{2}}\big).$

It is worth noting that while the oracle inequality (Theorem 1) holds with the convex penalty and no adaptation (dj = 1), the oracle property (Theorems 2 and 3) requires an adaptive penalty. This result is consistent with the behavior of the ordinary lasso regression. The L1 penalty can achieve the oracle inequality [4], but cannot achieve the oracle property without further assumptions [28]. The adaptive lasso [31] or non-convex penalties [9] can achieve the oracle property.

5 |. SIMULATION

We conduct a simulation study to evaluate the performance of our method compared to existing integrative learning methods that do not incorporate graph information. We compare fully heterogeneous (FHT; independent estimation and tuning) models, integrative homogeneity (IHM) models, and integrative heterogeneity (IHT) models. IHM and IHT refer to the homogeneity model and the heterogeneity model, respectively, as defined in Zhao et al. [29]. We denote our SIL methods by SIL-Lasso, SIL-MCP, and SIL-LS, which use (8), (7), and (5) for ρ1, respectively. The heterogeneity SIL-Lasso uses (8) for ρ2 while fixing α = 1 for its homogeneity version. The homogeneity versions of SIL-MCP and SIL-LS use ρ2(Γ) = ‖Γ‖F and the heterogeneity versions use ρ2(Γ) = ‖Γ𝑇2,1.

The FHT competing models include Lasso [25], Enet [32], and SRIG [26], the IHM competitors include L2 gMCP [21] and gLasso [22], and the IHT competitors include L1 gMCP [21] and sgLasso (sparse gLasso), which uses

$P(B) = \lambda\alpha\|B^T\|_{2,1} + \lambda(1-\alpha)\|B\|_{1,1}.$

We describe how to generate the precision matrix of Xm. For each m = 1, ... , M, we generate a block diagonal matrix

$\Omega^m = \mathrm{diag}(\Omega_1^m, \ldots, \Omega_B^m),$

where each sub-matrix $\Omega_b^m$ is a $p_B \times p_B$ symmetric matrix. We consider three different types of graphical structure for $\Omega_b^m$ depending on the scenario. The detailed procedure goes as follows.

  1. Set $\Omega_b^m$ to a $p_B \times p_B$ zero matrix.

  2. Depending on the scenario, generate the nonzero lower triangular entries specified below as 𝒱(1.5, 0.5).

    • Scenario 1 (ring type): $[\Omega_b^m]_{p_B,1}$ and $[\Omega_b^m]_{k,k-1}$ for $k > 1$ are nonzero.

    • Scenario 2 (hub type): $[\Omega_b^m]_{k,1}$ for $k > 1$ are nonzero.

    • Scenario 3 (random type): each $[\Omega_b^m]_{j,k}$ $(j > k)$ is nonzero with probability $3/p_B$.

  3. Fill in the upper triangular entries: $\Omega_b^m \leftarrow \Omega_b^m + \Omega_b^{mT}$.

  4. Set the diagonal entries: $[\Omega_b^m]_{jj} \leftarrow 0.5 + \sum_{k=1}^{p_B}\big|[\Omega_b^m]_{jk}\big|$.

  5. Normalize $\Omega_b^m$ such that the diagonal elements of its inverse matrix become 1.
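The block-generation steps above can be sketched as follows. The magnitude distribution for the nonzero entries (we use Uniform(0.5, 1.5)) and the exact diagonal rule (0.5 plus the absolute row sum, which guarantees positive definiteness) are assumptions on our part, and the function names are ours:

```python
import numpy as np

def make_block_precision(p_B, scenario, rng):
    """Generate one p_B x p_B precision block following steps 1-5."""
    Om = np.zeros((p_B, p_B))                     # step 1
    draw = lambda: rng.uniform(0.5, 1.5)          # assumed magnitude law
    if scenario == 1:                             # ring: k ~ k-1, plus p_B ~ 1
        for k in range(1, p_B):
            Om[k, k - 1] = draw()
        Om[p_B - 1, 0] = draw()
    elif scenario == 2:                           # hub: every node tied to node 1
        for k in range(1, p_B):
            Om[k, 0] = draw()
    else:                                         # random: lower entries w.p. 3/p_B
        for j in range(1, p_B):
            for k in range(j):
                if rng.random() < 3.0 / p_B:
                    Om[j, k] = draw()
    Om = Om + Om.T                                # step 3: symmetrize
    # step 4 (assumed rule): diagonal dominance ensures positive definiteness
    np.fill_diagonal(Om, 0.5 + np.abs(Om).sum(axis=1))
    # step 5: rescale so that diag(Om^{-1}) = 1
    Sigma = np.linalg.inv(Om)
    d = np.sqrt(np.diag(Sigma))
    return Om * np.outer(d, d)
```

The final rescaling is a congruence transform, so it preserves positive definiteness while normalizing the implied covariance to unit variances.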

The true regression coefficient vector $\beta^m$ is given by

βm=[αTΩ1mαTΩ2m00]T,

for some vector 𝜶. To create heterogeneity, the second block of features is set to have no influence on the outcome variable with probability pht. That is, we have

βm=[αTΩ1m000]Tw.p.pht.

For each scenario, each row of X^m is independently sampled from 𝒩(0, (Ω^m)^{−1}). The responses are then generated from the linear model

y^m = X^m β^m + e^m,

where e^m ∼ 𝒩(0, σ²I). We generate a total of N = nM observations, with n samples assigned to each dataset.
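Putting the pieces together, one dataset can be generated as in the following sketch. The function `simulate_dataset` and its arguments are our own naming; it assumes the block structure described above, with the first two blocks carrying signal and the second block dropped with probability p_ht.

```python
import numpy as np

def simulate_dataset(Omega, alpha, n, p_ht, rng, sigma=1.0):
    """Simulate (X, y, beta) for one dataset given its precision matrix.

    Omega is block diagonal with blocks of size len(alpha); beta has
    blocks alpha^T Omega_1 and alpha^T Omega_2, the latter zeroed out
    with probability p_ht to induce heterogeneity.
    """
    p, pB = Omega.shape[0], len(alpha)
    beta = np.zeros(p)
    beta[:pB] = Omega[:pB, :pB] @ alpha             # first signal block
    if rng.random() >= p_ht:                        # keep block 2 w.p. 1 - p_ht
        beta[pB:2 * pB] = Omega[pB:2 * pB, pB:2 * pB] @ alpha
    Sigma = np.linalg.inv(Omega)                    # rows of X ~ N(0, Omega^{-1})
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    y = X @ beta + sigma * rng.standard_normal(n)   # y = X beta + e
    return X, y, beta
```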

We consider M = 5 datasets with p = 100 features (B = 10, p_B = 10). The error variance is σ² = 1, and we use α = [1, 1∕3, ⋯, 1∕3]ᵀ for scenarios 1 and 3 and α = [1, 1∕4, ⋯, 1∕4]ᵀ for scenario 2, which yields a signal-to-noise ratio of roughly 2.5 in all scenarios. Our methods are tuned by the validation method: the tuning parameters are selected simultaneously via a grid search over the multi-dimensional tuning parameter space. For example, IHM-SIL-LS for Scenario 1 in Table 1 searches over 25 × 10 × 6 grid points of (λ, η, λ_R); the tuple that minimizes the validated prediction error is selected and used for predicting the testing data. The training sample size is n = 200, the validation sample size is n_υ = 200, and the testing sample size is n_t = 1000. Every method is fitted and tuned in this way for a total of 100 replicates. In the tables, we report the simulation results evaluated by the mean squared prediction error (MSE), the average L2 distance between the estimated and true coefficients, the false positive rate (FPR), and the false negative rate (FNR).
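The four reported metrics can be computed per dataset as in the following sketch. The function and its signature are ours; we assume the usual definitions, with FPR taken over the truly zero coefficients and FNR over the truly nonzero ones.

```python
import numpy as np

def evaluate(beta_hat, beta_true, X_test, y_test):
    """Return (MSE, L2, FPR, FNR) for one fitted coefficient vector."""
    mse = np.mean((y_test - X_test @ beta_hat) ** 2)      # prediction error
    l2 = np.linalg.norm(beta_hat - beta_true)             # estimation error
    sel, true = beta_hat != 0, beta_true != 0
    fpr = np.mean(sel[~true]) if (~true).any() else 0.0   # zero coefs selected
    fnr = np.mean(~sel[true]) if true.any() else 0.0      # nonzero coefs missed
    return mse, l2, fpr, fnr
```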

TABLE 1.

Simulation results for homogeneity data. FHT; fully heterogeneous models, IHM; integrative homogeneity models, IHT; integrative heterogeneity models, 𝒢; Y indicates the method incorporates graph information, MSE; mean squared prediction error, L2; average L2 distance between estimated coefficients and true coefficients, FPR; false positive rates, FNR; false negative rates.

Type Method 𝒢 Scenario 1 Scenario 2 Scenario 3
MSE L 2 FPR FNR MSE L 2 FPR FNR MSE L 2 FPR FNR

FHT Lasso 1.274 (.004) 1.509 (.010) 0.330 (.005) 0.238 (.005) 1.348 (.004) 1.902 (.014) 0.402 (.005) 0.234 (.006) 1.274 (.004) 1.539 (.014) 0.313 (.005) 0.297 (.008)
Enet 1.274 (.004) 1.509 (.010) 0.330 (.005) 0.238 (.005) 1.348 (.004) 1.902 (.014) 0.402 (.005) 0.234 (.006) 1.274 (.004) 1.539 (.014) 0.313 (.005) 0.297 (.008)
SRIG Y 1.156 (.004) 1.052 (.008) 0.125 (.003) 0.112 (.003) 1.157 (.003) 1.269 (.014) 0.100 (.005) 0.000 (.000) 1.194 (.005) 1.223 (.015) 0.208 (.006) 0.089 (.004)

IHM gLasso 1.185 (.004) 1.262 (.010) 0.573 (.008) 0.026 (.003) 1.255 (.004) 1.670 (.013) 0.718 (.008) 0.010 (.002) 1.186 (.004) 1.285 (.014) 0.576 (.009) 0.096 (.007)
L2 gMCP 1.157 (.005) 1.116 (.019) 0.156 (.015) 0.231 (.015) 1.164 (.004) 1.097 (.017) 0.152 (.014) 0.056 (.007) 1.144 (.004) 1.070 (.016) 0.161 (.016) 0.254 (.012)
SIL-Lasso Y 1.099 (.004) 0.838 (.018) 0.187 (.019) 0.006 (.002) 1.130 (.003) 1.011 (.013) 0.169 (.016) 0.000 (.000) 1.109 (.004) 0.916 (.014) 0.229 (.017) 0.009 (.002)
SIL-MCP Y 1.109 (.004) 0.911 (.020) 0.092 (.015) 0.037 (.005) 1.120 (.003) 0.923 (.014) 0.045 (.009) 0.000 (.000) 1.111 (.003) 0.935 (.014) 0.122 (.014) 0.030 (.004)
SIL-LS Y 1.102 (.004) 0.864 (.018) 0.119 (.017) 0.028 (.004) 1.120 (.003) 0.921 (.014) 0.066 (.013) 0.000 (.000) 1.107 (.004) 0.912 (.013) 0.115 (.015) 0.033 (.005)

IHT sgLasso 1.194 (.004) 1.308 (.014) 0.536 (.016) 0.034 (.004) 1.266 (.004) 1.706 (.017) 0.689 (.013) 0.021 (.004) 1.194 (.004) 1.318 (.016) 0.552 (.015) 0.105 (.008)
L1 gMCP 1.190 (.006) 1.260 (.017) 0.052 (.005) 0.348 (.009) 1.190 (.006) 1.140 (.026) 0.065 (.005) 0.136 (.012) 1.170 (.006) 1.164 (.020) 0.068 (.006) 0.345 (.011)
SIL-Lasso Y 1.105 (.004) 0.852 (.016) 0.204 (.019) 0.011 (.002) 1.135 (.003) 1.044 (.014) 0.159 (.014) 0.000 (.000) 1.115 (.004) 0.936 (.014) 0.242 (.018) 0.015 (.003)
SIL-MCP Y 1.127 (.004) 0.991 (.019) 0.045 (.007) 0.072 (.005) 1.128 (.005) 0.957 (.028) 0.031 (.006) 0.000 (.000) 1.130 (.004) 1.022 (.015) 0.073 (.009) 0.068 (.006)
SIL-LS Y 1.116 (.004) 0.923 (.018) 0.061 (.009) 0.060 (.004) 1.125 (.004) 0.944 (.015) 0.046 (.010) 0.000 (.000) 1.123 (.004) 0.974 (.014) 0.084 (.010) 0.064 (.006)

In Table 1, we consider the case where all datasets have a homogeneous sparsity structure, i.e., p_ht = 0. For all scenarios, the integrative approaches (IHM and IHT) tend to outperform the fully heterogeneous methods (FHT), as they can take advantage of the common sparsity structure of the coefficients; the exception is SRIG, the one FHT method that incorporates graph information. Since our methods also use the graphical knowledge, they clearly outperform the other existing integrative learning methods. In particular, our three IHM methods show the best performance overall. Although our IHT versions lose slightly more weak signals than our IHM versions do, the loss is still much less severe than for the other existing IHT methods. This demonstrates the advantages of incorporating network information into integrative learning.

In Table 2, datasets can have different sparsity structures, with p_ht = 0.3. We observe performance patterns similar to those in Table 1. Although all methods exhibit slightly worse FPR and FNR than in Table 1 due to the heterogeneity in the sparsity structure of the coefficients, our methods still deliver substantially better variable selection performance than the methods without graph incorporation or the non-integrative learning methods. It is particularly worth noting that the existing IHT methods degrade more, relative to the homogeneity data (Table 1), than our IHT methods do. This again confirms the advantages of incorporating network information into integrative learning.

TABLE 2.

Simulation results for heterogeneity data. FHT; fully heterogeneous models, IHM; integrative homogeneity models, IHT; integrative heterogeneity models, 𝒢; Y indicates the method incorporates graph information, MSE; mean squared prediction error, L2; average L2 distance between estimated coefficients and true coefficients, FPR; false positive rates, FNR; false negative rates.

Type Method 𝒢 Scenario 1 Scenario 2 Scenario 3
MSE L 2 FPR FNR MSE L 2 FPR FNR MSE L 2 FPR FNR

FHT Lasso 1.245 (.005) 1.427 (.011) 0.295 (.005) 0.247 (.005) 1.310 (.005) 1.821 (.015) 0.357 (.006) 0.257 (.007) 1.244 (.005) 1.458 (.015) 0.278 (.005) 0.313 (.008)
Enet 1.245 (.005) 1.427 (.011) 0.295 (.005) 0.247 (.005) 1.310 (.005) 1.821 (.015) 0.357 (.006) 0.257 (.007) 1.244 (.005) 1.458 (.015) 0.278 (.005) 0.313 (.008)
SRIG Y 1.135 (.004) 0.967 (.011) 0.112 (.003) 0.115 (.004) 1.132 (.004) 1.149 (.016) 0.094 (.006) 0.000 (.000) 1.164 (.005) 1.120 (.015) 0.193 (.006) 0.088 (.005)

IHM gLasso 1.178 (.004) 1.238 (.010) 0.576 (.008) 0.030 (.004) 1.243 (.004) 1.641 (.013) 0.706 (.008) 0.016 (.003) 1.179 (.004) 1.262 (.013) 0.570 (.009) 0.099 (.007)
L2 gMCP 1.159 (.005) 1.123 (.021) 0.157 (.015) 0.265 (.016) 1.169 (.006) 1.152 (.024) 0.209 (.016) 0.080 (.011) 1.147 (.004) 1.090 (.018) 0.180 (.018) 0.281 (.015)
SIL-Lasso Y 1.105 (.005) 0.860 (.020) 0.217 (.019) 0.013 (.003) 1.137 (.004) 1.064 (.022) 0.215 (.020) 0.003 (.002) 1.113 (.004) 0.938 (.017) 0.266 (.018) 0.014 (.003)
SIL-MCP Y 1.113 (.004) 0.913 (.021) 0.123 (.015) 0.049 (.006) 1.127 (.005) 0.980 (.025) 0.073 (.010) 0.003 (.002) 1.113 (.004) 0.945 (.018) 0.140 (.014) 0.034 (.004)
SIL-LS Y 1.107 (.004) 0.884 (.020) 0.137 (.016) 0.039 (.005) 1.126 (.005) 0.970 (.025) 0.085 (.012) 0.003 (.002) 1.110 (.004) 0.926 (.017) 0.156 (.016) 0.033 (.005)

IHT sgLasso 1.195 (.004) 1.311 (.016) 0.499 (.019) 0.061 (.007) 1.260 (.005) 1.706 (.021) 0.629 (.016) 0.051 (.008) 1.196 (.005) 1.335 (.019) 0.500 (.019) 0.145 (.012)
L1 gMCP 1.191 (.006) 1.246 (.020) 0.073 (.006) 0.369 (.012) 1.192 (.006) 1.197 (.029) 0.079 (.005) 0.192 (.013) 1.166 (.005) 1.164 (.023) 0.071 (.005) 0.381 (.012)
SIL-Lasso Y 1.107 (.005) 0.862 (.022) 0.214 (.018) 0.016 (.003) 1.128 (.004) 1.033 (.018) 0.179 (.017) 0.001 (.001) 1.119 (.006) 0.952 (.021) 0.249 (.018) 0.025 (.004)
SIL-MCP Y 1.129 (.004) 0.987 (.019) 0.070 (.007) 0.091 (.008) 1.126 (.005) 0.957 (.026) 0.067 (.006) 0.000 (.000) 1.133 (.005) 1.031 (.020) 0.085 (.008) 0.087 (.009)
SIL-LS Y 1.117 (.004) 0.913 (.020) 0.082 (.010) 0.079 (.006) 1.124 (.004) 0.970 (.020) 0.073 (.009) 0.001 (.001) 1.122 (.005) 0.976 (.020) 0.096 (.010) 0.080 (.008)

Since the proposed methods rely on graph information, we conduct a sensitivity analysis that accounts for uncertainty in the graphical knowledge and its inconsistency with the regression coefficients. In the sensitivity analysis, we randomly remove about 20% of the edges from the true graph and use the reduced graph as the working graph. This mimics the intermediate situation where only strong interactions are known, or the case where edges (partial correlations) are missing due to a screening of predictors. In Table 3, we can see that the performance of the methods that use graph information deteriorates, while that of the methods that do not use graph information remains similar to Table 1. However, the deterioration of our methods is very small compared to that of SRIG, which appears attributable to the effect of integrative learning. This lends support to the robustness of our methods to misspecified graphical information with missing edges.

TABLE 3.

Sensitivity analysis results. FHT; fully heterogeneous models, IHM; integrative homogeneity models, IHT; integrative heterogeneity models, 𝒢; Y indicates the method incorporates graph information, MSE; mean squared prediction error, L2; average L2 distance between estimated coefficients and true coefficients, FPR; false positive rates, FNR; false negative rates.

Type Method 𝒢 Scenario 1 Scenario 2 Scenario 3
MSE L 2 FPR FNR MSE L 2 FPR FNR MSE L 2 FPR FNR

FHT Lasso 1.267 (.004) 1.508 (.009) 0.322 (.005) 0.250 (.005) 1.349 (.004) 1.925 (.014) 0.400 (.006) 0.226 (.006) 1.279 (.005) 1.565 (.014) 0.329 (.005) 0.285 (.007)
Enet 1.267 (.004) 1.508 (.009) 0.322 (.005) 0.250 (.005) 1.349 (.004) 1.925 (.014) 0.400 (.006) 0.226 (.006) 1.279 (.005) 1.565 (.014) 0.329 (.005) 0.285 (.007)
SRIG Y 1.184 (.005) 1.220 (.018) 0.150 (.006) 0.166 (.006) 1.245 (.006) 1.495 (.024) 0.275 (.009) 0.158 (.007) 1.239 (.006) 1.469 (.018) 0.151 (.005) 0.199 (.007)

IHM gLasso 1.178 (.004) 1.256 (.011) 0.583 (.008) 0.024 (.004) 1.254 (.004) 1.685 (.012) 0.717 (.007) 0.005 (.002) 1.191 (.004) 1.310 (.013) 0.607 (.009) 0.084 (.006)
L2 gMCP 1.148 (.004) 1.080 (.019) 0.251 (.020) 0.166 (.016) 1.161 (.004) 1.108 (.017) 0.168 (.015) 0.042 (.006) 1.146 (.005) 1.080 (.019) 0.182 (.014) 0.214 (.010)
SIL-Lasso Y 1.108 (.004) 0.917 (.015) 0.221 (.020) 0.006 (.002) 1.141 (.003) 1.102 (.012) 0.257 (.016) 0.002 (.001) 1.114 (.004) 0.969 (.015) 0.231 (.017) 0.017 (.004)
SIL-MCP Y 1.110 (.004) 0.931 (.016) 0.111 (.017) 0.032 (.005) 1.124 (.004) 0.942 (.015) 0.073 (.010) 0.006 (.002) 1.112 (.004) 0.951 (.014) 0.099 (.015) 0.060 (.007)
SIL-LS Y 1.106 (.004) 0.906 (.015) 0.124 (.018) 0.030 (.005) 1.122 (.003) 0.949 (.011) 0.082 (.012) 0.008 (.002) 1.107 (.004) 0.927 (.013) 0.081 (.014) 0.054 (.006)

IHT sgLasso 1.189 (.004) 1.290 (.014) 0.576 (.016) 0.035 (.006) 1.263 (.004) 1.719 (.017) 0.688 (.012) 0.017 (.003) 1.204 (.005) 1.363 (.015) 0.556 (.016) 0.104 (.008)
L1 gMCP 1.187 (.007) 1.254 (.020) 0.070 (.008) 0.350 (.011) 1.179 (.006) 1.124 (.023) 0.056 (.004) 0.132 (.012) 1.164 (.005) 1.141 (.020) 0.074 (.005) 0.321 (.012)
SIL-Lasso Y 1.113 (.004) 0.934 (.016) 0.217 (.019) 0.015 (.003) 1.149 (.004) 1.131 (.014) 0.247 (.015) 0.014 (.002) 1.115 (.004) 0.968 (.012) 0.235 (.017) 0.020 (.004)
SIL-MCP Y 1.130 (.004) 1.026 (.016) 0.042 (.006) 0.105 (.007) 1.135 (.004) 0.981 (.021) 0.029 (.004) 0.047 (.005) 1.127 (.004) 1.015 (.015) 0.035 (.005) 0.121 (.009)
SIL-LS Y 1.120 (.004) 0.971 (.015) 0.058 (.009) 0.090 (.006) 1.137 (.003) 0.997 (.013) 0.049 (.007) 0.060 (.004) 1.122 (.004) 0.989 (.016) 0.044 (.007) 0.110 (.008)

6 |. APPLICATION

Alzheimer’s disease (AD) is a major cause of dementia. The Alzheimer’s Disease Neuroimaging Initiative (ADNI) is a large-scale, multisite longitudinal study in which researchers at 63 sites track the progression of AD in the human brain through normal aging, early mild cognitive impairment (EMCI), and late mild cognitive impairment (LMCI) to dementia or AD. Its goal is to validate diagnostic and prognostic biomarkers that can predict the progression of AD.

In our data analysis, we investigate the association between patients’ gene expression levels and an imaging marker that captures AD progression. Specifically, we treat fluorodeoxyglucose positron emission tomography (FDG-PET), averaged over the regions of interest (ROI), as the response variable; FDG-PET measures cell metabolism, and cells affected by AD tend to show reduced metabolism. Since the association of FDG with gene expression levels may change at different stages of AD, we divide the total of 675 subjects into three groups according to their baseline disease status: CN (cognitively normal, n = 229), MCI (EMCI+LMCI, n = 402), and AD (n = 44).

The samples in each group are randomly split into a training set (50%), a validation set (25%), and a testing set (25%). For each split, we fit our method and the existing methods considered in Section 5, plus some fully homogeneous models to check the heterogeneity of the datasets, and report the prediction errors on the testing samples. The regularization parameters of all methods are tuned by the validation method, and the graph information is obtained from KEGG. This procedure is repeated for 200 random splits of the data, and the average squared prediction errors are reported in Table 4.
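The repeated random-split procedure can be sketched with a helper like the one below (a hypothetical utility, not the authors' code); sizes round down when the group size is not divisible by four.

```python
import numpy as np

def split_indices(n, rng):
    """Randomly split n sample indices 50/25/25 into train/validation/test."""
    idx = rng.permutation(n)
    n_tr, n_v = n // 2, n // 4
    return idx[:n_tr], idx[n_tr:n_tr + n_v], idx[n_tr + n_v:]
```

Repeating the split 200 times with fresh `rng` draws and averaging the test-set squared prediction errors reproduces the reporting scheme described above.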

TABLE 4.

Average prediction errors for ADNI dataset. FHM; fully homogeneous models, FHT; fully heterogeneous models, IHM; integrative homogeneity models, IHT; integrative heterogeneity models, 𝒢; Y indicates the method incorporates graph information.

Type Method 𝒢 MCI AD CN

FHM Lasso 0.926 1.094 1.020
Enet 0.901 1.075 0.987
SRIG Y 0.955 1.063 1.035

FHT Lasso 0.916 1.028 0.996
Enet 0.881 0.983 0.961
SRIG Y 0.933 1.035 1.008

IHM gLasso 0.934 1.017 0.991
L2 gMCP 0.898 1.027 1.005
SIL-Lasso Y 0.873 0.946 0.946
SIL-MCP Y 0.876 0.947 0.948
SIL-LS Y 0.879 0.948 0.945

IHT sgLasso 0.939 1.022 0.996
L1 gMCP 0.914 1.045 1.001
SIL-Lasso Y 0.862 0.950 0.940
SIL-MCP Y 0.878 0.934 0.955
SIL-LS Y 0.881 0.941 0.949

As shown in Table 4, all of the FHM methods tend to underperform the corresponding FHT methods, suggesting that the model of interest likely has different parameters for different groups. Despite such heterogeneity, our methods show the best prediction performance for all groups. The existing integrative learning approaches, which do not incorporate network information, appear to have difficulty integrating information from the different datasets.

Another benefit of incorporating graphical pathway information is the enhanced interpretability of the selected genes. To assess this, we conduct a pathway enrichment analysis based on the 30 genes most frequently selected by each method across the 200 repeats. Table 5 lists 10 enriched pathways related to Alzheimer’s disease and the associated p-values. No method that omits graph information, including the existing integrative learning approaches, yields any enriched pathway. Apart from the SIL methods, only the fully heterogeneous SRIG yields some enriched pathways, and its p-values tend to be larger than those of our methods.

TABLE 5.

Ten enriched pathways and p-values for each method. ‘-’ indicates not enriched in the genes selected by the method. FHM; fully homogeneous models, FHT; fully heterogeneous models, IHM; integrative homogeneity models, IHT; integrative heterogeneity models, 𝒢; Y indicates the method incorporates graph information, P1; AGE-RAGE signaling pathway, P2; Angiopoietin receptor Tie2-mediated signaling, P3; Chemokine signaling pathway, P4; CXCR4-mediated signaling events, P5; Glucocorticoid receptor regulatory network, P6; IL2-mediated signaling events, P7; MAPKinase Signaling Pathway, P8; Prolactin signaling pathway, P9; Signaling by PDGF, P10; Tuberculosis.

Type Method 𝒢 P1 P2 P3 P4 P5 P6 P7 P8 P9 P10

FHM Lasso - - - - - - - - - -
Enet - - - - - - - - - -
SRIG Y - - - - - - - - - -

FHT Lasso - - - - - - - - - -
Enet - - - - - - - - - -
SRIG Y 1.6e-5 5.8e-5 - - 2.9e-4 7.3e-5 2.8e-4 1.8e-4 - -

IHM gLasso - - - - - - - - - -
L2 gMCP - - - - - - - - - -
SIL-Lasso Y 2.1e-9 4.2e-6 - 9.7e-8 1.1e-6 5.8e-6 1.1e-6 - 1.6e-6 2.9e-6
SIL-MCP Y 1.1e-6 1.9e-6 2.1e-5 1.2e-6 1.6e-5 4.1e-8 1.6e-5 1.9e-7 4.9e-6 1.9e-5
SIL-LS Y 1.3e-6 1.1e-4 - - 2.0e-5 1.5e-4 1.1e-4 1.0e-5 8.1e-5 -

IHT sgLasso - - - - - - - - - -
L1 gMCP - - - - - - - - - -
SIL-Lasso Y 5.8e-11 1.3e-9 7.7e-9 1.3e-12 1.3e-6 1.3e-7 1.3e-6 - - 1.7e-7
SIL-MCP Y - 2.1e-4 2.6e-4 - 1.5e-6 - 5.0e-6 2.7e-5 2.0e-4
SIL-LS Y 1.3e-6 1.1e-4 - 4.4e-5 - 1.5e-4 1.1e-4 1.0e-5 - -

7 |. DISCUSSION

We have proposed a novel integrative learning method, called SIL, which can incorporate the graphical structure of features. SIL possesses appealing theoretical properties, is scalable to high-dimensional data, and has been shown to outperform existing integrative learning methods through a simulation study and a real data analysis.

In practice, the ground-truth sparsity structure of β0 may not be consistent with the graphical structure. However, when the discrepancy is moderate, our proposed method will still perform reasonably well by detecting the subset of groups that covers all or most of the nonzero coefficients. Note that the sensitivity analysis (Table 3), which was conducted partly in consideration of such inconsistency, suggests the proposed method is quite robust. Even when the graphical information is completely irrelevant to the sparsity structure, our method will not fail: the tuning procedure will discourage group-wise selection, and we can expect performance comparable to that of plain ridge regression.

On the other hand, it is widely acknowledged that the graph information obtained from existing databases could be inaccurate or incomplete. It is potentially of future interest to investigate approaches that are more robust to incomplete graph information. One potential approach is to combine the graph information from existing databases and the estimated graph information using the data being analyzed. Another direction for future research is to incorporate graph information that may vary between datasets.


ACKNOWLEDGMENTS

This work is partly supported by NIH grant RF1AG063481. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The complete ADNI Acknowledgement is available online.

Footnotes

Conflict of interest

The authors declare no potential conflict of interests.

Financial disclosure

None reported.

SUPPORTING INFORMATION

The supplementary material available online includes additional algorithms and the proof of theorems.

References

  • [1] Armagan A, Dunson DB, and Lee J, 2013: Generalized double pareto shrinkage. Statistica Sinica, 23, no. 1, 119–143.
  • [2] Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, and Sherlock G, 2000: Gene ontology: tool for the unification of biology. Nature Genetics, 25, no. 1, 25–29, doi: 10.1038/75556.
  • [3] Beck A and Teboulle M, 2009: A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2, no. 1, 183–202.
  • [4] Bühlmann P and van de Geer S, 2011: Statistics for high-dimensional data: Methods, theory and applications. Springer Series in Statistics, Berlin: Springer.
  • [5] Bickel PJ, Ritov Y, and Tsybakov AB, 2009: Simultaneous analysis of lasso and dantzig selector. Ann. Statist., 37, no. 4, 1705–1732, doi: 10.1214/08-AOS620.
  • [6] Candès EJ, Wakin MB, and Boyd SP, 2008: Enhancing sparsity by reweighted l1 minimization. Journal of Fourier Analysis and Applications, 14, no. 5, 877–905, doi: 10.1007/s00041-008-9045-x.
  • [7] Chang C, Kundu S, and Long Q, 2018: Scalable bayesian variable selection for structured high-dimensional data. Biometrics, 74, no. 4, 1372–1382.
  • [8] Chang C, Oh J, and Long Q, 2020: Gria: Graphical regularization for integrative analysis. Proceedings of the 2020 SIAM International Conference on Data Mining, 604–612.
  • [9] Fan J and Li R, 2001: Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96, no. 456, 1348–1360, doi: 10.1198/016214501753382273.
  • [10] Gong P, Zhang C, Lu Z, Huang JZ, and Ye J, 2013: A general iterative shrinkage and thresholding algorithm for non-convex regularized optimization problems. Proceedings of the 30th International Conference on Machine Learning, JMLR.org, ICML’13, II-37–II-45. URL http://dl.acm.org/citation.cfm?id=3042817.3042898
  • [11] Huang Y, Zhang Q, Zhang S, Huang J, and Ma S, 2017: Promoting similarity of sparsity structures in integrative analysis with penalization. Journal of the American Statistical Association, 112, no. 517, 342–350.
  • [12] Jacob L, Obozinski G, and Vert J-P, 2009: Group lasso with overlap and graph lasso. Proceedings of the 26th Annual International Conference on Machine Learning, ACM, New York, NY, USA, ICML ’09, 433–440.
  • [13] Kanehisa M, Furumichi M, Tanabe M, Sato Y, and Morishima K, 2017: KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research, 45, no. D1, D353–D361.
  • [14] Li C and Li H, 2008: Network-constrained regularization and variable selection for analysis of genomic data. Bioinformatics, 24, no. 9, 1175–1182.
  • [15] Li F and Zhang NR, 2010: Bayesian variable selection in structured high-dimensional covariate spaces with applications in genomics. Journal of the American Statistical Association, 105, no. 491, 1202–1214.
  • [16] Li Q, Wang S, Huang C-C, Yu M, and Shao J, 2014: Meta-analysis based variable selection for gene expression data. Biometrics, 70, no. 4, 872–880.
  • [17] Li Z, Chang C, Kundu S, and Long Q, 2020: Bayesian generalized biclustering analysis via adaptive structured shrinkage. Biostatistics, 21, no. 3, 610–624.
  • [18] Liu B, Wu C, Shen X, and Pan W, 2017: A novel and efficient algorithm for de novo discovery of mutated driver pathways in cancer. The Annals of Applied Statistics, 11, no. 3, 1481.
  • [19] Liu J, Huang J, and Ma S, 2013: Incorporating network structure in integrative analysis of cancer prognosis data. Genetic Epidemiology, 37, no. 2, 173–183.
  • [20] Liu J, Huang J, Zhang Y, Lan Q, Rothman N, Zheng T, and Ma S, 2014: Integrative analysis of prognosis data on multiple cancer subtypes. Biometrics, 70, no. 3, 480–488.
  • [21] Liu J, Ma S, and Huang J, 2014: Integrative analysis of cancer diagnosis studies with composite penalization. Scandinavian Journal of Statistics, 41, no. 1, 87–103.
  • [22] Ma S, Huang J, and Song X, 2011: Integrative analysis and variable selection with multiple high-dimensional data sets. Biostatistics, 12, no. 4, 763–775.
  • [23] Pan W, Xie B, and Shen X, 2010: Incorporating predictor network in penalized regression with application to microarray data. Biometrics, 66, no. 2, 474–484.
  • [24] Stingo FC, Chen YA, Tadesse MG, and Vannucci M, 2011: Incorporating biological information into linear models: a Bayesian approach to the selection of pathways and genes. Annals of Applied Statistics, 5, no. 3, 1978–2002.
  • [25] Tibshirani R, 1996: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58, no. 1, 267–288.
  • [26] Yu G and Liu Y, 2016: Sparse regression incorporating graphical structure among predictors. Journal of the American Statistical Association, 111, no. 514, 707–720.
  • [27] Zhang C-H, 2010: Nearly unbiased variable selection under minimax concave penalty. Ann. Statist., 38, no. 2, 894–942, doi: 10.1214/09-AOS729.
  • [28] Zhao P and Yu B, 2006: On model selection consistency of lasso. J. Mach. Learn. Res., 7, 2541–2563.
  • [29] Zhao Q, Shi X, Huang J, Liu J, Li Y, and Ma S, 2015: Integrative analysis of ‘-omics’ data using penalty functions. Wiley Interdisciplinary Reviews: Computational Statistics, 7, no. 1, 99–108.
  • [30] Zhao Y, Chang C, and Long Q, 2019: Knowledge-guided statistical learning methods for analysis of high-dimensional -omics data in precision oncology. JCO Precision Oncology, no. 3, 1–9, doi: 10.1200/PO.19.00018.
  • [31] Zou H, 2006: The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101, no. 476, 1418–1429.
  • [32] Zou H and Hastie T, 2005: Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67, no. 2, 301–320.
