Abstract
Partial least squares, as a dimension reduction technique, has become increasingly important for its ability to deal with problems with a large number of variables. Since noisy variables may weaken estimation performance, the sparse partial least squares (SPLS) technique has been proposed to identify important variables and generate more interpretable results. However, the small sample size of a single dataset limits the performance of conventional methods. An effective solution is to gather information from multiple comparable studies. Integrative analysis, which improves performance by pooling raw data from multiple independent datasets and analyzing them jointly, is of essential importance in such multidataset settings. In this article, we develop an integrative SPLS (iSPLS) method using penalization based on the SPLS technique. The proposed approach consists of two penalties. The first penalty conducts variable selection under the context of integrative analysis. The second penalty, a contrasted penalty, is imposed to encourage the similarity of estimates across datasets and to generate more sensible and accurate results. Computational algorithms are developed. Simulation experiments are conducted to compare iSPLS with alternative approaches. The practical utility of iSPLS is shown in the analysis of two TCGA gene expression datasets.
Keywords: contrasted penalization, integrative analysis, partial least squares
1 ∣. INTRODUCTION
Data with high-dimensional variables are becoming routine. With such data, partial least squares (PLS), initially developed by Wold et al,1 has been successfully used as a dimension reduction method in many areas such as chemometrics2 and genetics.3 PLS reduces variable dimension by constructing new components, which are linear combinations of the original variables. It possesses much-desired properties such as stability under collinearity and high dimensionality, giving it a clear advantage over many other methods. In high-dimensional analysis, noise accumulation from irrelevant variables has long been recognized.4 For example, in omics studies, it is widely accepted that only a small fraction of genes are associated with outcomes. To yield more accurate estimation and facilitate interpretation, variable selection needs to be considered. Chun and Keleş5 propose a sparse PLS (SPLS) technique that conducts variable selection and dimension reduction simultaneously by imposing elastic net penalization in the PLS optimization.
In general, for data analysis with a large number of variables but a limited sample size, performance is often unsatisfactory.6 With fast data accumulation, one possible solution is to pool information from multiple datasets generated under similar protocols to increase power. Common multidataset approaches include meta-analysis, integrative analysis, and others. In a series of studies, integrative analysis has been shown, both theoretically and numerically, to be highly effective,7,8 especially when compared to “classic” meta-analysis.9 More discussions on integrative analysis are available in the literature10 and provided in Section 2.2.
Considering the extensive applications of PLS/SPLS in high-dimensional data analysis, and motivated by the success of integrative analysis built on other techniques, in this article, our goal is to develop an integrative SPLS (iSPLS) approach, which can improve the performance of PLS/SPLS by effectively borrowing information from multiple comparable studies. The proposed approach applies penalization to the PLS objective function. In particular, it consists of two penalty terms. The first penalty conducts regularized estimation and variable selection, as in a “standard” integrative analysis study.10 The second penalty, a key innovation, is further imposed to encourage certain similarity across datasets, which may further improve performance and also facilitate interpretation. Overall, this study advances the PLS/SPLS technique into the integrative analysis paradigm. It also advances integrative analysis by introducing SPLS, a highly effective sparse dimension reduction technique. Our numerical analysis demonstrates that it provides a practically useful tool for high-dimensional data analysis. With its methodological and numerical developments, this study advances beyond the existing literature.
The rest of the article is organized as follows. In Section 2, we first briefly review the general principles of PLS and SPLS for the completeness of this article, and then formulate the iSPLS method and develop effective computational algorithms. Simulation studies and applications to TCGA data are conducted in Sections 3 and 4, respectively. The article concludes with discussions in Section 5. Additional technical details and numerical results are provided in the Appendix.
2 ∣. METHODS
2.1 ∣. Sparse partial least squares
Let $Y \in \mathbb{R}^{n \times q}$ and $X \in \mathbb{R}^{n \times p}$ denote the response matrix and predictor matrix, respectively. PLS assumes that there exist latent components $t_k$, $1 \le k \le K$, which are linear combinations of the predictors, such that $Y = TQ^\top + F$ and $X = TP^\top + E$, where $T = (t_1, \ldots, t_K) \in \mathbb{R}^{n \times K}$, $P \in \mathbb{R}^{p \times K}$ and $Q \in \mathbb{R}^{q \times K}$ are matrices of coefficients (loadings), and $E \in \mathbb{R}^{n \times p}$ and $F \in \mathbb{R}^{n \times q}$ are matrices of random errors.
PLS solves the optimization problem for the direction vectors $w_k$ successively. Specifically, $w_k$ is the solution to:

$$\max_{w}\ w^\top Z Z^\top w \tag{1}$$
under certain constraints (in particular, the unit-norm constraint $w^\top w = 1$), where $Z = X^\top Y$. This optimization problem can be solved using the NIPALS,1 SIMPLS,11 and other algorithms. After estimating the number of direction vectors $K$ and the direction matrix $\hat{W} = (\hat{w}_1, \ldots, \hat{w}_K)$, the latent components can be calculated as $\hat{T} = X\hat{W}$. The final estimator is then $\hat{B} = \hat{W}\hat{Q}^\top$, where $\hat{Q}$ is the solution of $\min_{Q} \|Y - \hat{T}Q^\top\|_F^2$. Details are available in the literature.12
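For concreteness, the following is a minimal numpy sketch of this construction: the first direction as the leading left singular vector of $Z = X^\top Y$, response deflation, and the final least squares step. The function names and the choice to deflate $Y$ on each extracted component are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def pls_first_direction(X, Y):
    """First PLS direction: unit-norm maximizer of w' Z Z' w,
    ie, the leading left singular vector of Z = X'Y."""
    Z = X.T @ Y                                  # p x q cross-product matrix
    return np.linalg.svd(Z, full_matrices=False)[0][:, 0]

def pls_fit(X, Y, K=2):
    """Extract K directions sequentially, deflating Y after each
    component, then regress Y on the latent components T = XW."""
    Y_res, dirs = Y.copy(), []
    for _ in range(K):
        w = pls_first_direction(X, Y_res)
        dirs.append(w)
        t = X @ w                                # latent component
        Y_res = Y_res - np.outer(t, t @ Y_res) / (t @ t)  # update response
    W = np.column_stack(dirs)
    T = X @ W
    Q = np.linalg.lstsq(T, Y, rcond=None)[0]     # K x q loading matrix
    return W @ Q                                 # p x q coefficient estimate
```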
With high-dimensional variables, regularized estimation is needed to distinguish signals from noise. Note that noisy variables enter the PLS regression through the direction vectors. As such, one possibility is to apply penalization in the optimization procedure, for example, imposing an $L_1$ constraint on the direction vector in (1). Consider the first SPLS direction vector, which can be obtained by solving:
$$\max_{w}\ w^\top Z Z^\top w \quad \text{subject to } w^\top w = 1,\ \|w\|_1 \le \lambda, \tag{2}$$
where the tuning parameter λ controls the degree of sparsity.
However, Jolliffe et al13 point out the concavity issue of problem (2) as well as the lack of sparsity. Chun and Keleş5 then develop a generalized form of the SPLS problem (2), which has the following formulation and produces a sparse solution:
$$\min_{w,\,c}\ -\kappa\, w^\top Z Z^\top w + (1-\kappa)(c - w)^\top Z Z^\top (c - w) + \lambda_1 \|c\|_1 + \lambda_2 \|c\|_2^2 \quad \text{subject to } w^\top w = 1. \tag{3}$$
Here $c$ is a surrogate of the direction vector $w$ and is kept close to $w$. Penalties are imposed on $c$ rather than on the original direction $w$. The additional $L_2$ penalty $\lambda_2\|c\|_2^2$ deals with the singularity of $ZZ^\top$ when solving for $c$, and a small $\kappa$ reduces the effect of the concave part. The solution to problem (3) is obtained by optimizing $w$ and $c$ iteratively. More details are available in the literature.5 Once the first direction vector is obtained, the response matrix can be updated, and the second and subsequent directions can be generated using the same technique. Given this methodological similarity, published studies usually focus on estimating the first direction, which is also the most important.
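As an illustration of this alternating scheme, here is a hedged numpy sketch of computing the first sparse direction in the simplified regime $\lambda_2 = \infty$, where the $c$-step reduces to soft thresholding of $ZZ^\top w$. The relative-threshold parameter `eta` and the renormalization used for the $w$-step are simplifying assumptions borrowed from common SPLS software conventions, not the exact updates of the paper.

```python
import numpy as np

def soft_threshold(v, thr):
    """Elementwise soft-thresholding operator."""
    return np.sign(v) * np.maximum(np.abs(v) - thr, 0.0)

def spls_first_direction(X, Y, eta=0.7, n_iter=50, tol=1e-8):
    """eta in [0, 1): threshold taken as a fraction of the largest
    component of Z Z' w; larger eta gives a sparser direction."""
    Z = X.T @ Y
    M = Z @ Z.T
    w = np.linalg.svd(Z, full_matrices=False)[0][:, 0]   # dense start
    for _ in range(n_iter):
        v = M @ w
        c = soft_threshold(v, eta * np.max(np.abs(v)))   # sparse surrogate
        if not np.any(c):
            break
        w_new = c / np.linalg.norm(c)    # project surrogate to unit sphere
        if np.linalg.norm(w_new - w) < tol:
            return w_new
        w = w_new
    return w
```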
2.2 ∣. Integrative sparse partial least squares
2.2.1 ∣. Integrative analysis
Integrative analysis is a broadly applicable strategy for pooling information from multiple independent datasets with comparable designs to improve performance under high-dimensional settings.7,14,15 The key observation is that each variable is measured in multiple datasets, and these measurements can be viewed as forming a group. In regularized estimation (and variable selection), group-based approaches can then be applied to such groups. When variable selection is of major interest, two scenarios have generally been considered. Under the homogeneity model, multiple datasets share the same set of important variables, and “group in or group out” approaches can be applied. Under the more flexible heterogeneity model, multiple datasets can have different sets of important variables, making it necessary to further identify important members within important groups (ie, a two-level selection). In integrative analysis, penalization has been a favored tool. Integrative analysis has been conducted on gene expression, SNP, and other molecular data, under regression, clustering, and other contexts, and with applications to cancer and other biomedical studies. Comprehensive reviews are available in the literature.10
We note that in the literature, “integrative analysis” has also been used for the collective analysis of multiple types of molecular measurements on the same subjects,16 and SPLS-based techniques have been developed for such analysis. Integrating multiple types of measurements on the same subjects has been referred to as “vertical data integration,” whereas integrating multiple independent datasets, as proposed here, has been referred to as “horizontal data integration.” As SPLS under vertical data integration deals with significantly different settings and demands different developments, we refer to the literature16 and do not discuss it further.
2.2.2 ∣. Data and model settings
Consider the scenario with L datasets from independent studies with comparable designs. As well demonstrated in the literature, this is a sensible scenario for many “common” problems.6,9 Below, we develop an integrative sparse partial least squares (iSPLS) method to conduct the integrative analysis of such data based on the SPLS technique. To simplify notation, we assume that the datasets have matched predictors; the proposed approach can be applied to mismatched predictors with minor modifications.17 Prior to analysis, data preprocessing, including imputation, centering, and normalization, needs to be conducted for each dataset separately.
We use the superscript (l) to denote the lth dataset, which contains $n_l$ i.i.d. observations, $l \in \{1, \ldots, L\}$. As under typical PLS analysis, both the responses and covariates are multidimensional. Below we focus on estimating the first direction vector of each dataset, with which the response matrices can be updated separately, and the subsequent direction vectors can be estimated iteratively in the same manner. Denote $c_j^{(l)}$, $j \in \{1, \ldots, p\}$, as the weight of the jth variable in the first direction vector of the lth dataset, and $c_j = (c_j^{(1)}, \ldots, c_j^{(L)})$ as the “group” of weights of the jth variable.
2.2.3 ∣. iSPLS with contrasted penalization
For the integrative SPLS analysis of L datasets, we propose the penalized objective function:
$$\min_{\{w^{(l)},\, c^{(l)}\}}\ \sum_{l=1}^{L} \frac{1}{n_l^2} \Big\{ f(w^{(l)}, c^{(l)}) + \lambda \|c^{(l)}\|_2^2 \Big\} + \mathrm{pen}_1(c) + \mathrm{pen}_2(c) \quad \text{subject to } w^{(l)\top} w^{(l)} = 1,\ l = 1, \ldots, L, \tag{4}$$
where $f(w^{(l)}, c^{(l)}) = -\kappa\, w^{(l)\top} Z^{(l)} Z^{(l)\top} w^{(l)} + (1-\kappa)\big(c^{(l)} - w^{(l)}\big)^\top Z^{(l)} Z^{(l)\top} \big(c^{(l)} - w^{(l)}\big)$, $c = (c^{(1)}, \ldots, c^{(L)})$, and $Z^{(l)} = X^{(l)\top} Y^{(l)}$.
Similar to many existing integrative analyses,10 the objective function has the familiar “lack-of-fit + penalty” form. Specifically, $f(w^{(l)}, c^{(l)})$ is the lack-of-fit of the lth dataset, and under the independence assumption, the L lack-of-fit functions are directly added. As in single-dataset SPLS, the ridge term $\lambda\|c^{(l)}\|_2^2$ accommodates the potential singularity when solving for $c^{(l)}$. To avoid the analysis being dominated by large datasets, the normalization constants $1/n_l^2$ are added (note that $Z^{(l)} Z^{(l)\top}$ scales with $n_l^2$). Here interest lies in the joint estimation of all L direction vectors, as opposed to estimation driven by the datasets with larger sample sizes. The first penalty pen1(·) conducts variable selection in the context of integrative analysis. The second penalty pen2(·), a key innovation of the proposed approach, accommodates similarity among datasets. Below we discuss the two penalties in more detail.
2.2.4 ∣. Penalization for variable selection
First consider pen1(·). With L datasets, L sparsity structures of the direction vectors need to be estimated. As described in the literature,10 the homogeneity and heterogeneity models have different properties and demand different estimation strategies. Denote I(·) as the indicator function. Under the homogeneity structure, $I(c_j^{(l)} \ne 0) = I(c_j^{(l')} \ne 0)$ for any $j \in \{1, \ldots, p\}$ and $l, l' \in \{1, \ldots, L\}$, which means that the L datasets share the same set of important variables. Under the heterogeneity structure, it is possible that $I(c_j^{(l)} \ne 0) \ne I(c_j^{(l')} \ne 0)$ for some $j \in \{1, \ldots, p\}$ and $l, l' \in \{1, \ldots, L\}$. That is, a variable can be important in some datasets but irrelevant in others.
The proposed penalties are built on the minimax concave penalty (MCP),18 which has been shown to have performance favorable to or comparable with other penalties. MCP is defined as $\rho(t; \lambda, \gamma) = \lambda \int_0^{|t|} \big(1 - x/(\gamma\lambda)\big)_+ \, dx$, with derivative $\dot\rho(t; \lambda, \gamma) = \lambda \big(1 - |t|/(\gamma\lambda)\big)_+ \mathrm{sgn}(t)$, where $\lambda$ is a penalty parameter, $\gamma$ is a regularization parameter that controls the concavity of $\rho$, $x_+ = x\,I(x > 0)$, and $\mathrm{sgn}(t) = -1, 0,$ or $1$ for $t < 0$, $t = 0$, or $t > 0$, respectively. Based on the MCP, we consider the following penalties tailored to the two sparsity structures.
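The MCP admits a simple closed form (obtained by integrating the derivative above), which the following numpy transcription uses; this is a direct restatement of the definition, with vectorized inputs.

```python
import numpy as np

def mcp(t, lam, gamma):
    """MCP rho(t; lam, gamma): lam*|t| - t^2/(2*gamma) for |t| <= gamma*lam,
    then constant at its maximum gamma*lam^2/2."""
    at = np.abs(t)
    return np.where(at <= gamma * lam,
                    lam * at - at ** 2 / (2.0 * gamma),
                    0.5 * gamma * lam ** 2)

def mcp_deriv(t, lam, gamma):
    """Derivative lam * (1 - |t|/(gamma*lam))_+ * sgn(t)."""
    return lam * np.maximum(1.0 - np.abs(t) / (gamma * lam), 0.0) * np.sign(t)
```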
iSPLS under the homogeneity model
Consider the penalty:
$$\mathrm{pen}_1(c) = \sum_{j=1}^{p} \rho\big(\|c_j\|_2;\ \mu_1,\ a\big),$$

with regularization parameter $a$ and tuning parameter $\mu_1$. Here $\|c_j\|_2 = \big(\sum_{l=1}^{L} (c_j^{(l)})^2\big)^{1/2}$ is the $L_2$ norm of $c_j$.
This is a 2-norm group MCP,8,19 under which the L datasets select the same set of variables (ie, group in or group out).
iSPLS under the heterogeneity model
Consider the penalty:
$$\mathrm{pen}_1(c) = \sum_{j=1}^{p} \rho\Big(\sum_{l=1}^{L} \rho\big(|c_j^{(l)}|;\ \mu_1,\ b\big);\ \mu_1,\ a\Big),$$

with regularization parameters $a$ and $b$, and tuning parameter $\mu_1$. Here the inner penalty is also MCP and determines the individual importance of the components of $c_j$. Overall, this is a composite MCP. Note that this penalty allows, but does not enforce, different sparsity structures. In particular, it is possible for all parameters within one group to be estimated as nonzero, so the heterogeneity model includes the homogeneity model as a special case. Under the penalization framework, an alternative to composite penalization is sparse group penalization.19 Both techniques have been extensively adopted in the literature, and a direct and definitive comparison is still lacking. We postpone a similar analysis using sparse group penalization to future research.
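To make the two selection penalties concrete, the sketch below evaluates them on a $p \times L$ matrix `C` whose lth column holds $c^{(l)}$. It assumes `mcp()` from the sketch above is in scope; evaluating both penalties without any group-size rescaling of $\mu_1$, and the default `b`, are conventions assumed here for illustration.

```python
import numpy as np

def pen_homogeneity(C, mu1, a=6.0):
    """2-norm group MCP: one MCP evaluation per group norm ||c_j||_2."""
    group_norms = np.linalg.norm(C, axis=1)         # length-p vector
    return np.sum(mcp(group_norms, mu1, a))

def pen_heterogeneity(C, mu1, a=6.0, b=3.0):
    """Composite MCP: inner MCP on each |c_j^(l)|, outer MCP on the
    within-group sums. b = 3.0 is a placeholder; the paper links b to a."""
    inner = np.sum(mcp(np.abs(C), mu1, b), axis=1)  # sum over the L datasets
    return np.sum(mcp(inner, mu1, a))
```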
2.2.5 ∣. Contrasted penalization
The 2-norm MCP and composite MCP can conduct regularized estimation and variable selection. However, simply by looking at their forms, it is easy to see that they pay insufficient attention to the relationships among datasets. In particular, the basis of multidataset analysis is the similarity across datasets.6,9 When the datasets are reasonably similar, it can be sensible to expect certain similarity in the estimates; however, the penalties described above have no mechanism to encourage such similarity. As in some published literature,17 we further advance the proposed approach by introducing pen2(·), whose goal is to promote similarity in the estimates. In particular, we adopt different penalties to encourage different types of similarity. Here we note that determining whether multiple datasets are comparable or similar is a challenging task. Fortunately, this has been examined in the literature6,17 and will not be reiterated here. It is also noted that, as partly demonstrated in the numerical study, the proposed approach is sufficiently flexible and does not demand an accurate assessment of the similarity across datasets.
Magnitude-based contrasted penalization
When multiple datasets have a high degree of similarity, for example, when they come from the same design but were independently conducted, it is reasonable to expect the first direction vectors to have similar magnitudes. As such, we propose a penalty that shrinks the differences of the weights and thereby encourages the similarity of estimates within each group. More specifically, consider the magnitude-based contrasted penalty:

$$\mathrm{pen}_2(c) = \mu_2 \sum_{j=1}^{p} \sum_{1 \le l < l' \le L} \big(c_j^{(l)} - c_j^{(l')}\big)^2,$$

where $\mu_2 > 0$ is a tuning parameter. Overall, we refer to this approach as iSPLS-HomoM (iSPLS-HeteroM), with the subscript “M” standing for magnitude. Here, we adopt the squared $L_2$ penalty for its simplicity and note that it can be replaced by other penalties. This pairwise penalty has some connections with the fused penalty. The key difference is that our goal is to promote similarity, not equality; as such, we adopt the $L_2$ as opposed to the $L_1$ penalty, and all pairwise differences are taken, as opposed to only the adjacent ones.
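A direct numpy evaluation of this penalty on the $p \times L$ matrix `C` of surrogate directions; the explicit double loop over dataset pairs mirrors the formula and is written for clarity rather than speed.

```python
import numpy as np

def pen_magnitude(C, mu2):
    """Magnitude-based contrast: mu2 * sum of squared pairwise
    differences within each row group c_j."""
    L = C.shape[1]
    total = 0.0
    for l in range(L):
        for lp in range(l + 1, L):
            total += np.sum((C[:, l] - C[:, lp]) ** 2)
    return mu2 * total
```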
Sign-based contrasted penalization
When multiple datasets are not “sufficiently close,” demanding quantitative similarity can be too stringent. As a weaker alternative, we propose encouraging qualitative similarity.14 In particular, we encourage the weights within each group to have similar signs. Consider the sign-based contrasted penalty:

$$\mathrm{pen}_2(c) = \mu_2 \sum_{j=1}^{p} \sum_{1 \le l < l' \le L} \big(\mathrm{sgn}(c_j^{(l)}) - \mathrm{sgn}(c_j^{(l')})\big)^2,$$

where $\mu_2 > 0$ is a tuning parameter and sgn(·) is the sign function defined above. Note that the sign-based penalty is not continuous, which brings significant challenges to optimization. We thus propose the following smooth approximation:

$$\mathrm{pen}_2(c) \approx \mu_2 \sum_{j=1}^{p} \sum_{1 \le l < l' \le L} \left(\frac{c_j^{(l)}}{\sqrt{(c_j^{(l)})^2 + \tau^2}} - \frac{c_j^{(l')}}{\sqrt{(c_j^{(l')})^2 + \tau^2}}\right)^2,$$

where $\tau > 0$ is a small positive constant. There are other ways of approximating the sign function, for example, using the sigmoid function, and we conjecture that they are equally applicable here. The adopted approximation has a simple form and satisfactory performance.
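The smoothed sign-based penalty then follows by replacing each entry with its smoothed sign before taking the same pairwise differences. This reuses `pen_magnitude()` from the sketch above, and `tau2` denotes $\tau^2$.

```python
import numpy as np

def pen_sign(C, mu2, tau2=0.5):
    """Sign-based contrast with sgn(t) approximated by t/sqrt(t^2 + tau^2)."""
    S = C / np.sqrt(C ** 2 + tau2)   # smoothed signs, entries in (-1, 1)
    return pen_magnitude(S, mu2)
```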
Remarks
Penalties on the differences between magnitudes and signs have been considered in the literature,14,20,21 although under significantly different contexts. Coupling them with SPLS under the integrative analysis paradigm is novel. The contrasted penalties may be relatively easy to comprehend under the homogeneity model; we note that they remain sensible even under the heterogeneity model, when different datasets may have different sparsity structures. Specifically, they promote similarity, which can lead to simpler interpretations. In addition, the contrasted penalties are sufficiently flexible in that they do not force equality. As partly demonstrated in the numerical study below, they can accommodate datasets with different signs/magnitudes for important variables.
Algorithm 1. Computational algorithm for iSPLS

1. Initialize $\hat{c}^{(l)}$, $l = 1, \ldots, L$, for example, with the dataset-specific SPLS estimates.
2. (a) With $\{\hat{c}^{(l)}\}$ fixed, update $\hat{w}^{(l)}$ for each dataset separately. (b) With $\{\hat{w}^{(l)}\}$ fixed, update $\{\hat{c}^{(l)}\}$ jointly.
3. Repeat Step 2 until convergence.
In practical data analysis, choosing between the magnitude- and sign-based penalties (as well as between the homogeneity and heterogeneity models) can be challenging. Although in principle this can be achieved by examining the “level of similarity” (eg, the magnitude-based penalty for a higher level of similarity), it is not always straightforward. One possibility is to start with the most flexible approach and then examine the possibility of strengthening. For example, if the sign-based analysis shows similar magnitudes of estimates, then the magnitude-based analysis can be further conducted. The choice can also be made by statistically comparing analysis results under different approaches, as partly shown in our data analysis.
2.3 ∣. Computation
To optimize the proposed objective functions, we develop iterative algorithms. As in the literature,5 we optimize $w^{(l)}$ and $c^{(l)}$ iteratively for $l = 1, \ldots, L$. With fixed tuning and regularization parameters, the optimization is summarized in Algorithm 1.
In Algorithm 1, the key is Step 2. For Step 2(a), with $\{c^{(l)}\}$ fixed, the objective function in problem (4) is equivalent to:

$$\min_{w^{(l)}}\ -\kappa\, w^{(l)\top} Z^{(l)} Z^{(l)\top} w^{(l)} + (1-\kappa)\big(c^{(l)} - w^{(l)}\big)^\top Z^{(l)} Z^{(l)\top} \big(c^{(l)} - w^{(l)}\big) \quad \text{subject to } w^{(l)\top} w^{(l)} = 1, \tag{5}$$

which does not involve the group part. Thus, we can optimize $w^{(l)}$ for each dataset separately. Problem (5) can be rewritten as:

$$\min_{w^{(l)}}\ \big(w^{(l)} - \kappa' c^{(l)}\big)^\top Z^{(l)} Z^{(l)\top} \big(w^{(l)} - \kappa' c^{(l)}\big) \quad \text{subject to } w^{(l)\top} w^{(l)} = 1,$$

where $\kappa' = (1-\kappa)/(1-2\kappa)$. Then, by the method of Lagrange multipliers, we have:

$$\hat{w}^{(l)} = \kappa' \big(Z^{(l)} Z^{(l)\top} + \lambda^{*(l)} I\big)^{-1} Z^{(l)} Z^{(l)\top} c^{(l)},$$

where the multiplier $\lambda^{*(l)}$ is the solution of $\hat{w}^{(l)\top} \hat{w}^{(l)} = 1$.
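The $w$-update can thus be implemented with a one-dimensional search for the multiplier: for each candidate $\lambda^{*}$, solve the linear system and check the norm. The bisection below is a hedged sketch; it assumes $\kappa < 1/2$ and that $\|w(\lambda^{*})\|$ crosses 1 within the bracket, which may need adjusting in practice.

```python
import numpy as np

def w_update(M, c, kappa=0.1, lam_lo=1e-10, lam_hi=1.0, n_bisect=100):
    """Solve w = kappa' (M + lam I)^{-1} M c with lam chosen so ||w|| = 1.
    M = Z Z' (p x p, PSD); assumes ||w(lam_lo)|| > 1 so a root is bracketed."""
    kp = (1.0 - kappa) / (1.0 - 2.0 * kappa)
    I = np.eye(M.shape[0])

    def w_of(lam):
        return kp * np.linalg.solve(M + lam * I, M @ c)

    while np.linalg.norm(w_of(lam_hi)) > 1.0:   # expand until norm < 1
        lam_hi *= 2.0
    for _ in range(n_bisect):                   # ||w(lam)|| decreases in lam
        lam = 0.5 * (lam_lo + lam_hi)
        if np.linalg.norm(w_of(lam)) > 1.0:
            lam_lo = lam
        else:
            lam_hi = lam
    return w_of(0.5 * (lam_lo + lam_hi))
```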
For Step 2(b), when solving for $c^{(l)}$ with $\{\hat{w}^{(l)}\}$ fixed, problem (4) becomes:

$$\min_{\{c^{(l)}\}}\ \sum_{l=1}^{L} \frac{1}{n_l^2} \Big\{ (1-\kappa)\big(c^{(l)} - \hat{w}^{(l)}\big)^\top Z^{(l)} Z^{(l)\top} \big(c^{(l)} - \hat{w}^{(l)}\big) + \lambda \|c^{(l)}\|_2^2 \Big\} + \mathrm{pen}_1(c) + \mathrm{pen}_2(c).$$
We adopt the coordinate descent (CD) approach,22,23 which is a popular technique for penalized optimization and minimizes the objective function with respect to one group of coefficients at a time and cycles through all groups. This transforms a complicated minimization problem into a series of simple ones. In what follows, we first describe the CD algorithm for the heterogeneity model with the sign-based contrasted penalty. The computational algorithms for the homogeneity model and heterogeneity model with a magnitude-based contrasted penalty are described in the Appendix.
2.3.1 ∣. Computational algorithm for iSPLS-HeteroS
Consider the heterogeneity model with the sign-based contrasted penalty:
$$\begin{aligned}
\min_{\{c^{(l)}\}}\ & \sum_{l=1}^{L} \frac{1}{n_l^2} \Big\{ (1-\kappa)\big(c^{(l)} - \hat{w}^{(l)}\big)^\top Z^{(l)} Z^{(l)\top} \big(c^{(l)} - \hat{w}^{(l)}\big) + \lambda \|c^{(l)}\|_2^2 \Big\} \\
& + \sum_{j=1}^{p} \rho\Big(\sum_{l=1}^{L} \rho\big(|c_j^{(l)}|; \mu_1, b\big); \mu_1, a\Big) + \mu_2 \sum_{j=1}^{p} \sum_{1 \le l < l' \le L} \left(\frac{c_j^{(l)}}{\sqrt{(c_j^{(l)})^2 + \tau^2}} - \frac{c_j^{(l')}}{\sqrt{(c_j^{(l')})^2 + \tau^2}}\right)^2.
\end{aligned} \tag{6}$$
For $j = 1, \ldots, p$, given the other group parameter vectors fixed at their current estimates, we minimize objective function (6) with respect to $c_j = (c_j^{(1)}, \ldots, c_j^{(L)})$. Here $\lambda$ is required to be very large because $Z^{(l)}$ is a $p \times q$ matrix with a relatively small $q$.5 With $\lambda = \infty$, we take the first-order Taylor expansion of the first penalty with respect to $|c_j^{(l)}|$, under which the composite MCP is replaced by the weighted $L_1$ term $\sum_{l=1}^{L} \xi_j^{(l)} |c_j^{(l)}|$ with weights

$$\xi_j^{(l)} = \dot\rho\Big(\sum_{k=1}^{L} \rho\big(|\hat{c}_j^{(k)}|; \mu_1, b\big); \mu_1, a\Big)\, \dot\rho\big(|\hat{c}_j^{(l)}|; \mu_1, b\big)$$

evaluated at the current estimates. The problem is then approximately equivalent to minimizing this weighted $L_1$ term plus a quadratic function of $c_j$ contributed by the lack-of-fit and the smoothed sign-based penalty.
Then $\hat{c}_j = (\hat{c}_j^{(1)}, \ldots, \hat{c}_j^{(L)})$ can be updated as follows. For $l = 1, \ldots, L$:

1. Initialize $r = 0$ and the starting value $\hat{c}_j^{(l)}(0)$.
2. Update $r = r + 1$, and compute $\hat{c}_j^{(l)}(r)$ as the closed-form coordinate-wise minimizer of the approximated objective; with a weighted $L_1$ penalty plus a quadratic function, this is a soft-thresholding type update, with the terms involving the other datasets evaluated at their current estimates.
3. Repeat Step 2 until convergence. The estimate at convergence is $\hat{c}_j^{(l)}$.
2.3.2 ∣. Tuning parameter selection
iSPLS-HeteroS involves the regularization parameters $a$ and $b$. Breheny and Huang24 suggest linking them so that the group-level penalty attains its maximum if and only if all of its components attain theirs. Following published studies, we set $a = 6$; $b$ is then determined through the link between the inner and outer penalties of the composite penalty. iSPLS-HomoS only involves the regularization parameter $a$, which is also set to 6. We use cross-validation to choose the tuning parameters $\mu_1$ and $\mu_2$. Furthermore, iSPLS-HeteroS involves $\tau$. In our study, we fix $\tau^2 = 0.5$ and find satisfactory performance. The literature25 and our own examination suggest that this value is not critical as long as it is small enough. In other applications, to be prudent, we suggest also examining other values.
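Schematically, the cross-validation can be organized as below, with folds drawn within each dataset so that every fold retains all L datasets; `fit()` and `predict()` are hypothetical wrappers around the iSPLS updates, not functions from the authors' R program.

```python
import numpy as np

def cv_select(X_list, Y_list, fit, predict, mu1_grid, mu2_grid,
              n_folds=5, seed=0):
    """Grid search over (mu1, mu2) by n_folds-fold CV.
    fit(X_tr_list, Y_tr_list, mu1, mu2) -> model (hypothetical API);
    predict(model, X, l) -> predicted responses for dataset l."""
    rng = np.random.default_rng(seed)
    fold_ids = [rng.permutation(len(X)) % n_folds for X in X_list]
    best, best_err = None, np.inf
    for mu1 in mu1_grid:
        for mu2 in mu2_grid:
            err = 0.0
            for k in range(n_folds):
                X_tr = [X[f != k] for X, f in zip(X_list, fold_ids)]
                Y_tr = [Y[f != k] for Y, f in zip(Y_list, fold_ids)]
                model = fit(X_tr, Y_tr, mu1, mu2)
                for l, (X, Y, f) in enumerate(zip(X_list, Y_list, fold_ids)):
                    pred = predict(model, X[f == k], l)
                    err += np.mean((Y[f == k] - pred) ** 2)
            if err < best_err:
                best, best_err = (mu1, mu2), err
    return best
```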
2.3.3 ∣. Remarks
Optimization with high-dimensional penalization is a fast-moving field, and there may be techniques more efficient than CD; we adopt CD for its simplicity. With simple updates, the proposed algorithms are computationally affordable: the analysis of one simulation replicate (with details described below), including tuning parameter selection, takes less than 20 minutes on a regular desktop, and our brief exploration suggests that computational cost increases linearly with data size. To facilitate data analysis, we have developed an R program and made it publicly available at www.github.com/shuanggema.
3 ∣. SIMULATION
We simulate four independent datasets, each with sample size 40 or 120 and five response variables. For each dataset, we simulate 100 predictor variables that are jointly normally distributed, with marginal means zero and variances one. The predictors have an autoregressive correlation structure, under which variables j and k have correlation coefficient ρ∣j–k∣, with ρ = 0.2 and 0.7 corresponding to weak and strong correlations, respectively. Under all scenarios, the model Y(l) = X(l)β(l) + ϵ(l) holds, where ϵ(l) is normally distributed with mean zero. Following the literature,5 the five columns of β(l) are generated to share the same sparsity structure, which in turn controls the sparsity structure of the direction vector w(l). Within each dataset, the number of variables associated with the responses is set to 10, and the nonzero coefficients range from 0.5 to 4. We simulate under both the homogeneity and heterogeneity models.
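For reference, one simulated dataset under this design can be generated as follows; the uniform draw of the nonzero coefficients and the placement of the signal block are simplifying stand-ins for the paper's direction-vector-based construction.

```python
import numpy as np

def simulate_dataset(n=40, p=100, q=5, rho=0.2, n_signal=10, seed=0):
    """AR(1)-correlated N(0,1) predictors and a sparse linear model."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])   # cov = rho^|j-k|
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta = np.zeros((p, q))
    signal = rng.choice(p, size=n_signal, replace=False)
    beta[signal] = rng.uniform(0.5, 4.0, size=(n_signal, q))
    Y = X @ beta + rng.standard_normal((n, q))           # noise ~ N(0, 1)
    return X, Y, signal
```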
Under the homogeneity model, the direction vectors have the same sparsity structure, with similar (Scenario 1) or different (Scenario 2) nonzero values. Under the heterogeneity model, two scenarios are considered. In Scenario 3, the four datasets share five important variables in common, and the remaining important variables are dataset-specific; that is, the direction vectors have partially overlapping sparsity structures. In Scenario 4, the important variables are randomly placed, so the sparsity structures overlap randomly. These four scenarios cover a wide range of degrees of similarity across datasets.
To evaluate the accuracy of variable selection, we first compute the numbers of true/false positives/negatives. Sensitivity is then calculated as (true positive)/(true positive + false negative), and specificity as (true negative)/(true negative + false positive). For each simulation replicate (training), we generate independent testing data under the same setting. With the training-data estimates and the testing data, we compute the mean squared prediction error (MSPE), defined as the mean squared difference between the observed and predicted response values. Beyond the proposed approach, we also consider the following alternatives: (a) meta-analysis, where each dataset is analyzed separately using PLS or SPLS; in evaluating identification accuracy, variables identified in at least one dataset are counted as identified, and sensitivity and specificity are computed accordingly, while MSPE is first computed for each dataset separately and then averaged across datasets; and (b) a pooled approach, where the four datasets are pooled together and analyzed by SPLS as a whole, so the summary measures are computed once on the pooled data. For the proposed and alternative approaches, the tuning parameters are selected via 5-fold cross-validation.
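These summary measures are straightforward to compute; `selected` and `truth` below are boolean arrays over the p variables, and the definitions follow the text exactly.

```python
import numpy as np

def selection_metrics(selected, truth):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    tp = np.sum(selected & truth)
    fn = np.sum(~selected & truth)
    tn = np.sum(~selected & ~truth)
    fp = np.sum(selected & ~truth)
    return tp / (tp + fn), tn / (tn + fp)

def mspe(Y_obs, Y_pred):
    """Mean squared prediction error on testing data."""
    return np.mean((Y_obs - Y_pred) ** 2)
```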
Summary statistics based on 200 replicates are presented in Table 1 for Scenario 1; the rest of the results are presented in Tables A1 to A3 in the Appendix. Across the simulated settings, we clearly observe the competitive performance of the proposed approach. More specifically, under the homogeneity model, when the magnitudes of the nonzero values are similar across datasets (Scenario 1), iSPLS-HomoM has the most competitive performance. For example, in Table 1, with ρ = 0.2 and nl = 120, the MSPEs are 49.062 (meta-PLS), 5.686 (meta-SPLS), 1.350 (pooled-SPLS), 2.002 (iSPLS-HomoM), 2.414 (iSPLS-HomoS), 3.368 (iSPLS-HeteroM), and 3.559 (iSPLS-HeteroS). Note that under Scenario 1, the performance of iSPLS-HomoM and iSPLS-HomoS may be slightly inferior to that of pooled-SPLS; in this simulation, we have generated the four datasets to be fully comparable, which favors the pooled analysis, and this level of comparability is not realistic in practice. When the nonzero values are sensibly different across datasets (Scenario 2), as can be seen from Table A1, iSPLS-HomoS outperforms the others, including pooled-SPLS. Under the heterogeneity model with partial overlap (Scenario 3), iSPLS-HeteroM and iSPLS-HeteroS have better performance. For example, when ρ = 0.7 and nl = 40, they have higher sensitivity values (0.821 and 0.821, compared with 0.675, 0.575, 0.800, and 0.800 of the alternatives) and smaller MSPEs (24.637 and 23.734, compared with 268.880, 30.928, 84.875, 40.867, and 39.492 of the alternatives), while having similar specificity values. Under Scenario 4, which is expected to have little to no overlap and does not favor integrative analysis, the proposed approach is still observed to have reasonable performance. Similar observations have been made in published integrative analyses under significantly different contexts.
TABLE 1.
Simulation results for Scenario 1
| ρ | nl | Method | MSPE | Sensitivity | Specificity |
|---|---|---|---|---|---|
| 0.2 | 40 | Meta-PLS | 48.972 (4.676) | 1 (0) | 0 (0) |
| | | Meta-SPLS | 24.739 (3.879) | 0.632 (0.135) | 0.873 (0.111) |
| | | Pooled-SPLS | 4.377 (2.486) | 0.810 (0.127) | 0.999 (0.003) |
| | | iSPLS-HomoM | 9.452 (4.369) | 0.840 (0.110) | 0.982 (0.018) |
| | | iSPLS-HomoS | 10.151 (4.027) | 0.837 (0.119) | 0.980 (0.022) |
| | | iSPLS-HeteroM | 18.287 (6.022) | 0.845 (0.152) | 0.757 (0.063) |
| | | iSPLS-HeteroS | 15.462 (6.251) | 0.875 (0.143) | 0.743 (0.060) |
| 0.2 | 120 | Meta-PLS | 49.062 (4.151) | 1 (0) | 0 (0) |
| | | Meta-SPLS | 5.686 (2.056) | 0.799 (0.053) | 0.994 (0.007) |
| | | Pooled-SPLS | 1.350 (1.229) | 0.937 (0.025) | 0.999 (0.000) |
| | | iSPLS-HomoM | 2.002 (0.920) | 0.993 (0.008) | 0.956 (0.016) |
| | | iSPLS-HomoS | 2.414 (0.951) | 0.997 (0.008) | 0.929 (0.014) |
| | | iSPLS-HeteroM | 3.368 (1.211) | 0.955 (0.039) | 0.945 (0.019) |
| | | iSPLS-HeteroS | 3.559 (1.297) | 0.982 (0.051) | 0.872 (0.007) |
| 0.7 | 40 | Meta-PLS | 106.532 (7.066) | 1 (0) | 0 (0) |
| | | Meta-SPLS | 16.212 (4.033) | 0.828 (0.063) | 0.962 (0.011) |
| | | Pooled-SPLS | 5.984 (1.939) | 0.893 (0.065) | 0.984 (0.037) |
| | | iSPLS-HomoM | 6.956 (1.885) | 0.967 (0.018) | 0.947 (0.021) |
| | | iSPLS-HomoS | 7.000 (2.067) | 0.967 (0.018) | 0.946 (0.020) |
| | | iSPLS-HeteroM | 13.630 (3.817) | 0.896 (0.109) | 0.946 (0.019) |
| | | iSPLS-HeteroS | 13.855 (3.778) | 0.909 (0.112) | 0.942 (0.020) |
| 0.7 | 120 | Meta-PLS | 102.629 (9.225) | 1 (0) | 0 (0) |
| | | Meta-SPLS | 4.824 (1.913) | 0.912 (0.049) | 0.985 (0.012) |
| | | Pooled-SPLS | 2.454 (1.481) | 0.883 (0.056) | 0.994 (0.023) |
| | | iSPLS-HomoM | 2.292 (0.829) | 0.987 (0.018) | 0.977 (0.014) |
| | | iSPLS-HomoS | 2.356 (0.785) | 0.987 (0.018) | 0.976 (0.014) |
| | | iSPLS-HeteroM | 3.718 (0.995) | 0.988 (0.051) | 0.948 (0.013) |
| | | iSPLS-HeteroS | 3.609 (1.077) | 0.997 (0.035) | 0.942 (0.012) |
Note: In each cell, mean (SD).
Overall, when data are generated under the homogeneity model, iSPLS-HomoM and iSPLS-HomoS excel; under the heterogeneity model, iSPLS-HeteroM and iSPLS-HeteroS are preferred. When the datasets have a higher degree of similarity, the magnitude-based contrasted penalty performs better; otherwise, the sign-based contrasted penalty is needed. These observations fit the designs of the methods. In practical data analysis, as all levels of similarity in sparsity structure and regression coefficients can exist, all of the proposed methods are needed.
4 ∣. DATA ANALYSIS
4.1 ∣. Analysis of cutaneous melanoma data
We analyze three datasets from the TCGA cutaneous melanoma (SKCM) study, corresponding to three tumor stages, with 70 stage I samples, 60 stage II samples, and 110 stage III and IV samples. In the literature, analysis has been conducted on Breslow thickness, an important prognostic marker regulated by gene expressions. This response variable has a continuous distribution and has been analyzed using linear regression. Most existing studies have combined samples from different stages, ignoring their significant clinical differences. In our analysis, we acknowledge these differences and treat the three stages as three separate “studies.” On the other hand, as all samples are SKCM and the data have been collected under the same protocol, similarity is expected, which forms the basis of the integrative analysis. A total of 18 947 gene expressions have been measured for all samples. In principle, the proposed analysis can be directly conducted on all of these genes. Considering the limited sample size, we instead select a subset of “interesting” genes for analysis. In particular, in a recent study,26 a gene expression network is first constructed, and the network communities (also referred to as modules) are identified. That analysis accounts for the community structure in regularized regression and suggests that 126 genes are potentially associated with Breslow thickness. In our integrative analysis, we focus on these 126 genes. As in an “ordinary” PLS/SPLS analysis, our goal is to identify the first direction vectors that best link gene expressions with Breslow thickness. The prominent difference is that the three datasets are now analyzed simultaneously under integrative analysis.
We apply the proposed integrative analysis methods and the competitors considered in the simulation. Given its significantly inferior performance, the meta-PLS method is not considered, leaving a total of six methods. In Figure 1, we show representative analysis results (for communities 3, 5, and 42); we refer to the literature26 for detailed information on these gene communities. Here each row corresponds to one dataset/stage, and each column corresponds to one community. It is easy to see that different methods lead to significantly different estimation and identification results. This is further shown in Table 2, where we compare findings from the proposed and alternative methods and present the numbers of overlapping genes identified by different methods. Further examining the estimation details (available from the authors) suggests that the results of different methods fit their designs. For example, the homogeneity methods lead to the same identification across datasets, and the sign-based methods lead to estimates with more consistent signs.
FIGURE 1.
Analysis of SKCM data. Rhombus and cross correspond to iSPLS-Homo and iSPLS-Hetero, respectively. Blue and orange correspond to magnitude- and sign-based penalties, respectively. Pink cross and red circle correspond to meta-SPLS and pooled-SPLS, respectively
TABLE 2.
Data analysis: Numbers of overlapping genes identified by different methods
| | Pooled-SPLS | Meta-SPLS | iSPLS-HomoM | iSPLS-HomoS | iSPLS-HeteroM | iSPLS-HeteroS |
|---|---|---|---|---|---|---|
| SKCM data | | | | | | |
| Pooled-SPLS | 100 | 34 | 20 | 21 | 51 | 53 |
| Meta-SPLS | | 107 | 28 | 29 | 71 | 72 |
| iSPLS-HomoM | | | 45 | 45 | 45 | 45 |
| iSPLS-HomoS | | | | 46 | 46 | 46 |
| iSPLS-HeteroM | | | | | 83 | 75 |
| iSPLS-HeteroS | | | | | | 89 |
| Lung cancer data | | | | | | |
| Pooled-SPLS | 145 | 78 | 37 | 40 | 66 | 51 |
| Meta-SPLS | | 92 | 39 | 42 | 76 | 58 |
| iSPLS-HomoM | | | 39 | 39 | 38 | 35 |
| iSPLS-HomoS | | | | 42 | 40 | 36 |
| iSPLS-HeteroM | | | | | 66 | 58 |
| iSPLS-HeteroS | | | | | | 72 |
In practical data analysis, it is not feasible to evaluate identification and estimation accuracy as in simulation. To provide “indirect” support, we evaluate prediction performance and the stability of identification. Specifically, we first randomly partition each dataset into a training set and a testing set with a 3:1 size ratio. Estimation is conducted on the training set using the proposed and alternative methods, and prediction is then made on the testing set. The root mean squared error (RMSE) is used to measure prediction performance. This process is repeated 100 times. Furthermore, for each gene, we compute its observed occurrence index (OOI),27 that is, its frequency of being identified across the 100 random partitions. Then, for the genes identified using the whole data without partition, we compute the median of their OOI values, which provides an overall stability measure, with a higher value suggesting a more stable analysis. The results are summarized in Table 3. Integrative analysis under the heterogeneity model has much smaller RMSEs, suggesting significant differences across stages; note that differences in the genetic basis of different stages have been reported in the literature. Considering both prediction and stability, iSPLS-HeteroS seems to have the best performance.
TABLE 3.
Data analysis: RMSE and median OOI
| | Pooled-SPLS | Meta-SPLS | iSPLS-HomoM | iSPLS-HomoS | iSPLS-HeteroM | iSPLS-HeteroS |
|---|---|---|---|---|---|---|
| SKCM data | | | | | | |
| RMSE | 6.210 | 4.046 | 4.202 | 4.163 | 3.202 | 3.135 |
| OOI | 0.76 | 0.75 | 0.80 | 0.80 | 0.77 | 0.78 |
| Lung cancer data | | | | | | |
| RMSE | 32.367 | 27.837 | 22.269 | 20.412 | 21.019 | 20.318 |
| OOI | 0.71 | 0.73 | 0.78 | 0.78 | 0.76 | 0.75 |
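The resampling evaluation can be sketched as follows for a single dataset; `fit()`, `predict()`, and `select()` are hypothetical wrappers (`select()` returns the indices or mask of identified genes), and the median OOI reported in Table 3 is then taken over the genes identified on the whole, unpartitioned data.

```python
import numpy as np

def evaluate_stability(X, Y, fit, predict, select, n_splits=100, seed=0):
    """100 random 3:1 train/test splits: per-split RMSE and per-gene
    observed occurrence index (selection frequency across splits)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    rmses, occurrences = [], np.zeros(p)
    for _ in range(n_splits):
        perm = rng.permutation(n)
        n_tr = int(0.75 * n)                   # 3:1 train/test split
        tr, te = perm[:n_tr], perm[n_tr:]
        model = fit(X[tr], Y[tr])
        pred = predict(model, X[te])
        rmses.append(np.sqrt(np.mean((Y[te] - pred) ** 2)))
        occurrences[select(model)] += 1.0      # mask/indices of genes
    return np.mean(rmses), occurrences / n_splits   # (RMSE, OOI per gene)
```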
4.2 ∣. Analysis of lung cancer data
We collect two lung cancer datasets from TCGA, on Lung Adenocarcinoma (LUAD) and Lung Squamous Cell Carcinoma (LUSC), with sample sizes equal to 142 and 89, respectively. As LUAD and LUSC are two subtypes of non-small-cell lung cancer, it is sensible to expect some common ground between them, making integrative analysis reasonable. On the other hand, significant differences between the two subtypes have been well noted in the literature. For the response variable, we consider FEV1, which measures how much air a person can exhale during a forced breath and is an important marker of lung capacity. For predictors, we again consider gene expressions. The analysis goal is to identify the first direction vectors that best link gene expressions with FEV1. Similar to the above analysis, we follow the literature26 and focus on 474 genes from 26 network communities, which have been suggested as potentially related to FEV1.
Analysis is conducted in the same manner as above. Representative results are shown in Figure 2, and summary comparison results are provided in Table 2, both of which show significant differences across methods. More detailed estimation and selection results are available from the authors, which again suggest that the results fit the methods' designs. An ad hoc comparison suggests that the similarity in gene identification, as partly reflected in Table 2, is higher than that for the SKCM data. In the evaluation presented in Table 3, iSPLS-HomoS and iSPLS-HeteroS are observed to have similar prediction performance, while iSPLS-HomoS has slightly higher stability. Such results suggest that the two subtypes are qualitatively but not quantitatively similar.
FIGURE 2.
Analysis of lung cancer data. Rhombus and cross correspond to iSPLS-Homo and iSPLS-Hetero, respectively. Blue and orange correspond to magnitude- and sign-based penalties, respectively. Pink cross and red circle correspond to meta-SPLS and pooled-SPLS, respectively
4.3 ∣. Remarks
Practical data are usually more complicated and have weaker signals than simulated data, leading to more prominent differences between methods. This has also been observed in integrative analysis under other contexts. For both SKCM and lung cancer, there is extensive literature on the clinical differences between stages and subtypes; however, differences in the genes regulating phenotypes have not been well studied. As such, although the three SKCM stages may seem clinically “more similar” than the two lung cancer subtypes, it is unclear whether genes' effects on the specific phenotype should also be “more similar.” Also, with a lack of related research, it is unclear, for example, whether iSPLS-HeteroS, which is statistically superior, fits the underlying biology of SKCM. For the lung cancer data, we also note that the prediction and stability measures may not be sufficient for a definitive recommendation of the optimal method. As with most statistical analyses, additional analysis/information will be needed.
5 ∣. DISCUSSION
The PLS/SPLS technique has been established as highly useful for the analysis of data with multi- and high-dimensional variables. In this study, we have advanced it by conducting the integrative analysis of multiple datasets with comparable designs. “Marrying” PLS/SPLS with integrative analysis may seem conceptually simple, but this is the first time it has been pursued. Advancing from the “standard” integrative analysis, we have also introduced the magnitude- and sign-based contrasted penalizations to further accommodate the interconnections among datasets. Effective computational algorithms have been developed, and simulation demonstrates satisfactory performance of the proposed approaches. In the analysis of TCGA data, findings different from those of the alternatives have been made, with improved prediction and satisfactory stability.
The four proposed methods have been designed for different data settings. Our simulation shows that the performance of the methods is data-dependent, and all are needed to cope with the various settings encountered in practice. In practical data analysis, although carefully examining the data may provide some suggestions on the similarity of sparsity structures and magnitudes of estimates, it is impossible to draw definitive conclusions. As in our data analysis, we recommend a “trial and error” approach: for a specific collection of datasets, all four approaches are applied, and examining the estimation, prediction, and stability results may provide some hints as to which approach is more appropriate. In addition, we can examine the estimates under the “loosest” approach (HeteroS): if similar magnitudes are observed, the magnitude-based penalty can be applied; and if the sparsity structures are similar, the homogeneity model can be applied. In a few published integrative analysis studies, theoretical investigation has been conducted. However, our preliminary examination suggests that the theoretical aspects of SPLS differ significantly from regression, PCA, and many other techniques, and establishing the theoretical properties of iSPLS may demand new and foundational theories. This is postponed to future research.
ACKNOWLEDGEMENTS
The authors thank the editor and reviewer for their careful review and insightful comments, which have led to a significant improvement of this article. This study was supported by the National Natural Science Foundation of China (11971404, 71988101), 111 Project (B13028), Fundamental Research Funds for the Central Universities (20720181003), National Institutes of Health (CA121974, CA196530), and a Yale Cancer Center Pilot Award.
Funding information
111 Project, Grant/Award Number: B13028; Fundamental Research Funds for the Central Universities, Grant/Award Number: 20720181003; National Institutes of Health, Grant/Award Number: CA121974, CA196530; National Natural Science Foundation of China, Grant/Award Number: 11971404, 71988101
Appendix
APPENDIX A. ALGORITHM
Computational algorithm for iSPLS-HomoM
The overall strategy is similar to that described in the main text. The key difference lies in Step 2(b), where we solve for $c^{(l)}$ with $\{\hat{w}^{(l)}\}$ fixed. Consider the objective function:
$$\min_{\{c^{(l)}\}}\ \sum_{l=1}^{L} \frac{1}{n_l^2} \Big\{ (1-\kappa)\big(c^{(l)} - \hat{w}^{(l)}\big)^\top Z^{(l)} Z^{(l)\top} \big(c^{(l)} - \hat{w}^{(l)}\big) + \lambda \|c^{(l)}\|_2^2 \Big\} + \sum_{j=1}^{p} \rho\big(\|c_j\|_2; \mu_1, a\big) + \mu_2 \sum_{j=1}^{p} \sum_{1 \le l < l' \le L} \big(c_j^{(l)} - c_j^{(l')}\big)^2. \tag{A1}$$
For $j = 1, \ldots, p$, given the other group parameter vectors fixed at their current estimates, we minimize objective function (A1) with respect to $c_j$. Similar to Section 2.3.1 (with $\lambda = \infty$ and a first-order Taylor expansion of the group MCP with respect to $\|c_j\|_2$), this is equivalent to minimizing an approximated objective, denoted as (A2).
TABLE A1.
Simulation results for Scenario 2
| ρ | nl | Method | MSPE | Sensitivity | Specificity |
|---|---|---|---|---|---|
| 0.2 | 40 | Meta-PLS | 87.769 (14.532) | 1 (0) | 0 (0) |
| | | Meta-SPLS | 31.173 (8.422) | 0.532 (0.074) | 0.919 (0.073) |
| | | Pooled-SPLS | 33.533 (4.519) | 0.883 (0.095) | 0.976 (0.042) |
| | | iSPLS-HomoM | 17.567 (5.086) | 0.993 (0.025) | 0.681 (0.084) |
| | | iSPLS-HomoS | 16.881 (4.548) | 0.993 (0.025) | 0.681 (0.084) |
| | | iSPLS-HeteroM | 28.803 (6.574) | 0.756 (0.122) | 0.774 (0.057) |
| | | iSPLS-HeteroS | 25.990 (5.446) | 0.819 (0.102) | 0.739 (0.063) |
| 0.2 | 120 | Meta-PLS | 85.138 (4.172) | 1 (0) | 0 (0) |
| | | Meta-SPLS | 9.015 (1.283) | 0.672 (0.054) | 0.994 (0.005) |
| | | Pooled-SPLS | 27.068 (0.867) | 0.993 (0.076) | 1.000 (0.003) |
| | | iSPLS-HomoM | 3.673 (0.552) | 1.000 (0.025) | 0.983 (0.024) |
| | | iSPLS-HomoS | 3.589 (0.649) | 1.000 (0.018) | 0.982 (0.043) |
| | | iSPLS-HeteroM | 6.050 (0.555) | 0.898 (0.040) | 0.956 (0.024) |
| | | iSPLS-HeteroS | 6.674 (0.776) | 0.949 (0.030) | 0.939 (0.032) |
| 0.7 | 40 | Meta-PLS | 192.366 (10.990) | 1 (0) | 0 (0) |
| | | Meta-SPLS | 28.179 (5.592) | 0.652 (0.078) | 0.981 (0.015) |
| | | Pooled-SPLS | 65.284 (5.221) | 0.970 (0.101) | 0.963 (0.018) |
| | | iSPLS-HomoM | 10.186 (4.096) | 0.997 (0.055) | 0.948 (0.023) |
| | | iSPLS-HomoS | 9.909 (4.031) | 0.997 (0.055) | 0.947 (0.022) |
| | | iSPLS-HeteroM | 23.300 (9.108) | 0.741 (0.096) | 0.953 (0.017) |
| | | iSPLS-HeteroS | 22.974 (9.806) | 0.765 (0.095) | 0.950 (0.019) |
| 0.7 | 120 | Meta-PLS | 175.348 (8.390) | 1 (0) | 0 (0) |
| | | Meta-SPLS | 14.871 (1.943) | 0.745 (0.041) | 0.975 (0.010) |
| | | Pooled-SPLS | 61.626 (0.758) | 0.963 (0.059) | 0.986 (0.008) |
| | | iSPLS-HomoM | 5.923 (0.913) | 0.997 (0.035) | 0.972 (0.012) |
| | | iSPLS-HomoS | 5.764 (0.913) | 0.997 (0.035) | 0.971 (0.013) |
| | | iSPLS-HeteroM | 10.742 (1.252) | 0.911 (0.016) | 0.917 (0.012) |
| | | iSPLS-HeteroS | 9.354 (1.267) | 0.946 (0.009) | 0.912 (0.011) |
Note: In each cell, mean (SD).
It can be shown that the minimizer of (A2) has a closed form: a groupwise shrinkage (thresholding) of the unpenalized coordinate-wise minimizer, with the amount of shrinkage determined by the group MCP derivative $\dot\rho(\|\hat{c}_j\|_2; \mu_1, a)$ and the magnitude-based contrasted penalty.
Computational algorithm for iSPLS-HeteroM
Consider the optimization problem:
$$\min_{\{c^{(l)}\}}\ \sum_{l=1}^{L} \frac{1}{n_l^2} \Big\{ (1-\kappa)\big(c^{(l)} - \hat{w}^{(l)}\big)^\top Z^{(l)} Z^{(l)\top} \big(c^{(l)} - \hat{w}^{(l)}\big) + \lambda \|c^{(l)}\|_2^2 \Big\} + \sum_{j=1}^{p} \rho\Big(\sum_{l=1}^{L} \rho\big(|c_j^{(l)}|; \mu_1, b\big); \mu_1, a\Big) + \mu_2 \sum_{j=1}^{p} \sum_{1 \le l < l' \le L} \big(c_j^{(l)} - c_j^{(l')}\big)^2. \tag{A3}$$
TABLE A2.
Simulation results for Scenario 3
| ρ | nl | Method | MSPE | Sensitivity | Specificity |
|---|---|---|---|---|---|
| 0.2 | 40 | Meta-PLS | 76.919 (11.918) | 1 (0) | 0 (0) |
| | | Meta-SPLS | 35.372 (6.920) | 0.551 (0.118) | 0.900 (0.087) |
| | | Pooled-SPLS | 54.006 (6.920) | 0.675 (0.289) | 0.730 (0.289) |
| | | iSPLS-HomoM | 28.495 (4.416) | 0.900 (0.057) | 0.589 (0.062) |
| | | iSPLS-HomoS | 27.897 (4.231) | 0.900 (0.069) | 0.589 (0.070) |
| | | iSPLS-HeteroM | 23.201 (5.788) | 0.800 (0.137) | 0.847 (0.042) |
| | | iSPLS-HeteroS | 21.616 (5.632) | 0.800 (0.134) | 0.856 (0.039) |
| 0.2 | 120 | Meta-PLS | 84.613 (16.931) | 1 (0) | 0 (0) |
| | | Meta-SPLS | 10.995 (2.382) | 0.696 (0.082) | 0.990 (0.010) |
| | | Pooled-SPLS | 44.243 (3.532) | 0.600 (0.190) | 0.847 (0.125) |
| | | iSPLS-HomoM | 12.445 (1.995) | 0.902 (0.050) | 0.683 (0.049) |
| | | iSPLS-HomoS | 12.471 (1.993) | 0.908 (0.049) | 0.674 (0.058) |
| | | iSPLS-HeteroM | 8.699 (1.768) | 0.882 (0.050) | 0.926 (0.016) |
| | | iSPLS-HeteroS | 8.467 (1.826) | 0.882 (0.049) | 0.931 (0.015) |
| 0.7 | 40 | Meta-PLS | 268.880 (12.323) | 1 (0) | 0 (0) |
| | | Meta-SPLS | 30.928 (7.532) | 0.675 (0.084) | 0.939 (0.022) |
| | | Pooled-SPLS | 84.875 (7.594) | 0.575 (0.152) | 0.909 (0.073) |
| | | iSPLS-HomoM | 40.867 (6.147) | 0.800 (0.084) | 0.700 (0.152) |
| | | iSPLS-HomoS | 39.492 (5.919) | 0.800 (0.080) | 0.700 (0.171) |
| | | iSPLS-HeteroM | 24.637 (6.887) | 0.821 (0.102) | 0.900 (0.051) |
| | | iSPLS-HeteroS | 23.734 (6.373) | 0.825 (0.111) | 0.911 (0.068) |
| 0.7 | 120 | Meta-PLS | 258.583 (8.390) | 1 (0) | 0 (0) |
| | | Meta-SPLS | 12.631 (2.791) | 0.900 (0.062) | 0.971 (0.011) |
| | | Pooled-SPLS | 73.999 (6.112) | 0.800 (0.138) | 0.772 (0.135) |
| | | iSPLS-HomoM | 20.475 (3.493) | 0.998 (0.010) | 0.364 (0.066) |
| | | iSPLS-HomoS | 20.463 (3.445) | 0.998 (0.010) | 0.364 (0.066) |
| | | iSPLS-HeteroM | 10.228 (2.837) | 0.988 (0.019) | 0.895 (0.022) |
| | | iSPLS-HeteroS | 10.113 (2.818) | 0.988 (0.062) | 0.895 (0.011) |
Note: In each cell, mean (SD).
Take the first-order Taylor expansion of the first penalty with respect to $|c_j^{(l)}|$, with the other groups fixed at their current estimates, and treat the second penalty in the same manner as in Section 2.3.1 (with the parameters of the other datasets fixed at their current estimates). The objective function (A3) is then approximately equivalent to minimizing a quadratic function of $c_j$ plus the weighted $L_1$ term $\sum_{l=1}^{L} \xi_j^{(l)} |c_j^{(l)}|$, denoted as (A4), where the weights $\xi_j^{(l)}$ are defined via the chain rule as in Section 2.3.1.
$\hat{c}_j$ can be updated as follows. For $l = 1, \ldots, L$:

1. Initialize $r = 0$ and the starting value $\hat{c}_j^{(l)}(0)$.
2. Update $r = r + 1$, and compute $\hat{c}_j^{(l)}(r)$ as the closed-form coordinate-wise minimizer of (A4) (a soft-thresholding type update, with the quadratic contrasted term evaluated at the current estimates of the other datasets).
3. Repeat Step 2 until convergence. The estimate at convergence is $\hat{c}_j^{(l)}$.

TABLE A3

Simulation results for Scenario 4

| ρ | nl | Method | MSPE | Sensitivity | Specificity |
|---|---|---|---|---|---|
| 0.2 | 40 | Meta-PLS | 203.530 (25.691) | 1 (0) | 0 (0) |
| | | Meta-SPLS | 92.530 (21.452) | 0.580 (0.105) | 0.880 (0.096) |
| | | Pooled-SPLS | 174.432 (14.915) | 0.491 (0.300) | 0.608 (0.288) |
| | | iSPLS-HomoM | 100.245 (14.246) | 0.852 (0.067) | 0.404 (0.084) |
| | | iSPLS-HomoS | 98.013 (14.990) | 0.851 (0.064) | 0.403 (0.073) |
| | | iSPLS-HeteroM | 79.508 (19.775) | 0.633 (0.111) | 0.918 (0.026) |
| | | iSPLS-HeteroS | 81.041 (19.177) | 0.626 (0.106) | 0.920 (0.025) |
| 0.2 | 120 | Meta-PLS | 233.403 (41.459) | 1 (0) | 0 (0) |
| | | Meta-SPLS | 26.850 (5.595) | 0.689 (0.060) | 0.994 (0.067) |
| | | Pooled-SPLS | 155.342 (12.988) | 0.500 (0.098) | 0.717 (0.041) |
| | | iSPLS-HomoM | 42.005 (5.538) | 0.914 (0.047) | 0.496 (0.055) |
| | | iSPLS-HomoS | 41.962 (5.566) | 0.913 (0.047) | 0.498 (0.063) |
| | | iSPLS-HeteroM | 24.120 (4.955) | 0.878 (0.076) | 0.925 (0.025) |
| | | iSPLS-HeteroS | 24.177 (4.865) | 0.88 (0.077) | 0.926 (0.018) |
| 0.7 | 40 | Meta-PLS | 542.745 (91.125) | 1 (0) | 0 (0) |
| | | Meta-SPLS | 69.596 (22.876) | 0.654 (0.089) | 0.974 (0.024) |
| | | Pooled-SPLS | 357.967 (34.464) | 0.401 (0.236) | 0.753 (0.219) |
| | | iSPLS-HomoM | 100.322 (17.976) | 0.937 (0.055) | 0.437 (0.067) |
| | | iSPLS-HomoS | 97.774 (20.713) | 0.937 (0.054) | 0.436 (0.069) |
| | | iSPLS-HeteroM | 66.089 (16.784) | 0.904 (0.083) | 0.776 (0.041) |
| | | iSPLS-HeteroS | 64.131 (16.378) | 0.904 (0.089) | 0.771 (0.024) |
| 0.7 | 120 | Meta-PLS | 636.34 (73.501) | 1 (0) | 0 (0) |
| | | Meta-SPLS | 35.067 (8.960) | 0.872 (0.015) | 0.954 (0.057) |
| | | Pooled-SPLS | 331.250 (9.337) | 0.469 (0.110) | 0.773 (0.075) |
| | | iSPLS-HomoM | 56.381 (11.017) | 0.992 (0.047) | 0.465 (0.018) |
| | | iSPLS-HomoS | 56.234 (11.021) | 0.993 (0.047) | 0.461 (0.019) |
| | | iSPLS-HeteroM | 31.622 (8.855) | 0.943 (0.066) | 0.913 (0.016) |
| | | iSPLS-HeteroS | 30.625 (8.501) | 0.943 (0.063) | 0.911 (0.017) |

Note: In each cell, mean (SD).
Computational algorithm for iSPLS-HomoS
Consider the optimization problem:
$$\min_{\{c^{(l)}\}}\ \sum_{l=1}^{L} \frac{1}{n_l^2} \Big\{ (1-\kappa)\big(c^{(l)} - \hat{w}^{(l)}\big)^\top Z^{(l)} Z^{(l)\top} \big(c^{(l)} - \hat{w}^{(l)}\big) + \lambda \|c^{(l)}\|_2^2 \Big\} + \sum_{j=1}^{p} \rho\big(\|c_j\|_2; \mu_1, a\big) + \mu_2 \sum_{j=1}^{p} \sum_{1 \le l < l' \le L} \left(\frac{c_j^{(l)}}{\sqrt{(c_j^{(l)})^2 + \tau^2}} - \frac{c_j^{(l')}}{\sqrt{(c_j^{(l')})^2 + \tau^2}}\right)^2. \tag{A5}$$
For $j = 1, \ldots, p$, following the same procedure as in Section 2.3.1, we obtain an approximated minimization problem, denoted as (A6). It can be shown that the minimizer of (A6) is a coordinate-wise shrinkage estimate, denoted as (A7), in which the shrinkage combines the group MCP derivative and the smoothed sign-based contrast evaluated at the current estimates.
$\hat{c}_j$ can be updated as follows. For $l = 1, \ldots, L$:

1. Initialize $r = 0$ and the starting value $\hat{c}_j^{(l)}(0)$.
2. Update $r = r + 1$, and compute $\hat{c}_j^{(l)}(r)$ according to (A7).
3. Repeat Step 2 until convergence. The estimate at convergence is $\hat{c}_j^{(l)}$.
Footnotes
CONFLICT OF INTEREST
The authors declare no conflict of interest.
DATA AVAILABILITY STATEMENT
Data analyzed in this study are publicly available from the TCGA website.
REFERENCES
- 1. Wold S, Ruhe A, Wold H, Dunn WJ. The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM J Sci Stat Comput. 1984;5:735–743.
- 2. Sjöström M, Wold S, Lindberg W, Persson J-Å, Martens H. A multivariate calibration problem in analytical chemistry solved by partial least squares models in latent variables. Anal Chim Acta. 1983;150:61–70.
- 3. Chun H, Keleş S. Expression quantitative trait loci mapping with multivariate sparse partial least squares regression. Genetics. 2009;182:79–90.
- 4. Fan J, Lv J. A selective overview of variable selection in high dimensional feature space. Stat Sin. 2010;20:101–148.
- 5. Chun H, Keleş S. Sparse partial least squares regression for simultaneous dimension reduction and variable selection. J R Stat Soc Ser B (Stat Methodol). 2010;72:3–25.
- 6. Guerra R, Goldstein DR. Meta-Analysis and Combining Information in Genetics and Genomics. Boca Raton, FL: CRC Press; 2009.
- 7. Liu J, Huang J, Zhang Y, et al. Integrative analysis of prognosis data on multiple cancer subtypes. Biometrics. 2014;70:480–488.
- 8. Ma S, Huang J, Song X. Integrative analysis and variable selection with multiple high-dimensional data sets. Biostatistics. 2011;12:763–775.
- 9. Grützmann R, Boriss H, Ammerpohl O, et al. Meta-analysis of microarray data on pancreatic cancer defines a set of commonly dysregulated genes. Oncogene. 2005;24:5079–5088.
- 10. Zhao Q, Shi X, Huang J, Liu J, Li Y, Ma S. Integrative analysis of ‘-omics’ data using penalty functions. Wiley Interdiscip Rev Comput Stat. 2015;7:99–108.
- 11. De Jong S. SIMPLS: an alternative approach to partial least squares regression. Chemom Intell Lab Syst. 1993;18:251–263.
- 12. Ter Braak CJF, de Jong S. The objective function of partial least squares regression. J Chemom. 1998;12:41–54.
- 13. Jolliffe IT, Trendafilov NT, Uddin M. A modified principal component technique based on the LASSO. J Comput Graph Stat. 2003;12:531–547.
- 14. Fang K, Fan X, Zhang Q, Ma S. Integrative sparse principal component analysis. J Multivar Anal. 2018;166:1–16.
- 15. Huang Y, Huang J, Shia B-C, Ma S. Identification of cancer genomic markers via integrative sparse boosting. Biostatistics. 2012;13:509–522.
- 16. Wu C, Zhou F, Ren J, Li X, Jiang Y, Ma S. A selective review of multi-level omics data integration using variable selection. High Throughput. 2019;8:4.
- 17. Shi X, Liu J, Huang J, Zhou Y, Shia BC, Ma S. Integrative analysis of high-throughput cancer studies with contrasted penalization. Genet Epidemiol. 2014;38:144–151.
- 18. Zhang CH. Nearly unbiased variable selection under minimax concave penalty. Ann Stat. 2010;38:894–942.
- 19. Huang J, Breheny P, Ma S. A selective review of group selection in high-dimensional models. Stat Sci. 2012;27:481–499.
- 20. Chiquet J, Grandvalet Y, Ambroise C. Inferring multiple graphical structures. Stat Comput. 2011;21:537–553.
- 21. Wang F, Wang L, Song PX-K. Fused lasso with the adaptation of parameter ordering in combining multiple studies with repeated measurements. Biometrics. 2016;72:1184–1193.
- 22. Friedman J, Hastie T, Tibshirani R. Regularization paths for generalized linear models via coordinate descent. J Stat Softw. 2010;33:1–22.
- 23. Mazumder R, Friedman J, Hastie T. SparseNet: coordinate descent with nonconvex penalties. J Am Stat Assoc. 2011;106:1125–1138.
- 24. Breheny P, Huang J. Penalized methods for bi-level variable selection. Stat Interface. 2009;2:369–380.
- 25. Dicker L, Huang B, Lin X. Variable selection and estimation with the seamless-L0 penalty. Stat Sin. 2013;23:929–962.
- 26. Sun Y, Jiang Y, Li Y, Ma S. Identification of cancer omics commonality and difference via community fusion. Stat Med. 2019;38:1200–1212.
- 27. Huang J, Ma S. Variable selection in the accelerated failure time model via the bridge method. Lifetime Data Anal. 2010;16:176–195.