Abstract
In profiling studies, the analysis of a single dataset often leads to unsatisfactory results because of the small sample size. Multi-dataset analysis utilizes information from multiple independent datasets and outperforms single-dataset analysis. Among the available multi-dataset analysis methods, integrative analysis methods aggregate and analyze raw data and outperform meta-analysis methods, which analyze multiple datasets separately and then pool summary statistics. In this study, we conduct integrative analysis and marker selection under the heterogeneity structure, which allows different datasets to have overlapping but not necessarily identical sets of markers. Under certain scenarios, it is reasonable to expect some similarity of identified marker sets (or equivalently, similarity of model sparsity structures) across multiple datasets. However, the existing methods do not have a mechanism to explicitly promote such similarity. To tackle this problem, we develop a sparse boosting method. This method uses a BIC/HDBIC criterion to select weak learners in boosting and encourages sparsity. A new penalty is introduced to promote the similarity of model sparsity structures across datasets. The proposed method has an intuitive formulation and is broadly applicable and computationally affordable. In numerical studies, we analyze right censored survival data under the AFT (accelerated failure time) model. Simulation shows that the proposed method outperforms alternative boosting and penalization methods with more accurate marker identification. The analysis of three breast cancer prognosis datasets shows that the proposed method can identify marker sets with increased similarity across datasets and improved prediction performance.
Keywords: Integrative analysis, Model sparsity structure, Heterogeneity structure, Sparse boosting, Marker identification
1. Introduction
Profiling studies have been extensively conducted in the search for genetic markers associated with disease outcomes and phenotypes such as risk, progression, and response to treatment. Data generated in such studies have the “large d, small n” characteristic, with the number of covariates (for example, profiled gene expressions) d much larger than the sample size n. Results generated from the analysis of a single dataset are often unsatisfactory [1]. Many factors contribute to the unsatisfactory results, with the most important one likely being the small sample size. Fortunately, for many diseases there are multiple datasets from independent studies with comparable designs. Multi-dataset analysis combines information across datasets, increases the effective sample size, and can outperform single-dataset analysis. Multi-dataset analysis methods include meta-analysis [2, 3] and integrative analysis methods. In “classic” meta-analysis, multiple datasets are first analyzed separately, and then summary statistics are pooled across datasets. In contrast, in integrative analysis, the raw data from multiple datasets are pooled and analyzed. Recent studies have shown that integrative analysis outperforms meta-analysis with more accurate marker identification [4, 5].
Consider the integrative analysis of M independent datasets with the same type of response variable. In dataset m(= 1, …, M), denote Ym as the response variable, and Xm as the length-d vector of covariates (for example, gene expressions or SNPs). For simplicity of notation, it is assumed that the same set of covariates is measured in all M datasets. In whole-genome studies, the sets of covariates measured in different datasets are usually very similar. The rescaling approach [5] can easily accommodate covariates measured in some but not all datasets. As one of the main goals of multi-dataset analysis is to find the similarity/difference across datasets, integrative analysis may not be sensible if different datasets measure significantly different sets of covariates. In dataset m, assume nm i.i.d. observations. Assume that Ym ∼ ϕ(Xm′βm), where the form of the model ϕ is known, and βm is the length-d vector of unknown regression coefficients. Denote the jth component of βm as βm,j. Our goal is to identify markers associated with the response variables or, equivalently, to determine which βm,j’s are nonzero.
The genetic basis of the M datasets, as measured by the identified marker sets, can be described using the homogeneity structure or the heterogeneity structure [5]. Under the homogeneity structure, the same set of markers is identified for all datasets, and so the M models have the same sparsity structure. That is, I(βm,j = 0) = I(βk,j = 0) for all m, k = 1, …, M and j = 1, …, d. The heterogeneity structure differs from the homogeneity one by allowing the M models to have possibly different sparsity structures. Here, it is possible that I(βm,j = 0) ≠ I(βk,j = 0) for some (j, m, k)’s. The heterogeneity structure includes the homogeneity structure as a special case and is more flexible [5, 6].
In this study, we conduct integrative analysis under the heterogeneity structure. Although multiple datasets are allowed to have different sets of markers, as the basis of integrating multiple datasets, it is reasonable to expect that they share some common markers. Further, under certain scenarios, it is of interest to promote the similarity of model sparsity structures across datasets. As the first example, consider multiple independent datasets generated under similar protocols [7]. Because of the experimental differences, the homogeneity structure, which requires the same model sparsity structure across datasets, can be too restrictive. However, as multiple datasets measure the same response variable and the same set of covariates, it is reasonable to expect, and hence to encourage, multiple datasets to have similar marker sets. The second example is the analysis of data on different response variables. For example, in the study conducted by Liu and others [5], each dataset is on the risk of a different cancer type. Despite great differences across cancer types, multiple genes and pathways have been identified as associated with a large number of cancers [8]. Compared with cancer type-specific markers, those shared by multiple cancers are more likely to define the fundamental characteristics of cancer. Thus, in multi-cancer analysis it is also of interest to encourage markers to be identified in multiple datasets. The existing methods do not have a mechanism that explicitly promotes the similarity of model sparsity structures across datasets.
In integrative analysis under the heterogeneity model, we adopt sparse boosting for marker selection and estimation. Sparse boosting, first developed by Buhlmann and Yu [9] and others, is a family of methods especially suitable for high-dimensional data and sparse models. This study differs from the existing sparse boosting studies [4, 9, 10] by conducting the integrative analysis of multiple datasets and by assuming the heterogeneity structure. The most significant advancement is the introduction of a new penalty in the boosting algorithm, which explicitly promotes the similarity of model sparsity structures across datasets. This penalty has a simple form and an intuitive interpretation.
2. Integrative Analysis and Marker Selection using Sparse Boosting
For dataset m(= 1, …, M), denote Rm(βm) as the loss function. The most fundamental requirement on the loss is that it leads to a consistent estimate under the “classic” setting with nm ≫ d. The most common choice is the negative log-likelihood function. For models such as the logistic, an intercept term is needed beyond βm. We omit the intercept term, as it is not subject to selection and is easy to accommodate.
2.1. Sparse boosting a single dataset
As described above, with high-dimensional covariates, we focus on linear covariate effects. Under this setting, boosting assembles a set of individual covariates (weak learners) into a comprehensive model (a strong learner, e.g., an effective linear combination of covariates). Advantages of boosting include its simple and intuitive form, broad applicability, affordable computational cost, and satisfactory numerical performance. We refer to Buhlmann and Hothorn [11] and others for comprehensive reviews.
With ordinary boosting, marker selection is achieved via early stopping. However, Buhlmann and Yu [9] and several other studies find that the ordinary boosting results may not be “sparse enough”. That is, too many covariates may be identified as associated with the response. Sparse boosting has been developed to tackle this problem. For completeness, we first present a version of sparse boosting based on Buhlmann and Yu [9] for the analysis of a single dataset, say dataset m.
Algorithm 0: Sparse boosting a single dataset
Step 1: Initialization. k = 0. Denote β̂m^(k) as the estimate of βm in the kth iteration, and its jth component as β̂m,j^(k). Initialize β̂m,j^(0) = 0 for j = 1, …, d. With each component of Xm being a weak learner, the strong learner is Xm′β̂m^(k).
Step 2: Fit and update. k = k + 1.
Compute (ŝ, γ̂) = argmin_{s,γ} pen(β̂m^(k−1) + γ1s), where 1s is the length-d vector with the sth component equal to 1 and all others equal to 0, and pen(·) is the penalty function on model complexity (more details are provided below).
Update β̂m,ŝ^(k) = β̂m,ŝ^(k−1) + νγ̂ and β̂m,j^(k) = β̂m,j^(k−1) for j ≠ ŝ, where ν is the step size. It has been suggested that the choice of ν is not critical as long as it is small [9]. We set ν = 0.1 following published studies.
Step 3: Iteration. Repeat Step 2 for K times. K is a large number.
Step 4: Selection of optimal stopping. At iteration k(= 1, …, K), compute Fm(k) = pen(β̂m^(k)). Select the optimal number of iterations as k̂ = argmin_{1≤k≤K} Fm(k). The final strong learner is Xm′β̂m^(k̂). Covariates corresponding to the nonzero components of β̂m^(k̂) are identified as associated with the response.
Different from ordinary boosting, which uses Rm only, sparse boosting introduces the penalty function pen(·) in selecting the weak learners and the optimal stopping. By penalizing model complexity, it can lead to sparser models. Choices of pen(·) include BIC, AIC, MDL (minimum description length), and others [9]. In the literature there is still a lack of investigation on when one penalty is preferred over the others. For more flexibility, the model complexity penalties used in weak learner selection and in stopping can differ.
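To make Algorithm 0 concrete, below is a minimal R sketch for the linear regression case with least-squares loss, using a BIC-type criterion for weak learner selection and an HDBIC-type criterion for stopping (both defined in Section 2.3). Treating df as the number of nonzero coefficients is our assumption for illustration, not necessarily the authors' exact implementation.

```r
## Minimal sketch of Algorithm 0: componentwise sparse boosting with
## least-squares loss; df is taken as the number of nonzero coefficients.
sparse_boost <- function(X, y, K = 500, nu = 0.1) {
  n <- nrow(X); d <- ncol(X)
  beta <- numeric(d)
  path <- matrix(0, K, d)                 # coefficient path over iterations
  Fm <- numeric(K)                        # stopping criterion F_m(k)
  for (k in 1:K) {
    r <- drop(y - X %*% beta)             # current residuals
    crit <- numeric(d); gam <- numeric(d)
    for (s in 1:d) {                      # candidate update for weak learner s
      gam[s] <- sum(r * X[, s]) / sum(X[, s]^2)
      b <- beta
      b[s] <- b[s] + nu * gam[s]
      rss <- sum((y - X %*% b)^2)
      crit[s] <- log(rss) + sum(b != 0) * log(n) / n   # BIC-type criterion
    }
    s.hat <- which.min(crit)              # select the weak learner
    beta[s.hat] <- beta[s.hat] + nu * gam[s.hat]
    path[k, ] <- beta
    rss.k <- sum((y - X %*% beta)^2)
    Fm[k] <- log(rss.k) + sum(beta != 0) * log(n) * log(d) / n  # HDBIC
  }
  path[which.min(Fm), ]                   # estimate at optimal stopping
}
```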
2.2. Integrative sparse boosting multiple datasets
Consider extending sparse boosting to the integrative analysis of M independent datasets. In integrative analysis under the heterogeneity structure, two-level marker selection is needed [5]: determining whether a covariate is associated with the response at all, and if so, in which datasets. We use notations similar to those of Algorithm 0 and propose the following algorithm.
Algorithm 1: Sparse boosting for integrative analysis
Step 1: Initialization. k = 0. For m = 1, …, M, initialize β̂m,j^(0) = 0 for j = 1, …, d. The strong learner for dataset m is Xm′β̂m^(k).
Step 2: Fit and update. k = k + 1. For m = 1, …, M:
Compute (ŝm, γ̂m) = argmin_{s,γ} pen(β̂m^(k−1) + γ1s).
Update β̂m,ŝm^(k) = β̂m,ŝm^(k−1) + νγ̂m and β̂m,j^(k) = β̂m,j^(k−1) for j ≠ ŝm.
Step 3: Iteration. Repeat Step 2 for K times. K is a large number.
Step 4: Selection of optimal stopping. At iteration k(= 1, …, K), compute F(k) = Σ_{m=1}^{M} pen(β̂m^(k)). Select the optimal number of iterations as k̂ = argmin_{1≤k≤K} F(k). For dataset m, the final strong learner is Xm′β̂m^(k̂). Covariates corresponding to the nonzero components of β̂m^(k̂) are identified as associated with the response in dataset m.
In integrative analysis, we need to jointly analyze M datasets. In each iteration, we consecutively apply sparse boosting to each dataset. When selecting the weak learners and updating the strong learners, multiple datasets are considered separately. However, the stopping rule is selected by jointly considering the M datasets. Loosely speaking, this amounts to applying a comparable amount of regularization to all datasets, which has been suggested in integrative analysis using the penalization technique [5, 6, 7].
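As an illustration of the joint stopping rule, and assuming each dataset's per-iteration criterion values have been stored (as in the single-dataset sketch above), the common stopping iteration could be chosen as follows; the matrix layout is our assumption:

```r
## Joint stopping in Algorithm 1: crit is an M x K matrix whose (m, k)
## entry is dataset m's stopping criterion (e.g., HDBIC) at iteration k.
## The shared stopping iteration minimizes the criterion summed over datasets.
joint_stop <- function(crit) which.min(colSums(crit))
```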
2.3. Promoting the similarity of model sparsity structures
Algorithm 1 fully relies on the data to determine how similar the sparsity structures are; there is no mechanism to encourage similarity. Denote β = (β1, …, βM), the d × M matrix of regression coefficients, and β·,j = (β1,j, …, βM,j), the vector of the jth regression coefficients across the M datasets. We propose the following algorithm.
Algorithm 2: Sparse boosting for integrative analysis that promotes the similarity
Step 1: Initialization. The same as in Algorithm 1.
Step 2: Fit and update. k = k + 1. For m = 1, …, M:
Compute (ŝm, γ̂m) = argmin_{s,γ} { pen(β̂m^(k−1) + γ1s) + λ · pens(β̂^(k−1) + γ1m,s) }.
Here pens(β) = Σ_j |‖β·,j‖2|0 / Σ_{m,j} |βm,j|0, |u|0 = 1 if u ≠ 0 and = 0 otherwise, λ ≥ 0 is a data-dependent tuning parameter, ‖β·,j‖2 is the ℓ2-norm of β·,j, 1m,s is the d × M matrix with the (s, m)th element equal to 1 and all others equal to 0, and β̂^(k−1) is the d × M matrix of current estimates.
Update β̂m,ŝm^(k) = β̂m,ŝm^(k−1) + νγ̂m and β̂m,j^(k) = β̂m,j^(k−1) for j ≠ ŝm.
Step 3: Iteration. Repeat Step 2 for K times. K is a large number.
Step 4: Selection of optimal stopping. At iteration k(= 1, …, K), compute F(k) = Σ_{m=1}^{M} pen(β̂m^(k)), optionally plus λ · pens(β̂^(k)) (the New+ variant described below). Select the optimal number of iterations as k̂ = argmin_{1≤k≤K} F(k).
Advancing from the existing studies, we propose the pens penalty to encourage similarity. In pens(β) = Σ_j |‖β·,j‖2|0 / Σ_{m,j} |βm,j|0, the numerator Σ_j |‖β·,j‖2|0 counts how many unique covariates are selected, and the denominator Σ_{m,j} |βm,j|0 counts the total number of covariate selections across the M datasets. pens is closely related to the Jaccard index of similarity [12] and takes value in [1/M, 1]. It is minimized if the M datasets identify the same set of covariates, and is maximized if no covariate is identified in more than one dataset. Thus, it has the capability of promoting similarity. λ determines the degree of regularization. When λ = 0, the proposed method reduces to Algorithm 1. On the other hand, when λ = ∞, the proposed method enforces that the same set of covariates is selected in all datasets, i.e., the homogeneity structure, which is an extreme case of the heterogeneity structure. In the first step of boosting, both sums equal zero; to ensure that the boosting is not “trapped”, we take 0/0 = 1.
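For concreteness, here is a short R sketch of the pens computation, under the unique-over-total form given above (our reading of the definition) and with the 0/0 = 1 convention:

```r
## pen_s for a d x M coefficient matrix B (rows = covariates, columns =
## datasets): number of unique selected covariates divided by the total
## number of selections across datasets.
pen_s <- function(B) {
  total <- sum(B != 0)                 # total selections over all datasets
  if (total == 0) return(1)            # 0/0 = 1 by convention (first step)
  uniq <- sum(rowSums(B != 0) > 0)     # covariates selected in >= 1 dataset
  uniq / total                         # in [1/M, 1]; smaller = more similar
}
## In the selection step of Algorithm 2, the candidate criterion is then,
## e.g., pen(candidate for dataset m) + lambda * pen_s(candidate matrix).
```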
With the proposed method, λ needs to be determined data-dependently. In addition, we need to specify a proper pen function for selecting weak learners and a possibly different pen function for stopping. In the sparse boosting literature [9], multiple pen functions have been suggested. However, there is a lack of study showing when one may be better than the others [13]. In this study, we adopt the BIC-based approaches because of their simple forms, broad applicability, and satisfactory numerical performance.
As a specific example, consider a dataset with sample size n under a linear regression model. With a specific strong learner, denote the residual sum of squares as RSS and the degrees of freedom as df. We adopt the duo (selection, stopping) = (BIC, HDBIC), where the BIC criterion is log(RSS) + df × log(n)/n, and the HDBIC criterion is log(RSS) + df × log(n) log(d)/n [14]. Adopting the BIC criterion for selecting weak learners has been motivated by published studies [13]. The HDBIC criterion imposes a heavier penalty than BIC and can generate sparser models. In the literature, there are other BIC-type criteria [15]. We adopt the proposed combination because of its satisfactory performance. It is beyond the scope of this article to comprehensively compare different model complexity criteria.
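The two criteria translate directly into code; a small helper pair (the function names are ours):

```r
## BIC/HDBIC criteria for a fit with residual sum of squares rss and
## degrees of freedom df, on a dataset with n samples and d covariates.
bic   <- function(rss, df, n)    log(rss) + df * log(n) / n
hdbic <- function(rss, df, n, d) log(rss) + df * log(n) * log(d) / n
```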
We have developed an R program implementing the proposed approach. To illustrate its usage, we have also provided a demo with sample survival datasets. The code and demo are publicly available at https://github.com/shuanggema/IntSpBoost. The computing time can potentially be reduced by adopting parallel computing.
To further examine the working characteristics of the proposed method, we simulate one replicate with three independent datasets. More details on the simulation settings are provided in Table 1 and the next section. We analyze the simulated data using four different methods. The first two do not involve pens and serve as references. More specifically, the first method (Alt.1) is ordinary boosting and uses the Rm(·)’s as the criterion for selection and HDBIC for stopping. The second (Alt.2) is a sparse boosting method and uses BIC for selection and HDBIC for stopping. There are two versions of the proposed method: one uses BIC+pens for selection and HDBIC for stopping (New), and the other uses BIC+pens for selection and HDBIC+pens for stopping (New+). Table 1 shows that, for this specific replicate, the methods New and New+ yield identical estimates (which is not always true). They outperform Alt.1 and Alt.2 by identifying fewer false positives. Alt.2 outperforms Alt.1 by using BIC in selecting weak learners.
Table 1.
Analysis of one replicate with nm = 100, d = 100, ρ = 0.2, and half-overlapping marker sets. All nonzero regression coefficients = 1. For the (selection, stopping) duo: Alt.1 = (ℓ2, HDBIC), Alt.2 = (BIC, HDBIC), New = (BIC+pens, HDBIC), and New+ = (BIC+pens, HDBIC+pens). Estimated coefficient in each cell.
cov | Dataset 1 | Dataset 2 | Dataset 3 | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
Alt.1 | Alt.2 | New | New+ | Alt.1 | Alt.2 | New | New+ | Alt.1 | Alt.2 | New | New+ | |
1 | 0.898 | 0.992 | 0.961 | 0.961 | 0.667 | 0.692 | 0.658 | 0.658 | 1.144 | 1.206 | 1.192 | 1.192 |
2 | 0.968 | 1.061 | 1.031 | 1.031 | 1.011 | 1.082 | 1.069 | 1.069 | 0.681 | 0.751 | 0.736 | 0.736 |
3 | 0.798 | 0.774 | 0.774 | 0.774 | 0.635 | 0.758 | 0.717 | 0.717 | 0.858 | 0.938 | 0.903 | 0.903 |
4 | 0.885 | 0.984 | 0.958 | 0.958 | ||||||||
5 | 0.880 | 0.937 | 0.921 | 0.921 | ||||||||
6 | 0.875 | 1.028 | 0.984 | 0.984 | ||||||||
7 | 0.760 | 0.883 | 0.857 | 0.857 | ||||||||
8 | 1.075 | 1.072 | 1.072 | 1.072 | ||||||||
9 | 1.018 | 1.145 | 1.103 | 1.103 | ||||||||
10 | 0.821 | 0.937 | 0.906 | 0.906 | ||||||||
11 | 0.637 | 0.691 | 0.656 | 0.656 | ||||||||
12 | 0.741 | 0.852 | 0.808 | 0.808 | ||||||||
19 | −0.053 | |||||||||||
31 | −0.058 | −0.024 | ||||||||||
40 | −0.030 | |||||||||||
58 | −0.024 | |||||||||||
77 | −0.050 | −0.023 | ||||||||||
92 | −0.022 | −0.022 |
2.4. Potential extensions
Careful examination of the proposed algorithm suggests that the newly added penalty (for promoting similarity of model sparsity structures) is relatively independent of the boosting loss function. Boosting analysis of censored survival data is not limited to the AFT model. In the Appendix, we also describe applying the proposed strategy to the Cox model, which is more popular than the AFT model. With high-dimensional data, the disadvantage of the Cox model is its higher computational cost (compared to the AFT model). Another possible extension is to accommodate nonlinear covariate effects using trees as weak learners. In the literature, boosting survival trees has been investigated in multiple studies. In the Appendix, we describe “coupling” boosting survival trees with the new penalty for promoting similarity in sparsity structure. In the following numerical study, we mostly focus on the AFT model. In addition, in simulation, we also consider using trees as weak learners (results in the Appendix); and in data analysis, we also consider analyzing under the Cox model (results in the Appendix).
3. Numerical Study
The proposed method is potentially applicable to a large number of data and model settings and R(β) functions. As a specific example, we consider right censored survival data under the AFT (accelerated failure time) model. Details on the data settings and estimation procedure are described in the Appendix.
3.1. Simulation
We simulate M = 3 independent datasets. In each dataset, the sample size is nm = 100. Mimicking gene expression data, we simulate d = 100, 500, and 1,000 continuously distributed covariates from a multivariate normal distribution. The marginal means are equal to 0, and the marginal variances are equal to 1. We consider an auto-regressive correlation structure where covariates j and k have correlation coefficient ρ|j−k|, with ρ = 0.2, 0.5, and 0.8 corresponding to weak, moderate, and strong correlations. In each dataset, there are 6 covariates with nonzero regression coefficients, so the total number of truly important covariates is 18. The nonzero regression coefficients are generated from Unif[0.2, 1] or are all equal to 1, representing two levels of signal. Regarding the overlapping of important covariates, we consider three scenarios: (a) complete overlapping, where all three datasets have the same set of six important covariates; (b) half overlapping, where the three datasets share three common important covariates and each dataset has three dataset-specific important covariates, so that for each dataset the percentage of shared important covariates is 50%; and (c) none overlapping, where no important covariate is shared by any two datasets. In the AFT models, the intercepts are set as 0.5. The random errors are simulated from N(0, σ2) with σ2 = 1 and 3, representing two noise levels. We generate the log censoring times from normal distributions, with the censoring distributions adjusted so that the overall censoring rate is about 33%. A sketch of this data-generating scheme is given below.

To better gauge the proposed method, we also apply alternative analysis methods. We first consider three alternative sparse boosting methods: (a) indiv-SB, which applies sparse boosting (Algorithm 0) to each dataset separately; evaluation (as described below) is first conducted separately and then combined. This method does not account for potential shared information across datasets. (b) pool-SB, which pools the three datasets together and then applies sparse boosting (Algorithm 0). When the datasets are highly similar, this is expected to be the most effective method. (c) alg1-SB, which applies Algorithm 1. This method is the closest to the proposed one; note that it is referred to as “Alt.2” in Table 1. For the analysis of high-dimensional data, a large number of regularization methods have been developed. Here we compare with penalization, which is one of the most popular regularization methods. We consider four penalization methods built on the MCP [16], which has been shown to have superior theoretical and empirical properties: (a) indiv-MCP, which takes an approach similar to indiv-SB and applies MCP to each dataset separately; (b) pool-MCP, which takes an approach similar to pool-SB and applies MCP to the pooled dataset; (c) sgroup-MCP, which applies the sparse group MCP method [6]; this method has been designed to conduct two-level selection and is suitable for the heterogeneity structure; and (d) group-MCPT, which first applies the group MCP [5]; as the group MCP conducts one-level selection and may not be appropriate for the heterogeneity structure, a thresholding step then sets small estimates to zero, with cutoffs selected so that the numbers of identified true positives are similar to those of the proposed method.
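The following R sketch generates one dataset under this design. The censoring-mean offset cens.shift is a placeholder of ours that would be tuned empirically to reach the roughly 33% censoring rate.

```r
## Simulate one dataset: AR(rho) normal covariates, AFT model with normal
## errors, and normally distributed log censoring times.
library(MASS)
sim_one <- function(n = 100, d = 1000, rho = 0.2, sigma2 = 1,
                    active = 1:6, coefs = rep(1, 6), cens.shift = 0.7) {
  Sigma <- rho^abs(outer(1:d, 1:d, "-"))        # AR correlation matrix
  X <- mvrnorm(n, mu = rep(0, d), Sigma = Sigma)
  beta <- numeric(d); beta[active] <- coefs
  logT <- 0.5 + drop(X %*% beta) + rnorm(n, sd = sqrt(sigma2))  # AFT model
  ## cens.shift is an assumed placeholder, tuned so about 33% are censored
  logC <- rnorm(n, mean = mean(logT) + cens.shift, sd = sd(logT))
  list(X = X, y = pmin(logT, logC),             # observed log time
       delta = as.numeric(logT <= logC),        # event indicator
       beta = beta)
}
```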
As expected, the proposed method is computationally more expensive. However, simulation suggests that it is still affordable. For example, under the setting described in Table 2, indiv-SB and alg1-SB have similar cost, with the analysis of one replicate taking about 40 seconds on a regular desktop PC. The pool-SB analysis takes about 160 seconds. In comparison, the proposed method takes about 309 seconds.
For each method and each replicate, we evaluate marker identification performance using TP (number of true positives) and FP (number of false positives). In addition, we evaluate prediction and estimation performance. Specifically, prediction is quantified using PMSE (prediction MSE) which is defined as Σm(β̂m − βm)′Var(Xm)(β̂m − βm)/σ2. Estimation is evaluated using EMSE (estimation MSE) which is defined as Σm(β̂m − βm)′(β̂m − βm).
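These metrics can be computed per dataset and then summed over m as in the definitions above; a sketch, where Sigma is the true covariate covariance used in simulation:

```r
## TP/FP counts and the per-dataset PMSE and EMSE contributions.
metrics <- function(beta.hat, beta, Sigma, sigma2) {
  diff <- beta.hat - beta
  c(TP   = sum(beta.hat != 0 & beta != 0),   # true positives
    FP   = sum(beta.hat != 0 & beta == 0),   # false positives
    PMSE = drop(t(diff) %*% Sigma %*% diff) / sigma2,
    EMSE = sum(diff^2))
}
```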
The summary of identification results for the setting with d = 1,000 and ρ = 0.2 is presented in Figure 1 (σ2 = 1) and Table 2 (σ2 = 1 and 3). The rest of the figures and tables for identification are presented in the Appendix. The estimation and prediction results are summarized separately and also presented in the Appendix. Simulation suggests competitive performance of the proposed method. For example, in Table 2 with σ2 = 1 and the nonzero regression coefficients generated from Unif[0.2, 1], the proposed New+ identifies 15.7 (complete overlapping), 13.9 (half overlapping), and 14.9 (none overlapping) true positives, with a very small number of false positives. Under most of the simulation settings, New+ outperforms New with more true positives and fewer false positives. The indiv-SB method has good performance, although inferior to the proposed method under most of the simulation settings. Its performance is not strongly affected by the overlapping structure, as it analyzes each dataset separately. As expected, under the complete overlapping scenario, the proposed method has advantages: for example, under the scenario described above with complete overlapping, indiv-SB identifies 13.8 true positives (compared to 15.7 for the proposed method). The pool-SB method enforces that all datasets identify the same set of important covariates. It has superior performance under the complete overlapping scenario but poor performance under the half and none overlapping scenarios. The alg1-SB method, which is similar to the proposed method but does not promote similarity across datasets, performs worse than the proposed method. For example, when σ2 = 1 and the nonzero coefficients are all equal to 1, it identifies 4.9, 5.2, and 4.8 false positives, compared to 0.4, 1.4, and 1.2 with the proposed method. The penalization methods have reasonable but in general inferior performance. Under quite a few simulation scenarios, indiv-MCP and group-MCPT identify a relatively large number of false positives (although their performance with true positives is satisfactory). Similar to pool-SB, pool-MCP has good performance under the complete overlapping scenario but behaves poorly under the other two scenarios. The sgroup-MCP method has better performance than the other three penalization methods but is slightly inferior to the proposed method. In general, the proposed method is favored over the alternatives in terms of identification. The proposed method also has favorable estimation and prediction results. However, when factoring in variation, the proposed method and some alternatives may have comparable estimation and prediction performance.
Figure 1.
Plots of mean TP and FP for simulations with d = 1, 000, ρ = 0.2, and σ2 = 1. The circles stand for TP and black squares stand for FP.
Table 2.
Simulation: summary statistics on identification. In each cell, mean (sd). d = 1, 000 and ρ = 0.2.
σ2 = 1 | σ2 = 3 | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||
Complete | Half | None | Complete | Half | None | |||||||
| ||||||||||||
TP | FP | TP | FP | TP | FP | TP | FP | TP | FP | TP | FP | |
nonzero coef ~ unif(0.2,1) | ||||||||||||
New | 15.2 (2.7) | 0.1 (0.3) | 14.0 (1.8) | 1.8 (1.3) | 14.8 (1.7) | 1.9 (1.7) | 7.6 (2.8) | 0.5 (0.7) | 6.9 (3.3) | 0.3 (0.5) | 7.6 (2.4) | 0.8 (1.0) |
New+ | 15.7 (3.2) | 0.0 (0) | 13.9 (2.1) | 1.6 (1.4) | 14.9 (1.7) | 2.0 (1.8) | 9.5 (3.2) | 0.3 (0.6) | 7.6 (2.5) | 1.0 (1.4) | 7.6 (2.4) | 0.8 (1.0) |
alg1-SB | 13.8 (1.8) | 1.8 (1.2) | 14.1 (1.9) | 2.0 (1.5) | 14.1 (2.1) | 2.0 (1.7) | 7.0 (2.4) | 1.0 (1.1) | 6.3 (2.5) | 0.5 (0.8) | 7.2 (2.7) | 0.8 (0.9) |
sgroup-MCP | 15.6 (1.8) | 1.1 (1.7) | 14.5 (1.9) | 1.5 (1.6) | 13.4 (2.1) | 2.5 (2.6) | 11.3 (2.3) | 2.5 (2.8) | 9.8 (2.5) | 2.7 (3.0) | 8.0 (2.4) | 3.3 (4.2) |
group-MCPT | 14.9 (1.8) | 0.2 (0.7) | 13.9 (1.8) | 1.0 (1.8) | 13.6 (2.7) | 57.5 (94.8) | 7.6 (2.1) | 0.2 (0.4) | 6.9 (2.4) | 0.3 (0.7) | 7.6 (3.2) | 4.8 (8.8) |
indiv-SB | 13.2 (1.8) | 1.6 (0.6) | 12.5 (1.9) | 1.5 (0.6) | 13.3 (1.9) | 1.4 (0.6) | 6.5 (1.7) | 0.3 (0.5) | 5.8 (1.6) | 0.2 (0.4) | 7.0 (1.5) | 0.2 (0.4) |
indiv-MCP | 14.8 (1.7) | 18.7 (7.6) | 14.5 (1.8) | 18.3 (8.2) | 14.9 (1.6) | 19.2 (8.6) | 10.8 (2.3) | 19.1 (7.3) | 10.4 (2.1) | 18.3 (7.0) | 10.5 (2.3) | 16.8 (6.1) |
pool-SB | 17.9 (0.6) | 0.0 (0) | 9.7 (1.4) | 3.0 (1.8) | 1.1 (0.5) | 2.3 (0.9) | 15.5 (2.5) | 0.1 (0.6) | 7.1 (2.5) | 1.0 (1.4) | 0.8 (0.4) | 2.2 (0.4) |
pool-MCP | 18.0 (0) | 20.3 (21.4) | 13.2 (1.5) | 35.9 (20.8) | 6.6 (2.0) | 38.7 (21.2) | 17.0 (1.8) | 27.4 (26.5) | 11.0 (2.5) | 31.6 (26.6) | 3.4 (2.3) | 28.5 (25.1) |
| ||||||||||||
nonzero coef = 1 | ||||||||||||
New | 18.0 (0) | 1.4 (1.4) | 18.0 (0) | 4.8 (2.1) | 18.0 (0) | 3.8 (1.9) | 16.3 (2.0) | 0.9 (0.7) | 14.9 (3.1) | 1.3 (0.9) | 16.3 (1.7) | 1.1 (1.1) |
New+ | 18.0 (0) | 0.4 (0.8) | 18.0 (0) | 1.4 (2.0) | 18.0 (0) | 1.2 (2.1) | 17.0 (1.5) | 0.5 (0.6) | 15.2 (3.0) | 1.3 (0.9) | 16.3 (1.7) | 1.1 (1.1) |
alg1-SB | 18.0 (0) | 4.9 (1.7) | 18.0 (0) | 5.2 (1.8) | 18.0 (0) | 4.8 (2.1) | 15.9 (2.1) | 1.0 (1.1) | 14.7 (2.4) | 1.2 (1.1) | 15.3 (2.5) | 1.0 (1.1) |
sgroup-MCP | 18.0 (0) | 0.1 (0.4) | 18.0 (0) | 0.4 (0.8) | 18.0 (0.1) | 0.8 (1.2) | 17.8 (0.6) | 1.5 (1.5) | 17.1 (1.1) | 2.7 (2.5) | 16.5 (1.7) | 5.9 (4.6) |
group-MCPT | 18.0 (0) | 0.0 (0) | 18.0 (0.1) | 0.2 (0.4) | 18.0 (0) | 0.0 (0) | 17.0 (1.0) | 0.0 (0) | 14.9 (1.7) | 0.2 (0.5) | 16.3 (1.7) | 11.9 (10.1) |
indiv-SB | 18.0 (0) | 1.0 (0.2) | 18.0 (0) | 1.1 (0.3) | 18.0 (0) | 1.1 (0.3) | 14.5 (2.0) | 1.3 (0.5) | 13.0 (1.9) | 1.3 (0.5) | 15.1 (1.8) | 1.3 (0.5) |
indiv-MCP | 18.0 (0) | 11.4 (8.6) | 18.0 (0) | 11.9 (7.9) | 18.0 (0) | 11.2 (8.1) | 17.4 (0.9) | 23.3 (8.0) | 17.3 (1.0) | 22.9 (7.3) | 17.5 (0.7) | 22.0 (7.1) |
pool-SB | 18.0 (0) | 0.0 (0) | 11.7 (1.3) | 5.4 (2.5) | 1.0 (0.3) | 2.2 (0.5) | 18.0 (0) | 0.0 (0) | 10.3 (0.9) | 2.8 (1.9) | 0.9 (0.4) | 2.1 (0.4) |
pool-MCP | 18.0 (0) | 3.5 (6.9) | 15.7 (1.5) | 43.7 (22.8) | 9.2 (2.5) | 51.2 (24.7) | 18.0 (0) | 11.4 (13.1) | 13.9 (1.7) | 38.7 (25.1) | 6.5 (3.1) | 35.1 (22.1) |
In the above simulation, different datasets have the same level of complexity (the same number of truly important covariates). We also examine the scenario where different datasets have different complexity structures. Consider the setting with d = 1, 000 and ρ = 0.2, 0.5, and 0.8. As shown in Figure 19 (Appendix), the three datasets have 6, 12, and 18 truly important covariates respectively. We simulate four different overlapping scenarios. Based on the observations described above, we analyze data using the proposed method as well as indiv-SB and alg1-SB. The summary identification results are shown in Table 3. When ρ = 0.5 and 0.8, the proposed method outperforms alg1-SB. Compared to indiv-SB, it has comparable performance in terms of true positives but slightly more false positives. When ρ = 0.2, the proposed method significantly outperforms indiv-SB.
Table 3.
Simulation: summary statistics on identification when the datasets have different complexity structures. In each cell, mean (sd). d = 1, 000.
(a) | (b) | (c) | (d) | |||||
---|---|---|---|---|---|---|---|---|
| ||||||||
TP | FP | TP | FP | TP | FP | TP | FP | |
ρ = 0.2 | ||||||||
New | 26.5(4.3) | 2.0(1.6) | 26.1(4.7) | 2.9(1.9) | 23.4(5.6) | 2.3(2.0) | 24.8(5.8) | 3.0(2.2) |
New+ | 25.7(7.6) | 1.6(1.6) | 26.1(4.7) | 2.9(2.0) | 24.3(4.9) | 2.3(2.0) | 24.4(6.7) | 2.9(2.2) |
alg1-SB | 21.3(7.1) | 2.2(2.3) | 22.9(6.3) | 1.9(1.8) | 23.3(6.7) | 1.9(1.6) | 24.3(7.2) | 2.6(2.4) |
indiv-SB | 15.3(5.2) | 2.9(1.9) | 13.8(4.6) | 2.5(1.6) | 15.1(4.5) | 1.8(1.5) | 16.1(4.3) | 1.9(1.7) |
| ||||||||
ρ = 0.5 | ||||||||
New | 35.2(0.9) | 4.7(2.3) | 34.9(1.2) | 6.9(2.5) | 34.4(1.7) | 6.8(3.1) | 35.1(1.1) | 8.3(3.3) |
New+ | 33.8(1.4) | 1.6(1.8) | 34.5(1.7) | 5.2(2.6) | 34.4(1.7) | 4.8(2.8) | 35.1(1.1) | 4.3(3.3) |
alg1-SB | 35.2(0.9) | 8.3(2.6) | 34.6(1.5) | 8.2(3.1) | 33.8(1.9) | 7.1(3.0) | 35.0(1.0) | 8.2(3.1) |
indiv-SB | 35.8(0.7) | 3.7(1.6) | 34.1(3.9) | 3.7(2.2) | 33.3(5.1) | 2.9(2.0) | 35.6(1.3) | 4.3(1.7) |
| ||||||||
ρ = 0.8 | ||||||||
New | 35.4(0.7) | 6.3(2.0) | 35.2(0.8) | 9.5(2.2) | 35.0(0.8) | 10.1(2.7) | 35.2(1.1) | 9.5(2.1) |
New+ | 35.0(1.0) | 4.5(1.8) | 35.1(1.1) | 4.0(2.6) | 35.2(0.9) | 4.1(1.9) | 35.2(1.1) | 4.7(2.3) |
alg1-SB | 35.3(0.9) | 10.4(2.0) | 35.0(0.8) | 11.2(2.7) | 35.1(0.8) | 10.3(2.7) | 35.2(0.8) | 10.6(2.9) |
indiv-SB | 35.5(0.6) | 2.6(1.3) | 35.6(0.5) | 3.2(1.3) | 35.7(0.6) | 2.8(1.8) | 35.5(0.5) | 2.2(1.4) |
Overall, simulation suggests competitive performance of the proposed method. It is interesting to note that it has satisfactory performance even under the none overlapping scenario. Thus, it provides a “safe” choice for practical data analysis where the degree of overlapping is unknown. The observed improvement over the alternative integrative analysis methods is not dramatic, which is reasonable: the newly added component (which promotes similarity across datasets) accommodates finer, secondary structure of the data, while the “first order” selection still depends on the sparse boosting component described in Algorithm 1.
We have also conducted simulation using trees as weak learners to accommodate nonlinear covariate effects. More details and results are provided in the Appendix. This set of simulations demonstrates the broad applicability of the proposed strategy and, again, the merit of promoting similarity of sparsity structures in integrative analysis.
3.2. Analysis of breast cancer prognosis studies
Worldwide, breast cancer is the most common cause of cancer death among women. An estimated 226,870 new cases of invasive breast cancer and an estimated 39,510 breast cancer deaths were expected to occur among women in the U.S. in 2012. Multiple profiling studies have been conducted, showing that genomic markers may be independently associated with prognosis beyond clinical risk factors and environmental exposures. Notable findings include the 97-gene signature [17], which includes genes UBE2C, PKNA2, TPX2, FOXM1, STK6, CCNA2, BIRC5, MYBL2, and others, and the 70-gene signature [18], which involves the hallmarks of cancer including cell cycle, metastasis, angiogenesis, and invasion.
We collect three gene expression datasets on breast cancer prognosis. The first dataset was initially described in Huang et al. [19]. Affymetrix chips were used to profile 12,625 genes on 71 samples. The second dataset was first described in Sotiriou et al. [20]. cDNA chips were used to profile 7,650 genes on 98 samples. The third dataset was first described in van’t Veer et al. [18]. Oligonucleotide chips were used to profile 24,481 genes on 78 samples. We refer to the original publications for more details.
The proposed method has been described for the scenario where the same set of covariates is measured in all datasets. If a covariate is not measured in a dataset, its corresponding coefficient can be set as zero [5], and the proposed method is then directly applicable. Most multi-dataset analyses target finding the similarity/difference across datasets. If a covariate is measured in only one or a few datasets, it can be difficult or impossible to draw across-dataset conclusions about it. In the pangenomic era, the standard platforms conduct whole-genome profiling. In terms of methodology, the similarity of sparsity structures becomes less meaningful as the overlap of measured covariates shrinks. In our data analysis, we focus on the 2,555 genes measured in all three datasets. With practical data, preprocessing is needed. With gene expression data, we first conduct normalization. With Affymetrix data, a floor and a ceiling are added, and then measurements are log2 transformed. For both Affymetrix and cDNA data, there are a small number of missing values, which we fill in using means across samples. We then standardize each gene expression to have zero mean and unit variance. As with some existing integrative analysis methods [4], the proposed method does not require direct comparability of measurements in different datasets; thus, cross-dataset processing is not needed.
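A sketch of this preprocessing pipeline in R follows; the floor/ceiling quantiles are illustrative assumptions, as the text does not specify them.

```r
## Preprocess an expression matrix (rows = samples, columns = genes):
## floor/ceiling, log2 transform, mean imputation, standardization.
preprocess <- function(expr) {
  lo <- quantile(expr, 0.01, na.rm = TRUE)   # assumed floor quantile
  hi <- quantile(expr, 0.99, na.rm = TRUE)   # assumed ceiling quantile
  expr <- log2(pmin(pmax(expr, lo), hi))
  for (j in seq_len(ncol(expr))) {           # fill missing with gene means
    na <- is.na(expr[, j])
    expr[na, j] <- mean(expr[, j], na.rm = TRUE)
  }
  scale(expr)                                # zero mean, unit variance
}
```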
As simulation suggests inferior performance of the penalization methods, we focus the data analysis on the boosting methods. The estimation results using the six boosting methods are shown in Table 4. Different methods identify overlapping but different sets of markers. Even for genes identified by multiple methods (for example, gene BCKDHB in dataset 1), the estimates can differ. We find that introducing the overlapping penalty improves similarity across datasets. Specifically, we calculate the degree of overlap among the identified marker sets (scaled so that it equals 1 when the three sets are identical and 1/3 when they are disjoint) to be 0.33 (Alt.1), 0.33 (alg1-SB), 0.39 (New), 0.67 (New+), and 0.33 (indiv-SB), respectively. Note that under pool-SB, this measure is always equal to 1. A quick literature search suggests that some of the identified genes, for example PPP3CB and BAP1, have important implications in breast cancer progression. However, there is no objective way of determining which set of markers is “more meaningful”.
Table 4.
Analysis of the breast cancer datasets: identified genes and estimates.
Unigene | gene | Alt.1 | alg1-SB | New | New+ | indiv-SB | pool-SB |
---|---|---|---|---|---|---|---|
Dataset 1 | |||||||
Hs.100090 | TSPAN3 | 0.067 | |||||
Hs.101382 | TNFAIP2 | 0.028 | |||||
Hs.10247 | ALCAM | 0.063 | |||||
Hs.153752 | CDC25B | 0.037 | 0.037 | ||||
Hs.106778 | ATP2C1 | 0.059 | 0.059 | 0.111 | 0.100 | ||
Hs.111126 | PTTG1IP | 0.041 | 0.035 | ||||
Hs.115617 | CRHBP | 0.097 | 0.097 | 0.131 | 0.097 | ||
Hs.124029 | INPP5A | 0.045 | 0.072 | ||||
Hs.1265 | BCKDHB | −0.109 | −0.109 | −0.158 | −0.111 | −0.109 | |
Hs.151531 | PPP3CB | −0.051 | −0.096 | −0.043 | −0.191 | ||
Hs.153687 | INPP4B | 0.042 | 0.079 | 0.189 | |||
| |||||||
Dataset 2 | |||||||
Hs.100090 | TSPAN3 | 0.067 | |||||
Hs.101382 | TNFAIP2 | 0.028 | |||||
Hs.10247 | ALCAM | 0.063 | |||||
Hs.101813 | SLC9A3R2 | 0.071 | 0.071 | 0.037 | 0.071 | ||
Hs.102456 | GEMIN2 | 0.126 | 0.126 | 0.128 | 0.157 | ||
Hs.105806 | GNLY | 0.028 | 0.028 | 0.097 | |||
Hs.108332 | UBE2D2 | ||||||
Hs.1265 | BCKDHB | 0.008 | 0.008 | ||||
Hs.1311 | CD1C | 0.163 | 0.202 | 0.167 | 0.202 | ||
| |||||||
Dataset 3 | |||||||
Hs.100090 | TSPAN3 | 0.034 | 0.032 | 0.031 | 0.067 | ||
Hs.101382 | TNFAIP2 | 0.028 | |||||
Hs.10247 | ALCAM | 0.063 | |||||
Hs.100030 | TERF2 | 0.026 | 0.044 | 0.022 | 0.026 | ||
Hs.103081 | RPS6KB2 | 0.026 | 0.025 | ||||
Hs.106674 | BAP1 | −0.117 | −0.116 | −0.077 | −0.086 | ||
Hs.108332 | UBE2D2 | −0.063 | −0.063 | −0.096 | −0.035 | −0.063 | |
Hs.110707 | DCAF8 | −0.026 | −0.026 | −0.025 | |||
Hs.1265 | BCKDHB | −0.006 | −0.006 |
We examine the prediction performance of the different methods, which may provide some insight into the analysis results. We randomly split each dataset into a training set and a testing set with a 3:1 size ratio. Estimates are generated using the training data and used to make predictions for subjects in the testing set. We then dichotomize the testing subjects’ risk scores at the median and create two hypothetical risk sets. We calculate the logrank test statistic, which measures the survival difference between the two sets. To avoid an extreme split, we repeat the above process 100 times and calculate the average logrank statistics as 2.176 (Alt.1), 2.338 (alg1-SB), 4.632 (New), 3.863 (New+), 2.880 (indiv-SB), and 0.544 (pool-SB), respectively. The proposed method leads to improved prediction performance.
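This evaluation scheme can be sketched as follows, using the survival package; fit_method is a hypothetical stand-in for any of the estimation procedures above.

```r
## One random 3:1 split: fit on the training set, score the test set,
## dichotomize at the median, and return the logrank chi-square statistic.
library(survival)
eval_split <- function(X, y, delta, fit_method) {
  train <- sample(nrow(X), floor(0.75 * nrow(X)))
  beta.hat <- fit_method(X[train, ], y[train], delta[train])
  risk <- drop(X[-train, ] %*% beta.hat)       # test-set risk scores
  grp <- risk > median(risk)                   # two hypothetical risk sets
  survdiff(Surv(y[-train], delta[-train]) ~ grp)$chisq
}
## Average the statistic over 100 random splits, as in the text.
```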
With practical data, the “true” model is unknown. We also analyze data under the Cox model, which is a popular choice for survival data. More details are provided in Appendix.
4. Discussion
In the analysis of high-dimensional profiling data, integrative analysis provides an effective way of combining multiple datasets and increasing the effective sample size, and it outperforms single-dataset analysis and classic meta-analysis. In this study, we consider the heterogeneity structure, which is more flexible and more challenging than the homogeneity structure. Sparse boosting is adopted for marker selection. To the best of our knowledge, this is the first study applying sparse boosting under the heterogeneity structure. As described in the Introduction, there are scenarios under which it is desirable to encourage the similarity of sparsity structures across datasets. This study has proposed a new sparse boosting method that explicitly promotes such similarity. The proposed method has an intuitive interpretation and is computationally feasible. In simulation, it shows competitive performance and can be preferred over the alternative boosting and penalization methods. In the analysis of three breast cancer prognosis datasets, the proposed method identifies markers different from those of the alternatives; the identified markers have a higher degree of similarity across datasets and better prediction performance.
In this article, we have focused on methodological development for marker selection. Practical data analysis demands extensive additional considerations. Specifically, different studies may have different research goals and designs. Quality control, such as inclusion and exclusion criteria, has been discussed for meta-analysis [21] and is also needed here. The high dimensionality and other complexities of genomic data make this even more challenging [22]. In addition, it has been observed that prevalence may also affect selection [23]. We refer to the framework of Li and Fine [24] and others for more discussion. Prior to analysis, preprocessing, for example matching covariates across datasets and imputing missing measurements, is needed. The sensitivity of the proposed method to quality control, data selection, and data processing demands attention in practice. Potentially, there are multiple ways of conducting marker selection under the heterogeneity structure. In our simulation, we compare against penalization because of its popularity. We also conjecture that it is possible to couple the proposed similarity penalty with the penalization methods. A limitation of this study is that theoretical properties are not established: the convergence property is unclear, and there is no direct control of the number of selected markers. We do note that in our extensive simulations, convergence is achieved for all datasets, and only a small number of markers are identified. Establishing theoretical properties is extremely difficult even under much simpler settings [9], and the heterogeneity across datasets and the new penalty make theoretical investigation even more challenging. In data analysis, bioinformatics and biological analyses are needed to fully comprehend the results.
Table 5.
Simulation: summary statistics on estimation and prediction. In each cell, mean (sd). d = 100 and ρ = 0.2.
σ2 = 1 | σ2 = 3 | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||
Complete | Half | None | Complete | Half | None | |||||||
| ||||||||||||
PMSE | EMSE | PMSE | EMSE | PMSE | EMSE | PMSE | EMSE | PMSE | EMSE | PMSE | EMSE | |
nonzero coef ~ unif(0.2,1) |
New | 1.12 (1.03) | 0.92 (0.79) | 0.95 (0.24) | 0.82 (0.21) | 0.97 (0.31) | 0.80 (0.25) | 1.21 (0.34) | 2.84 (0.78) | 1.24 (0.27) | 3.00 (0.64) | 1.18 (0.37) | 2.79 (0.81) |
New+ | 1.08 (1.04) | 0.90 (0.80) | 0.96 (0.26) | 0.83 (0.22) | 0.97 (0.31) | 0.80 (0.25) | 1.14 (0.37) | 2.69 (0.86) | 1.30 (0.31) | 3.14 (0.76) | 1.17 (0.37) | 2.77 (0.81) |
alg1-SB | 1.07 (0.45) | 0.87 (0.33) | 1.18 (0.33) | 0.99 (0.28) | 0.92 (0.33) | 0.75 (0.25) | 1.18 (0.29) | 2.79 (0.66) | 1.23 (0.36) | 2.96 (0.85) | 1.21 (0.33) | 2.86 (0.74) |
sgroup-MCP | 0.44 (0.19) | 0.41 (0.18) | 0.59 (0.28) | 0.56 (0.27) | 0.74 (0.31) | 0.69 (0.30) | 0.65 (0.27) | 1.85 (0.81) | 0.77 (0.30) | 2.29 (0.96) | 0.89 (0.31) | 2.66 (1.01) |
group-MCPT | 0.31 (0.15) | 0.33 (0.16) | 0.73 (0.31) | 0.77 (0.31) | 0.92 (0.34) | 0.97 (0.36) | 1.76 (0.77) | 1.80 (0.76) | 2.43 (1.00) | 2.45 (0.94) | 4.03 (1.07) | 4.12 (1.04) |
indiv-SB | 1.13 (0.28) | 1.00 (0.24) | 1.30 (0.45) | 1.15 (0.40) | 1.22 (0.43) | 1.05 (0.36) | 1.14 (0.28) | 2.76 (0.65) | 1.24 (0.40) | 3.05 (0.95) | 1.17 (0.34) | 2.81 (0.76) |
indiv-MCP | 0.62 (0.24) | 0.66 (0.25) | 0.66 (0.24) | 0.69 (0.26) | 0.65 (0.23) | 0.69 (0.25) | 0.78 (0.30) | 2.37 (0.95) | 0.84 (0.31) | 2.50 (0.92) | 0.81 (0.27) | 2.42 (0.82) |
pool-SB | 1.03 (0.28) | 1.01 (0.29) | 4.86 (1.12) | 4.00 (0.88) | 9.22 (1.64) | 6.89 (1.19) | 0.63 (0.15) | 1.75 (0.45) | 1.88 (0.38) | 4.58 (0.90) | 3.17 (0.56) | 7.09 (1.20) |
pool-MCP | 0.76 (0.23) | 0.78 (0.23) | 4.02 (0.91) | 3.53 (0.79) | 7.87 (1.51) | 6.24 (1.20) | 0.42 (0.11) | 1.32 (0.38) | 1.52 (0.37) | 4.06 (0.99) | 2.82 (0.53) | 6.68 (1.31) |
| ||||||||||||
nonzero coef = 1 | ||||||||||||
New | 0.49 (0.12) | 0.45 (0.12) | 0.63 (0.19) | 0.55 (0.16) | 0.62 (0.18) | 0.52 (0.15) | 1.00 (0.30) | 2.48 (0.75) | 1.24 (0.37) | 3.15 (0.92) | 1.10 (0.38) | 2.74 (0.93) |
New+ | 0.48 (0.11) | 0.45 (0.11) | 0.66 (0.20) | 0.57 (0.17) | 0.63 (0.18) | 0.52 (0.15) | 1.10 (0.48) | 2.68 (1.11) | 1.24 (0.37) | 3.15 (0.92) | 1.10 (0.38) | 2.74 (0.93) |
alg1-SB | 0.70 (0.22) | 0.60 (0.19) | 0.64 (0.22) | 0.56 (0.18) | 0.68 (0.20) | 0.57 (0.14) | 1.03 (0.31) | 2.55 (0.76) | 1.18 (0.46) | 3.01 (1.14) | 1.11 (0.32) | 2.70 (0.79) |
sgroup-MCP | 0.25 (0.09) | 0.24 (0.09) | 0.25 (0.10) | 0.24 (0.10) | 0.26 (0.12) | 0.25 (0.12) | 0.31 (0.18) | 0.88 (0.49) | 0.44 (0.25) | 1.24 (0.75) | 0.56 (0.30) | 1.57 (0.84) |
group-MCPT | 0.23 (0.08) | 0.24 (0.08) | 0.25 (0.16) | 0.27 (0.16) | 0.24 (0.08) | 0.25 (0.08) | 0.68 (0.33) | 0.73 (0.35) | 1.62 (0.99) | 1.71 (1.01) | 2.07 (1.13) | 2.16 (1.17) |
indiv-SB | 0.83 (0.22) | 0.76 (0.21) | 0.94 (0.38) | 0.86 (0.33) | 0.86 (0.36) | 0.78 (0.32) | 1.38 (0.36) | 3.73 (1.02) | 1.63 (0.68) | 4.34 (1.64) | 1.40 (0.48) | 3.72 (1.20) |
indiv-MCP | 0.28 (0.12) | 0.30 (0.12) | 0.29 (0.13) | 0.31 (0.13) | 0.28 (0.13) | 0.30 (0.13) | 0.60 (0.26) | 1.93 (0.83) | 0.64 (0.33) | 2.01 (1.03) | 0.64 (0.29) | 2.07 (0.92) |
pool-SB | 0.22 (0.08) | 0.19 (0.08) | 11.10 (0.70) | 8.82 (0.50) | 23.80 (0.65) | 17.23 (0.37) | 0.30 (0.11) | 0.79 (0.30) | 3.96 (0.28) | 9.39 (0.66) | 8.09 (0.20) | 17.50 (0.33) |
pool-MCP | 0.08 (0.06) | 0.09 (0.06) | 8.79 (0.47) | 7.49 (0.44) | 20.36 (0.76) | 15.81 (0.63) | 0.09 (0.06) | 0.29 (0.18) | 3.13 (0.22) | 8.04 (0.57) | 7.01 (0.30) | 16.33 (1.00) |
Table 6.
Simulation: summary statistics on estimation and prediction. In each cell, mean (sd). d = 100 and ρ = 0.5.
σ2 = 1 | σ2 = 3 | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||
Complete | Half | None | Complete | Half | None | |||||||
| ||||||||||||
PMSE | EMSE | PMSE | EMSE | PMSE | EMSE | PMSE | EMSE | PMSE | EMSE | PMSE | EMSE | |
nonzero coef ~ unif(0.2,1) | ||||||||||||
New | 0.57 (0.20) | 0.69 (0.35) | 0.78 (0.24) | 0.71 (0.19) | 0.59 (0.19) | 0.58 (0.20) | 0.63 (0.19) | 1.51 (0.50) | 0.76 (0.23) | 1.80 (0.59) | 0.68 (0.21) | 1.68 (0.49) |
New+ | 0.57 (0.19) | 0.68 (0.34) | 0.81 (0.26) | 0.73 (0.20) | 0.58 (0.19) | 0.57 (0.20) | 0.64 (0.24) | 1.52 (0.53) | 0.75 (0.23) | 1.79 (0.60) | 0.67 (0.21) | 1.68 (0.48) |
alg1-SB | 0.63 (0.18) | 0.56 (0.16) | 0.71 (0.26) | 0.66 (0.23) | 0.62 (0.15) | 0.60 (0.19) | 0.72 (0.17) | 1.71 (0.49) | 0.76 (0.20) | 1.92 (0.53) | 0.70 (0.18) | 1.70 (0.42) |
sgroup-MCP | 0.80 (0.42) | 0.48 (0.22) | 0.94 (0.41) | 0.58 (0.23) | 1.38 (0.58) | 0.79 (0.31) | 1.08 (0.42) | 1.92 (0.72) | 1.21 (0.46) | 2.29 (0.85) | 1.47 (0.47) | 2.66 (0.78) |
group-MCPT | 0.27 (0.18) | 0.39 (0.22) | 0.69 (0.32) | 1.04 (0.43) | 1.04 (0.35) | 1.60 (0.46) | 1.06 (0.55) | 1.60 (0.90) | 2.20 (0.81) | 3.26 (1.14) | 3.18 (0.87) | 4.69 (1.19) |
indiv-SB | 0.66 (0.17) | 0.75 (0.23) | 0.77 (0.21) | 0.90 (0.27) | 0.64 (0.18) | 0.76 (0.23) | 0.70 (0.22) | 1.96 (0.66) | 0.82 (0.22) | 2.31 (0.67) | 0.69 (0.19) | 2.00 (0.69) |
indiv-MCP | 0.67 (0.23) | 1.09 (0.41) | 0.65 (0.25) | 1.06 (0.42) | 0.68 (0.29) | 1.18 (0.50) | 0.80 (0.23) | 3.84 (1.40) | 0.76 (0.22) | 3.48 (1.24) | 0.81 (0.26) | 4.08 (1.49) |
pool-SB | 0.80 (0.27) | 0.86 (0.23) | 5.05 (0.99) | 3.88 (0.85) | 12.92 (2.54) | 6.59 (1.25) | 0.44 (0.11) | 1.43 (0.44) | 1.94 (0.34) | 4.30 (0.95) | 4.58 (0.85) | 6.84 (1.28) |
pool-MCP | 0.69 (0.27) | 0.78 (0.23) | 4.66 (1.07) | 4.21 (1.04) | 11.92 (2.53) | 7.52 (1.84) | 0.39 (0.14) | 1.67 (0.77) | 1.79 (0.45) | 5.26 (1.78) | 4.12 (0.84) | 7.88 (2.00) |
| ||||||||||||
nonzero coef = 1 | ||||||||||||
New | 0.46 (0.20) | 0.62 (0.34) | 0.54 (0.13) | 0.49 (0.13) | 0.49 (0.15) | 0.50 (0.17) | 0.58 (0.21) | 1.65 (0.59) | 0.77 (0.24) | 2.08 (0.70) | 0.64 (0.14) | 1.88 (0.46) |
New+ | 0.46 (0.20) | 0.62 (0.34) | 0.56 (0.14) | 0.50 (0.13) | 0.49 (0.15) | 0.50 (0.17) | 0.61 (0.22) | 1.69 (0.58) | 0.78 (0.24) | 2.10 (0.71) | 0.64 (0.14) | 1.88 (0.46) |
alg1-SB | 0.50 (0.16) | 0.49 (0.11) | 0.59 (0.17) | 0.57 (0.17) | 0.53 (0.21) | 0.51 (0.21) | 0.66 (0.27) | 1.91 (0.76) | 0.71 (0.21) | 1.99 (0.73) | 0.66 (0.19) | 1.80 (0.61) |
sgroup-MCP | 0.35 (0.17) | 0.23 (0.12) | 0.36 (0.21) | 0.24 (0.12) | 0.49 (0.43) | 0.31 (0.23) | 0.58 (0.39) | 1.09 (0.69) | 1.03 (0.74) | 1.90 (1.22) | 1.53 (0.86) | 2.62 (1.21) |
group-MCPT | 0.23 (0.11) | 0.35 (0.16) | 0.25 (0.10) | 0.36 (0.14) | 0.40 (0.23) | 0.56 (0.30) | 0.78 (0.55) | 1.11 (0.64) | 2.03 (0.93) | 2.98 (1.38) | 3.29 (1.19) | 4.93 (2.06) |
indiv-SB | 0.52 (0.14) | 0.62 (0.21) | 0.60 (0.16) | 0.75 (0.25) | 0.50 (0.16) | 0.62 (0.27) | 0.67 (0.19) | 2.35 (0.85) | 0.85 (0.25) | 3.15 (1.02) | 0.67 (0.18) | 2.45 (0.89) |
indiv-MCP | 0.27 (0.11) | 0.40 (0.17) | 0.26 (0.09) | 0.39 (0.15) | 0.26 (0.11) | 0.40 (0.19) | 0.77 (0.34) | 3.81 (2.03) | 0.72 (0.32) | 3.63 (1.95) | 0.73 (0.32) | 3.87 (2.06) |
pool-SB | 0.15 (0.06) | 0.17 (0.09) | 11.51 (0.76) | 8.38 (0.61) | 33.66 (1.72) | 16.48 (0.62) | 0.18 (0.07) | 0.58 (0.34) | 4.13 (0.28) | 9.00 (0.79) | 11.52 (0.60) | 16.69 (0.62) |
pool-MCP | 0.07 (0.05) | 0.10 (0.08) | 10.59 (0.53) | 8.99 (0.67) | 31.19 (1.24) | 19.40 (2.18) | 0.07 (0.05) | 0.32 (0.24) | 3.75 (0.27) | 9.88 (1.05) | 10.62 (0.58) | 20.17 (2.87) |
Table 7.
Simulation: summary statistics on estimation and prediction. In each cell, mean (sd). d = 100 and ρ = 0.8.
σ2 = 1 | σ2 = 3 | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||
Complete | Half | None | Complete | Half | None | |||||||
| ||||||||||||
PMSE | EMSE | PMSE | EMSE | PMSE | EMSE | PMSE | EMSE | PMSE | EMSE | PMSE | EMSE | |
nonzero coef ~ unif(0.2,1) | ||||||||||||
New | 0.36 (0.10) | 1.00 (0.39) | 0.41 (0.12) | 1.04 (0.35) | 0.38 (0.12) | 1.02 (0.38) | 0.31 (0.09) | 2.23 (0.85) | 0.48 (0.14) | 2.27 (0.67) | 0.35 (0.12) | 2.29 (0.69) |
New+ | 0.37 (0.11) | 1.00 (0.39) | 0.43 (0.14) | 1.04 (0.35) | 0.38 (0.12) | 1.02 (0.38) | 0.31 (0.09) | 2.23 (0.85) | 0.48 (0.13) | 2.27 (0.67) | 0.35 (0.12) | 2.29 (0.69) |
alg1-SB | 0.39 (0.11) | 0.97 (0.41) | 0.44 (0.12) | 1.05 (0.40) | 0.40 (0.16) | 1.03 (0.42) | 0.39 (0.10) | 2.41 (0.57) | 0.45 (0.13) | 2.36 (0.76) | 0.34 (0.10) | 2.31 (0.70) |
sgroup-MCP | 2.44 (1.55) | 0.54 (0.27) | 3.03 (1.39) | 0.72 (0.31) | 3.96 (1.49) | 0.85 (0.27) | 2.36 (0.77) | 1.64 (0.56) | 2.68 (0.75) | 2.07 (0.58) | 3.07 (0.91) | 2.22 (0.68) |
group-MCPT | 0.31 (0.19) | 0.98 (0.42) | 0.62 (0.26) | 2.20 (0.71) | 0.94 (0.34) | 3.42 (0.90) | 1.05 (0.58) | 3.53 (1.66) | 1.79 (0.64) | 5.77 (1.56) | 2.44 (0.69) | 8.33 (2.17) |
indiv-SB | 0.38 (0.11) | 1.05 (0.42) | 0.42 (0.13) | 1.17 (0.53) | 0.41 (0.14) | 1.22 (0.50) | 0.37 (0.11) | 2.40 (0.74) | 0.43 (0.11) | 2.73 (0.91) | 0.37 (0.10) | 2.61 (0.88) |
indiv-MCP | 0.86 (0.22) | 4.51 (1.52) | 0.86 (0.26) | 4.14 (1.73) | 0.93 (0.29) | 4.83 (1.61) | 0.71 (0.17) | 8.81 (2.72) | 0.76 (0.19) | 8.65 (2.16) | 0.76 (0.25) | 9.16 (2.64) |
pool-SB | 0.68 (0.40) | 0.99 (0.34) | 3.58 (0.87) | 3.88 (1.01) | 13.82 (2.73) | 7.12 (1.44) | 0.33 (0.16) | 1.70 (0.70) | 1.34 (0.33) | 4.40 (1.21) | 4.77 (0.92) | 7.38 (1.60) |
pool-MCP | 0.83 (0.47) | 2.02 (1.28) | 3.79 (0.87) | 6.08 (1.78) | 14.07 (2.57) | 10.85 (2.09) | 0.47 (0.15) | 5.13 (2.06) | 1.44 (0.32) | 7.51 (1.99) | 4.87 (0.89) | 11.84 (2.48) |
| ||||||||||||
nonzero coef = 1 | ||||||||||||
New | 0.37 (0.11) | 1.30 (0.49) | 0.47 (0.12) | 1.30 (0.59) | 0.46 (0.13) | 1.30 (0.48) | 0.38 (0.14) | 3.47 (1.46) | 0.47 (0.16) | 3.52 (1.34) | 0.37 (0.14) | 3.08 (1.14) |
New+ | 0.39 (0.12) | 1.31 (0.49) | 0.47 (0.12) | 1.30 (0.59) | 0.46 (0.13) | 1.30 (0.48) | 0.42 (0.19) | 3.48 (1.49) | 0.48 (0.17) | 3.53 (1.34) | 0.37 (0.14) | 3.07 (1.15) |
alg1-SB | 0.44 (0.18) | 1.45 (0.68) | 0.49 (0.15) | 1.37 (0.51) | 0.41 (0.11) | 1.34 (0.48) | 0.38 (0.12) | 2.91 (1.04) | 0.45 (0.12) | 3.29 (1.30) | 0.43 (0.10) | 3.26 (1.21) |
sgroup-MCP | 1.60 (1.43) | 0.40 (0.33) | 2.60 (1.81) | 0.63 (0.39) | 4.13 (2.89) | 0.88 (0.51) | 3.06 (1.62) | 1.98 (0.83) | 3.88 (1.36) | 2.50 (0.75) | 4.59 (1.39) | 2.83 (0.68) |
group-MCPT | 0.26 (0.14) | 0.96 (0.48) | 0.68 (0.30) | 2.40 (0.98) | 1.12 (0.44) | 3.79 (1.31) | 0.95 (0.57) | 2.96 (1.57) | 1.99 (0.90) | 7.19 (2.14) | 3.01 (0.90) | 11.34 (2.57) |
indiv-SB | 0.37 (0.12) | 1.23 (0.55) | 0.42 (0.16) | 1.44 (0.76) | 0.41 (0.13) | 1.45 (0.65) | 0.39 (0.11) | 3.29 (1.16) | 0.45 (0.14) | 4.05 (1.74) | 0.42 (0.13) | 4.00 (1.46) |
indiv-MCP | 1.41 (0.33) | 9.09 (2.51) | 1.08 (0.33) | 6.30 (2.09) | 1.51 (0.36) | 9.71 (2.58) | 0.97 (0.24) | 15.33 (3.45) | 1.01 (0.23) | 14.72 (3.45) | 1.03 (0.25) | 16.31 (3.39) |
pool-SB | 0.10 (0.06) | 0.29 (0.19) | 7.84 (0.35) | 8.09 (0.55) | 36.74 (1.39) | 18.13 (1.75) | 0.11 (0.06) | 0.90 (0.60) | 2.76 (0.19) | 8.85 (0.98) | 12.43 (0.57) | 18.25 (2.01) |
pool-MCP | 0.08 (0.09) | 0.27 (0.25) | 8.38 (0.51) | 12.69 (2.94) | 37.40 (1.54) | 27.67 (3.06) | 0.21 (0.16) | 2.57 (2.61) | 3.00 (0.21) | 14.46 (3.00) | 12.64 (0.62) | 28.59 (3.44) |
Table 8.
Simulation: summary statistics on estimation and prediction. In each cell, mean (sd). d = 500 and ρ = 0.2.
σ2 = 1 | σ2 = 3 | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||
Complete | Half | None | Complete | Half | None | |||||||
| ||||||||||||
PMSE | EMSE | PMSE | EMSE | PMSE | EMSE | PMSE | EMSE | PMSE | EMSE | PMSE | EMSE | |
nonzero coef ~ unif(0.2,1) | ||||||||||||
New | 1.69 (1.12) | 1.37 (0.86) | 1.81 (0.53) | 1.48 (0.41) | 1.81 (0.59) | 1.41 (0.44) | 1.78 (0.42) | 4.07 (0.96) | 1.79 (0.41) | 4.25 (0.92) | 1.71 (0.39) | 3.91 (0.90) |
New+ | 1.58 (1.14) | 1.28 (0.88) | 1.85 (0.56) | 1.51 (0.44) | 1.80 (0.59) | 1.41 (0.44) | 1.68 (0.47) | 3.84 (1.07) | 1.82 (0.53) | 4.32 (1.20) | 1.71 (0.39) | 3.91 (0.90) |
alg1-SB | 1.73 (0.48) | 1.34 (0.36) | 1.99 (0.65) | 1.61 (0.51) | 1.69 (0.40) | 1.33 (0.30) | 1.87 (0.45) | 4.26 (1.01) | 1.77 (0.36) | 4.21 (0.84) | 1.73 (0.44) | 3.97 (1.00) |
sgroup-MCP | 0.49 (0.22) | 0.46 (0.22) | 0.69 (0.27) | 0.65 (0.26) | 0.86 (0.33) | 0.80 (0.31) | 0.79 (0.31) | 2.29 (0.97) | 1.01 (0.35) | 3.03 (1.08) | 1.11 (0.31) | 3.38 (0.96) |
group-MCPT | 0.37 (0.25) | 0.39 (0.26) | 0.74 (0.32) | 0.78 (0.33) | 1.53 (0.74) | 1.60 (0.69) | 2.62 (0.90) | 2.53 (0.79) | 3.48 (1.38) | 3.42 (1.17) | 4.42 (1.37) | 4.18 (1.53) |
indiv-SB | 1.53 (0.60) | 1.23 (0.44) | 1.61 (0.45) | 1.35 (0.34) | 1.56 (0.49) | 1.26 (0.37) | 1.80 (0.49) | 4.12 (1.10) | 1.77 (0.51) | 4.22 (1.19) | 1.81 (0.46) | 4.13 (1.04) |
indiv-MCP | 0.88 (0.30) | 0.91 (0.31) | 0.90 (0.24) | 0.94 (0.27) | 0.88 (0.24) | 0.92 (0.27) | 1.04 (0.30) | 2.89 (0.82) | 1.03 (0.32) | 2.93 (0.99) | 1.00 (0.28) | 2.83 (0.90) |
pool-SB | 1.03 (0.30) | 0.99 (0.27) | 5.07 (1.06) | 4.13 (0.82) | 9.57 (1.77) | 7.13 (1.27) | 0.76 (0.21) | 1.98 (0.49) | 1.98 (0.40) | 4.78 (0.92) | 3.23 (0.60) | 7.21 (1.29) |
pool-MCP | 0.75 (0.23) | 0.77 (0.22) | 4.21 (0.91) | 3.64 (0.78) | 8.33 (1.48) | 6.50 (1.15) | 0.45 (0.10) | 1.40 (0.31) | 1.63 (0.37) | 4.25 (0.98) | 3.02 (0.54) | 7.02 (1.23) |
| ||||||||||||
nonzero coef = 1 | ||||||||||||
New | 0.95 (0.31) | 0.77 (0.25) | 1.33 (0.31) | 1.10 (0.25) | 1.19 (0.28) | 0.98 (0.24) | 1.91 (0.85) | 4.53 (1.87) | 2.28 (0.81) | 5.60 (1.92) | 1.99 (0.58) | 4.75 (1.36) |
New+ | 1.18 (0.43) | 0.94 (0.34) | 1.33 (0.31) | 1.10 (0.25) | 1.19 (0.28) | 0.97 (0.24) | 1.87 (0.68) | 4.45 (1.53) | 2.30 (0.75) | 5.65 (1.79) | 1.99 (0.58) | 4.75 (1.36) |
alg1-SB | 1.38 (0.34) | 1.11 (0.27) | 1.16 (0.24) | 0.96 (0.19) | 1.27 (0.35) | 1.04 (0.27) | 2.08 (1.01) | 4.92 (2.27) | 2.65 (1.13) | 6.42 (2.66) | 2.08 (0.79) | 4.90 (1.77) |
sgroup-MCP | 0.23 (0.11) | 0.22 (0.11) | 0.25 (0.11) | 0.23 (0.11) | 0.25 (0.10) | 0.23 (0.09) | 0.34 (0.21) | 0.94 (0.58) | 0.62 (0.43) | 1.72 (1.15) | 0.79 (0.44) | 2.16 (1.18) |
group-MCPT | 0.20 (0.07) | 0.22 (0.08) | 0.24 (0.14) | 0.26 (0.14) | 0.28 (0.17) | 0.29 (0.17) | 0.89 (0.67) | 0.94 (0.69) | 2.11 (1.27) | 2.25 (1.37) | 7.66 (4.73) | 7.49 (4.23) |
indiv-SB | 0.90 (0.66) | 0.78 (0.57) | 0.91 (0.55) | 0.80 (0.45) | 0.90 (0.53) | 0.78 (0.44) | 2.14 (0.76) | 5.13 (1.64) | 2.36 (0.74) | 5.85 (1.66) | 2.15 (0.76) | 5.14 (1.65) |
indiv-MCP | 0.29 (0.12) | 0.30 (0.14) | 0.29 (0.10) | 0.31 (0.10) | 0.33 (0.18) | 0.35 (0.19) | 0.87 (0.31) | 2.70 (0.92) | 0.83 (0.28) | 2.59 (0.85) | 0.87 (0.31) | 2.75 (1.03) |
pool-SB | 0.20 (0.10) | 0.17 (0.08) | 11.28 (0.61) | 8.89 (0.42) | 24.50 (0.45) | 17.64 (0.25) | 0.29 (0.14) | 0.73 (0.36) | 4.09 (0.26) | 9.57 (0.56) | 8.24 (0.10) | 17.76 (0.14) |
pool-MCP | 0.11 (0.10) | 0.11 (0.10) | 9.38 (0.61) | 7.90 (0.62) | 21.32 (1.14) | 16.35 (0.93) | 0.11 (0.10) | 0.35 (0.31) | 3.40 (0.29) | 8.60 (0.87) | 7.39 (0.43) | 16.90 (1.11) |
Table 9.
Simulation: summary statistics on estimation and prediction. In each cell, mean (sd). d = 500 and ρ = 0.5.
σ2 = 1 | σ2 = 3 | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||
Complete | Half | None | Complete | Half | None | |||||||
| ||||||||||||
PMSE | EMSE | PMSE | EMSE | PMSE | EMSE | PMSE | EMSE | PMSE | EMSE | PMSE | EMSE | |
nonzero coef ~ unif(0.2,1) | ||||||||||||
New | 0.78 (0.24) | 0.71 (0.30) | 1.11 (0.33) | 0.88 (0.28) | 0.84 (0.20) | 0.65 (0.16) | 0.91 (0.28) | 1.92 (0.56) | 1.20 (0.35) | 2.54 (0.70) | 0.99 (0.24) | 2.00 (0.51) |
New+ | 0.78 (0.26) | 0.71 (0.30) | 1.36 (1.18) | 1.03 (0.69) | 0.84 (0.20) | 0.65 (0.16) | 1.00 (0.37) | 2.03 (0.67) | 1.20 (0.34) | 2.53 (0.67) | 0.95 (0.26) | 1.95 (0.51) |
alg1-SB | 1.02 (0.26) | 0.78 (0.21) | 1.19 (0.40) | 0.92 (0.30) | 0.93 (0.26) | 0.73 (0.23) | 0.99 (0.33) | 2.12 (0.71) | 1.23 (0.35) | 2.63 (0.67) | 1.02 (0.24) | 2.14 (0.52) |
sgroup-MCP | 1.00 (0.42) | 0.57 (0.22) | 1.22 (0.47) | 0.74 (0.28) | 1.68 (0.69) | 0.93 (0.35) | 1.40 (0.49) | 2.44 (0.84) | 1.50 (0.45) | 2.83 (0.76) | 1.69 (0.47) | 3.12 (0.86) |
group-MCPT | 0.34 (0.22) | 0.47 (0.27) | 0.87 (0.45) | 1.28 (0.56) | 1.33 (0.70) | 1.95 (0.82) | 1.48 (0.83) | 2.37 (1.30) | 2.82 (1.33) | 4.15 (1.60) | 4.41 (1.63) | 6.27 (1.62) |
indiv-SB | 0.77 (0.20) | 0.73 (0.18) | 0.85 (0.23) | 0.82 (0.20) | 0.77 (0.31) | 0.73 (0.24) | 0.92 (0.21) | 2.12 (0.46) | 1.08 (0.26) | 2.47 (0.49) | 0.94 (0.29) | 2.07 (0.56) |
indiv-MCP | 0.98 (0.33) | 1.65 (0.52) | 0.96 (0.33) | 1.56 (0.59) | 0.97 (0.28) | 1.70 (0.51) | 1.04 (0.27) | 4.71 (1.47) | 1.07 (0.28) | 4.71 (1.43) | 1.01 (0.27) | 4.48 (1.35) |
pool-SB | 0.81 (0.28) | 0.86 (0.25) | 5.07 (1.03) | 3.75 (0.75) | 13.66 (2.38) | 6.66 (1.11) | 0.47 (0.13) | 1.38 (0.37) | 2.02 (0.39) | 4.24 (0.85) | 4.81 (0.85) | 6.87 (1.14) |
pool-MCP | 0.77 (0.33) | 0.92 (0.36) | 4.88 (1.22) | 4.40 (1.17) | 11.96 (2.30) | 7.15 (1.48) | 0.51 (0.15) | 2.29 (0.91) | 1.88 (0.37) | 5.45 (1.16) | 4.18 (0.77) | 7.30 (1.58) |
| ||||||||||||
nonzero coef = 1 | ||||||||||||
New | 0.60 (0.21) | 0.58 (0.28) | 0.93 (0.28) | 0.75 (0.23) | 0.79 (0.23) | 0.68 (0.22) | 0.89 (0.26) | 2.59 (0.87) | 1.12 (0.40) | 2.71 (1.06) | 0.92 (0.24) | 2.30 (0.71) |
New+ | 0.61 (0.26) | 0.59 (0.32) | 1.02 (0.31) | 0.80 (0.24) | 0.78 (0.24) | 0.68 (0.23) | 0.91 (0.28) | 2.63 (0.85) | 1.45 (1.35) | 3.28 (2.42) | 0.92 (0.24) | 2.30 (0.71) |
alg1-SB | 0.82 (0.23) | 0.67 (0.21) | 0.94 (0.26) | 0.75 (0.21) | 0.82 (0.21) | 0.72 (0.22) | 0.98 (0.25) | 2.37 (0.59) | 1.04 (0.35) | 2.63 (0.85) | 1.01 (0.29) | 2.46 (0.73) |
sgroup-MCP | 0.38 (0.21) | 0.25 (0.14) | 0.42 (0.29) | 0.29 (0.18) | 0.52 (0.52) | 0.34 (0.32) | 0.92 (0.59) | 1.62 (0.93) | 1.40 (0.77) | 2.51 (1.28) | 2.27 (1.15) | 3.70 (1.64) |
group-MCPT | 0.26 (0.14) | 0.37 (0.16) | 0.29 (0.14) | 0.40 (0.18) | 0.46 (0.32) | 0.59 (0.36) | 0.75 (0.41) | 1.08 (0.47) | 2.86 (1.57) | 4.19 (1.91) | 5.57 (2.61) | 7.79 (3.07) |
indiv-SB | 0.52 (0.15) | 0.56 (0.15) | 0.60 (0.24) | 0.67 (0.33) | 0.52 (0.20) | 0.55 (0.18) | 0.82 (0.23) | 2.45 (0.74) | 0.95 (0.27) | 2.89 (0.86) | 0.84 (0.37) | 2.36 (0.80) |
indiv-MCP | 0.29 (0.09) | 0.39 (0.12) | 0.35 (0.40) | 0.50 (0.76) | 0.31 (0.14) | 0.41 (0.17) | 1.48 (0.36) | 8.21 (2.43) | 1.38 (0.37) | 7.18 (1.96) | 1.47 (0.27) | 8.19 (1.79) |
pool-SB | 0.15 (0.07) | 0.16 (0.08) | 11.41 (0.60) | 8.18 (0.48) | 35.96 (2.15) | 16.57 (0.53) | 0.19 (0.08) | 0.53 (0.28) | 4.19 (0.27) | 8.81 (0.58) | 12.43 (0.72) | 16.88 (0.51) |
pool-MCP | 0.08 (0.04) | 0.11 (0.06) | 10.86 (0.43) | 9.35 (0.78) | 31.84 (1.32) | 18.90 (1.74) | 0.10 (0.06) | 0.41 (0.23) | 3.94 (0.25) | 10.74 (1.36) | 10.84 (0.55) | 18.74 (1.52) |
Table 10.
Simulation: summary statistics on estimation and prediction. In each cell, mean (sd). d = 500 and ρ = 0.8.
σ2 = 1 | σ2 = 3 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||
Complete | Half | None | Complete | Half | None | |||||||
| ||||||||||||
PMSE | EMSE | PMSE | EMSE | PMSE | EMSE | PMSE | EMSE | PMSE | EMSE | PMSE | EMSE | |
nonzero coef ~ unif(0.2,1) | ||||||||||||
New | 0.44 (0.15) | 0.97 (0.35) | 0.65 (0.17) | 1.06 (0.26) | 0.54 (0.17) | 1.09 (0.43) | 0.47 (0.16) | 2.35 (0.95) | 0.59 (0.15) | 2.61 (0.97) | 0.44 (0.12) | 2.17 (0.69) |
New+ | 0.49 (0.16) | 0.98 (0.35) | 0.65 (0.17) | 1.06 (0.26) | 0.54 (0.17) | 1.09 (0.43) | 0.51 (0.17) | 2.40 (1.02) | 0.62 (0.17) | 2.61 (0.98) | 0.44 (0.12) | 2.17 (0.69) |
alg1-SB | 0.57 (0.16) | 1.19 (0.38) | 0.58 (0.21) | 1.00 (0.31) | 0.54 (0.15) | 1.01 (0.24) | 0.55 (0.16) | 2.42 (0.76) | 0.60 (0.19) | 2.20 (0.66) | 0.54 (0.15) | 2.22 (0.60) |
sgroup-MCP | 3.31 (1.65) | 0.71 (0.30) | 4.08 (1.64) | 0.91 (0.35) | 5.11 (1.86) | 1.06 (0.37) | 2.84 (0.79) | 1.93 (0.52) | 2.91 (0.73) | 2.41 (0.71) | 3.38 (0.92) | 2.60 (0.91) |
group-MCPT | 0.32 (0.26) | 1.04 (0.59) | 0.76 (0.38) | 2.36 (0.74) | 1.11 (0.48) | 3.70 (1.12) | 1.15 (0.73) | 3.83 (1.71) | 2.10 (0.92) | 6.38 (1.95) | 2.84 (0.97) | 8.97 (2.44) |
indiv-SB | 0.45 (0.18) | 1.15 (0.54) | 0.48 (0.12) | 1.22 (0.47) | 0.45 (0.11) | 1.15 (0.48) | 0.46 (0.19) | 2.65 (1.04) | 0.55 (0.18) | 2.90 (1.11) | 0.47 (0.10) | 2.66 (0.71) |
indiv-MCP | 1.25 (0.39) | 6.51 (2.16) | 1.32 (0.36) | 6.74 (2.18) | 1.28 (0.35) | 6.64 (1.97) | 0.94 (0.24) | 11.23 (2.48) | 0.95 (0.18) | 10.41 (2.03) | 0.96 (0.20) | 11.35 (2.54) |
pool-SB | 0.68 (0.35) | 1.05 (0.34) | 3.62 (0.80) | 4.01 (0.81) | 14.08 (2.90) | 7.11 (1.45) | 0.33 (0.12) | 1.80 (0.60) | 1.36 (0.30) | 4.54 (1.11) | 4.94 (1.01) | 7.31 (1.46) |
pool-MCP | 1.06 (0.48) | 3.94 (1.7) | 4.08 (0.87) | 7.36 (1.98) | 14.47 (2.92) | 11.37 (2.55) | 0.56 (0.18) | 6.48 (2.27) | 1.58 (0.35) | 9.18 (2.69) | 5.04 (1.00) | 11.90 (3.26) |
| ||||||||||||
nonzero coef = 1 | ||||||||||||
New | 0.52 (0.22) | 1.95 (1.12) | 0.68 (0.16) | 1.42 (0.51) | 0.52 (0.18) | 1.43 (0.72) | 0.48 (0.17) | 3.21 (1.1) | 0.65 (0.23) | 3.38 (1.13) | 0.49 (0.15) | 3.11 (1.37) |
New+ | 0.52 (0.22) | 1.95 (1.12) | 0.69 (0.16) | 1.43 (0.52) | 0.52 (0.17) | 1.44 (0.72) | 0.53 (0.22) | 3.26 (1.11) | 0.67 (0.25) | 3.42 (1.13) | 0.49 (0.15) | 3.11 (1.37) |
alg1-SB | 0.62 (0.22) | 1.51 (0.65) | 0.69 (0.25) | 1.51 (0.85) | 0.59 (0.18) | 1.48 (0.74) | 0.54 (0.15) | 3.30 (1.12) | 0.64 (0.17) | 3.65 (1.33) | 0.58 (0.12) | 3.52 (0.99) |
sgroup-MCP | 2.50 (2.61) | 0.58 (0.51) | 4.25 (3.33) | 0.96 (0.68) | 6.82 (3.64) | 1.33 (0.63) | 4.35 (1.66) | 2.58 (0.84) | 4.89 (1.32) | 2.98 (0.72) | 5.95 (1.29) | 3.45 (0.86) |
group-MCPT | 0.29 (0.19) | 0.97 (0.44) | 0.79 (0.38) | 2.20 (1.09) | 1.46 (0.78) | 4.49 (1.55) | 1.20 (1.20) | 3.16 (1.77) | 2.25 (0.94) | 7.80 (2.29) | 3.69 (1.68) | 12.60 (3.19) |
indiv-SB | 0.43 (0.16) | 1.39 (0.56) | 0.44 (0.10) | 1.43 (0.50) | 0.43 (0.12) | 1.45 (0.62) | 0.48 (0.18) | 3.72 (1.26) | 0.51 (0.13) | 4.12 (1.51) | 0.48 (0.11) | 3.84 (1.43) |
indiv-MCP | 2.11 (0.50) | 13.00 (2.66) | 1.88 (0.55) | 10.45 (3.30) | 1.96 (0.37) | 12.37 (2.29) | 1.32 (0.26) | 20.11 (2.82) | 1.34 (0.40) | 19.59 (4.05) | 1.34 (0.26) | 20.55 (3.11) |
pool-SB | 0.11 (0.06) | 0.31 (0.17) | 7.99 (0.36) | 8.52 (0.87) | 37.08 (1.53) | 17.60 (1.54) | 0.12 (0.08) | 0.96 (0.65) | 2.84 (0.19) | 9.36 (1.27) | 12.62 (0.62) | 17.97 (1.51) |
pool-MCP | 0.98 (0.45) | 6.92 (3.67) | 8.96 (0.49) | 15.54 (2.85) | 38.30 (1.36) | 29.74 (2.29) | 0.47 (0.19) | 9.30 (4.57) | 3.29 (0.32) | 18.28 (4.29) | 13.10 (0.58) | 30.88 (3.06) |
Table 11.
Simulation: summary statistics on estimation and prediction. In each cell, mean (sd). d = 1,000 and ρ = 0.2.
σ2 = 1 | σ2 = 3 | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||
Complete | Half | None | Complete | Half | None | |||||||
| ||||||||||||
PMSE | EMSE | PMSE | EMSE | PMSE | EMSE | PMSE | EMSE | PMSE | EMSE | PMSE | EMSE | |
nonzero coef ~ unif(0.2,1) | ||||||||||||
New | 2.00 (0.69) | 1.57 (0.53) | 2.32 (0.71) | 1.85 (0.53) | 2.13 (0.66) | 1.66 (0.49) | 2.09 (0.59) | 4.75 (1.27) | 2.12 (0.55) | 4.98 (1.3) | 1.95 (0.41) | 4.41 (0.93) |
New+ | 1.83 (0.87) | 1.46 (0.68) | 2.38 (0.81) | 1.90 (0.61) | 2.12 (0.66) | 1.65 (0.50) | 1.82 (0.51) | 4.17 (1.11) | 2.02 (0.60) | 4.77 (1.39) | 1.95 (0.41) | 4.41 (0.93) |
alg1-SB | 2.66 (0.98) | 2.03 (0.71) | 2.36 (0.74) | 1.90 (0.58) | 2.36 (0.90) | 1.81 (0.68) | 2.16 (0.51) | 4.90 (1.15) | 2.22 (0.44) | 5.22 (1.03) | 2.05 (0.52) | 4.62 (1.16) |
sgroup-MCP | 0.58 (0.25) | 0.54 (0.23) | 0.81 (0.30) | 0.77 (0.29) | 1.01 (0.39) | 0.97 (0.39) | 0.86 (0.32) | 2.50 (0.96) | 1.05 (0.34) | 3.18 (1.14) | 1.31 (0.44) | 4.09 (1.43) |
group-MCPT | 0.50 (0.25) | 0.51 (0.25) | 0.83 (0.35) | 0.86 (0.36) | 1.99 (1.11) | 2.03 (1.03) | 3.32 (1.09) | 3.09 (0.85) | 4.02 (1.57) | 3.81 (1.31) | 5.38 (1.66) | 4.83 (1.78) |
indiv-SB | 2.11 (0.76) | 1.66 (0.56) | 2.26 (0.82) | 1.85 (0.62) | 2.06 (0.70) | 1.64 (0.52) | 2.11 (0.49) | 4.78 (1.13) | 2.13 (0.57) | 5.04 (1.37) | 1.97 (0.49) | 4.50 (1.11) |
indiv-MCP | 1.05 (0.32) | 1.08 (0.32) | 1.06 (0.32) | 1.08 (0.33) | 0.98 (0.33) | 1.02 (0.34) | 1.22 (0.30) | 3.40 (0.87) | 1.23 (0.33) | 3.47 (0.96) | 1.19 (0.30) | 3.33 (0.82) |
pool-SB | 1.13 (0.39) | 1.05 (0.37) | 5.41 (1.13) | 4.36 (0.89) | 9.66 (1.72) | 7.18 (1.24) | 0.96 (0.28) | 2.42 (0.65) | 2.20 (0.45) | 5.24 (1.03) | 3.24 (0.58) | 7.23 (1.26) |
pool-MCP | 0.80 (0.28) | 0.83 (0.28) | 4.27 (0.83) | 3.68 (0.70) | 8.69 (1.67) | 6.70 (1.28) | 0.52 (0.18) | 1.61 (0.54) | 1.72 (0.27) | 4.41 (0.72) | 3.12 (0.56) | 7.17 (1.28) |
| ||||||||||||
nonzero coef = 1 | ||||||||||||
New | 1.35 (0.39) | 1.07 (0.29) | 1.68 (0.52) | 1.37 (0.42) | 1.57 (0.40) | 1.24 (0.31) | 2.79 (1.12) | 6.49 (2.51) | 3.34 (1.31) | 8.01 (3.01) | 2.97 (1.02) | 6.92 (2.28) |
New+ | 1.54 (0.50) | 1.22 (0.37) | 1.81 (0.53) | 1.47 (0.44) | 1.55 (0.40) | 1.23 (0.31) | 2.78 (0.99) | 6.44 (2.2) | 3.28 (1.27) | 7.87 (2.91) | 2.97 (1.02) | 6.92 (2.28) |
alg1-SB | 1.49 (0.39) | 1.19 (0.29) | 1.69 (0.50) | 1.39 (0.41) | 1.58 (0.48) | 1.25 (0.37) | 3.02 (1.14) | 7.00 (2.54) | 3.58 (1.28) | 8.57 (2.93) | 3.55 (1.31) | 8.15 (2.82) |
sgroup-MCP | 0.23 (0.09) | 0.22 (0.09) | 0.26 (0.11) | 0.25 (0.11) | 0.29 (0.16) | 0.27 (0.14) | 0.40 (0.31) | 1.11 (0.81) | 0.77 (0.47) | 2.13 (1.29) | 1.17 (0.72) | 3.20 (2.00) |
group-MCPT | 0.22 (0.09) | 0.23 (0.10) | 0.30 (0.25) | 0.32 (0.24) | 0.26 (0.10) | 0.28 (0.11) | 1.40 (1.03) | 1.46 (1.05) | 3.61 (1.83) | 3.72 (1.82) | 6.81 (2.61) | 6.39 (2.35) |
indiv-SB | 1.02 (0.34) | 0.87 (0.29) | 1.06 (0.32) | 0.95 (0.28) | 0.96 (0.24) | 0.83 (0.21) | 3.18 (0.90) | 7.42 (1.97) | 3.80 (0.84) | 9.19 (1.90) | 2.96 (0.80) | 7.02 (1.83) |
indiv-MCP | 0.40 (0.28) | 0.42 (0.28) | 0.37 (0.22) | 0.39 (0.24) | 0.36 (0.24) | 0.38 (0.24) | 1.32 (0.51) | 4.17 (1.61) | 1.29 (0.52) | 4.11 (1.68) | 1.20 (0.42) | 3.80 (1.35) |
pool-SB | 0.24 (0.10) | 0.21 (0.09) | 11.78 (0.75) | 9.24 (0.49) | 24.79 (0.22) | 17.80 (0.12) | 0.38 (0.14) | 0.96 (0.36) | 4.41 (0.41) | 10.24 (0.87) | 8.30 (0.05) | 17.86 (0.08) |
pool-MCP | 0.08 (0.04) | 0.08 (0.04) | 9.35 (0.54) | 7.85 (0.44) | 21.93 (1.03) | 16.56 (0.61) | 0.11 (0.06) | 0.34 (0.19) | 3.44 (0.22) | 8.63 (0.41) | 7.64 (0.37) | 17.08 (0.66) |
Table 12.
Simulation: summary statistics on estimation and prediction. In each cell, mean (sd). d = 1,000 and ρ = 0.5.
σ2 = 1 | σ2 = 3 | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||
Complete | Half | None | Complete | Half | None | |||||||
| ||||||||||||
PMSE | EMSE | PMSE | EMSE | PMSE | EMSE | PMSE | EMSE | PMSE | EMSE | PMSE | EMSE | |
nonzero coef ~ unif(0.2,1) | ||||||||||||
New | 0.85 (0.26) | 0.78 (0.31) | 1.29 (0.42) | 0.98 (0.31) | 1.09 (0.26) | 0.79 (0.17) | 1.13 (0.34) | 2.25 (0.62) | 1.46 (0.47) | 2.96 (0.82) | 1.12 (0.32) | 2.20 (0.54) |
New+ | 0.83 (0.26) | 0.77 (0.31) | 1.36 (0.45) | 1.01 (0.34) | 1.09 (0.26) | 0.79 (0.17) | 1.13 (0.36) | 2.24 (0.62) | 1.46 (0.52) | 2.97 (0.87) | 1.12 (0.32) | 2.20 (0.54) |
alg1-SB | 1.11 (0.24) | 0.85 (0.19) | 1.35 (0.36) | 0.98 (0.24) | 1.09 (0.21) | 0.83 (0.19) | 1.15 (0.27) | 2.25 (0.49) | 1.38 (0.31) | 2.79 (0.63) | 1.25 (0.36) | 2.45 (0.69) |
sgroup-MCP | 1.20 (0.53) | 0.66 (0.27) | 1.39 (0.53) | 0.82 (0.29) | 1.94 (0.84) | 1.09 (0.46) | 1.51 (0.45) | 2.69 (0.87) | 1.59 (0.52) | 3.05 (0.91) | 1.83 (0.54) | 3.50 (1.03) |
group-MCPT | 0.43 (0.55) | 0.59 (0.59) | 1.10 (0.67) | 1.52 (0.68) | 2.03 (1.15) | 2.68 (1.12) | 1.79 (1.09) | 2.57 (1.36) | 3.47 (1.73) | 4.67 (1.86) | 5.45 (2.24) | 7.11 (2.00) |
indiv-SB | 1.02 (0.36) | 0.82 (0.25) | 1.12 (0.43) | 0.96 (0.29) | 0.98 (0.29) | 0.82 (0.26) | 1.28 (0.49) | 2.44 (0.76) | 1.52 (0.62) | 3.03 (1.06) | 1.26 (0.40) | 2.47 (0.76) |
indiv-MCP | 1.18 (0.36) | 1.97 (0.76) | 1.18 (0.40) | 1.85 (0.67) | 1.23 (0.41) | 2.07 (0.76) | 1.22 (0.31) | 4.78 (1.32) | 1.17 (0.33) | 4.35 (1.26) | 1.15 (0.28) | 4.74 (1.39) |
pool-SB | 0.89 (0.32) | 0.90 (0.27) | 5.40 (1.12) | 3.79 (0.78) | 14.03 (2.86) | 6.80 (1.25) | 0.56 (0.21) | 1.47 (0.46) | 2.24 (0.43) | 4.33 (0.81) | 4.95 (1.00) | 7.02 (1.28) |
pool-MCP | 0.84 (0.29) | 1.02 (0.36) | 4.98 (1.04) | 4.44 (1.10) | 12.36 (2.35) | 7.37 (1.39) | 0.59 (0.17) | 2.68 (0.80) | 1.95 (0.35) | 5.26 (1.09) | 4.35 (0.84) | 7.42 (1.40) |
| ||||||||||||
nonzero coef = 1 | ||||||||||||
New | 0.64 (0.20) | 0.76 (0.39) | 1.07 (0.25) | 0.85 (0.18) | 1.02 (0.23) | 0.81 (0.20) | 0.99 (0.34) | 2.38 (0.92) | 1.40 (0.45) | 3.23 (0.96) | 1.01 (0.21) | 2.36 (0.61) |
New+ | 0.64 (0.20) | 0.76 (0.39) | 1.18 (0.31) | 0.92 (0.20) | 1.02 (0.23) | 0.81 (0.20) | 1.10 (0.40) | 2.57 (1.06) | 1.41 (0.45) | 3.25 (0.97) | 1.01 (0.21) | 2.36 (0.61) |
alg1-SB | 0.89 (0.22) | 0.71 (0.20) | 1.04 (0.31) | 0.87 (0.27) | 0.95 (0.23) | 0.72 (0.19) | 1.11 (0.26) | 2.62 (0.63) | 1.33 (0.28) | 3.20 (0.75) | 1.18 (0.30) | 2.67 (0.81) |
sgroup-MCP | 0.38 (0.22) | 0.24 (0.13) | 0.42 (0.25) | 0.28 (0.17) | 0.53 (0.45) | 0.35 (0.27) | 1.21 (0.84) | 2.01 (1.19) | 1.87 (0.87) | 3.20 (1.32) | 2.63 (1.18) | 4.38 (1.66) |
group-MCPT | 0.25 (0.13) | 0.39 (0.20) | 0.40 (0.28) | 0.50 (0.29) | 0.51 (0.45) | 0.63 (0.46) | 0.94 (1.00) | 1.31 (1.11) | 4.28 (2.64) | 5.86 (2.56) | 7.49 (3.10) | 9.39 (2.78) |
indiv-SB | 0.65 (0.23) | 0.59 (0.16) | 0.69 (0.28) | 0.71 (0.26) | 0.63 (0.19) | 0.64 (0.22) | 1.17 (0.45) | 2.91 (0.97) | 1.33 (0.53) | 3.55 (1.08) | 1.16 (0.42) | 3.02 (1.04) |
indiv-MCP | 0.32 (0.14) | 0.44 (0.20) | 0.36 (0.19) | 0.50 (0.26) | 0.35 (0.14) | 0.49 (0.21) | 1.78 (0.39) | 9.95 (2.28) | 1.79 (0.46) | 9.14 (2.24) | 1.73 (0.33) | 9.63 (1.89) |
pool-SB | 0.18 (0.09) | 0.17 (0.08) | 11.85 (0.72) | 8.17 (0.58) | 36.72 (1.65) | 16.81 (0.41) | 0.24 (0.13) | 0.62 (0.34) | 4.46 (0.35) | 8.85 (0.59) | 12.69 (0.60) | 17.10 (0.40) |
pool-MCP | 0.08 (0.07) | 0.12 (0.09) | 11.12 (0.62) | 9.43 (1.07) | 32.46 (1.32) | 18.64 (2.10) | 0.12 (0.09) | 0.50 (0.39) | 4.11 (0.37) | 11.08 (1.94) | 11.06 (0.56) | 18.73 (2.35) |
Table 13.
Simulation: summary statistics on estimation and prediction. In each cell, mean (sd). d = 1,000 and ρ = 0.8.
σ2 = 1 | σ2 = 3 | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||
Complete | Half | None | Complete | Half | None | |||||||
| ||||||||||||
PMSE | EMSE | PMSE | EMSE | PMSE | EMSE | PMSE | EMSE | PMSE | EMSE | PMSE | EMSE | |
nonzero coef ~ unif(0.2,1) | ||||||||||||
New | 0.50 (0.15) | 1.01 (0.37) | 0.72 (0.14) | 1.08 (0.31) | 0.58 (0.15) | 1.15 (0.40) | 0.50 (0.17) | 2.47 (1.12) | 0.69 (0.21) | 2.57 (0.74) | 0.51 (0.11) | 2.18 (0.79) |
New+ | 0.57 (0.19) | 1.03 (0.38) | 0.75 (0.17) | 1.09 (0.31) | 0.58 (0.15) | 1.15 (0.40) | 0.53 (0.18) | 2.47 (1.12) | 0.73 (0.20) | 2.60 (0.74) | 0.51 (0.11) | 2.18 (0.79) |
alg1-SB | 0.57 (0.14) | 1.11 (0.32) | 0.72 (0.17) | 1.14 (0.34) | 0.63 (0.12) | 1.13 (0.33) | 0.65 (0.17) | 2.82 (0.91) | 0.70 (0.23) | 2.65 (0.83) | 0.57 (0.14) | 2.57 (0.87) |
sgroup-MCP | 3.86 (1.84) | 0.81 (0.35) | 4.65 (1.65) | 1.03 (0.38) | 5.82 (1.96) | 1.19 (0.39) | 3.04 (0.83) | 2.13 (0.65) | 3.13 (0.75) | 2.73 (0.81) | 3.69 (0.97) | 2.85 (0.85) |
group-MCPT | 0.44 (0.50) | 1.11 (0.73) | 0.85 (0.51) | 2.59 (0.91) | 1.31 (0.58) | 3.98 (1.27) | 1.31 (0.89) | 4.31 (1.89) | 2.27 (1.08) | 6.86 (1.95) | 3.24 (1.21) | 10.07 (2.88) |
indiv-SB | 0.48 (0.15) | 0.95 (0.41) | 0.53 (0.15) | 1.11 (0.47) | 0.48 (0.16) | 1.07 (0.43) | 0.52 (0.17) | 2.23 (0.89) | 0.62 (0.19) | 2.43 (0.96) | 0.51 (0.16) | 2.39 (0.88) |
indiv-MCP | 1.41 (0.40) | 6.96 (1.75) | 1.41 (0.38) | 6.76 (2.34) | 1.47 (0.43) | 7.33 (2.42) | 1.03 (0.20) | 11.58 (2.88) | 0.97 (0.19) | 9.98 (2.60) | 1.07 (0.23) | 12.06 (2.97) |
pool-SB | 0.71 (0.44) | 0.99 (0.39) | 3.66 (0.78) | 3.89 (0.95) | 14.11 (2.67) | 6.79 (1.36) | 0.36 (0.19) | 1.63 (0.73) | 1.40 (0.28) | 4.43 (0.99) | 4.95 (0.89) | 6.98 (1.38) |
pool-MCP | 1.13 (0.49) | 4.36 (1.48) | 4.27 (0.99) | 7.97 (2.47) | 14.62 (2.78) | 11.91 (2.64) | 0.62 (0.19) | 7.32 (2.25) | 1.65 (0.31) | 9.74 (1.91) | 5.03 (0.95) | 12.10 (2.95) |
| ||||||||||||
nonzero coef = 1 | ||||||||||||
New | 0.52 (0.20) | 1.63 (0.95) | 0.76 (0.26) | 1.64 (0.91) | 0.64 (0.27) | 1.58 (0.89) | 0.50 (0.14) | 3.15 (0.89) | 0.72 (0.22) | 3.19 (1.01) | 0.58 (0.14) | 3.53 (1.38) |
New+ | 0.56 (0.28) | 1.66 (0.98) | 0.77 (0.26) | 1.65 (0.91) | 0.64 (0.27) | 1.58 (0.89) | 0.53 (0.17) | 3.19 (0.90) | 0.75 (0.24) | 3.22 (1.03) | 0.58 (0.14) | 3.53 (1.38) |
alg1-SB | 0.64 (0.20) | 1.35 (0.72) | 0.71 (0.19) | 1.45 (0.60) | 0.63 (0.17) | 1.35 (0.60) | 0.61 (0.22) | 3.72 (1.6) | 0.79 (0.24) | 3.93 (1.66) | 0.64 (0.17) | 3.53 (1.36) |
sgroup-MCP | 3.05 (2.70) | 0.71 (0.56) | 5.37 (2.97) | 1.15 (0.61) | 8.50 (4.53) | 1.61 (0.79) | 4.74 (1.40) | 2.79 (0.71) | 5.30 (1.41) | 3.42 (1.04) | 6.34 (1.50) | 3.78 (1.10) |
group-MCPT | 0.32 (0.31) | 0.96 (0.54) | 0.97 (0.76) | 2.75 (1.40) | 1.59 (0.70) | 4.00 (1.48) | 1.39 (1.35) | 3.46 (2.43) | 2.74 (1.49) | 8.46 (3.02) | 4.28 (1.98) | 13.53 (4.13) |
indiv-SB | 0.44 (0.14) | 1.21 (0.59) | 0.49 (0.17) | 1.40 (0.71) | 0.46 (0.17) | 1.41 (0.67) | 0.50 (0.17) | 3.21 (1.48) | 0.57 (0.20) | 3.86 (1.75) | 0.51 (0.17) | 3.67 (1.45) |
indiv-MCP | 2.22 (0.55) | 13.55 (3.03) | 2.60 (0.85) | 14.54 (4.41) | 2.28 (0.48) | 14.04 (2.54) | 1.43 (0.27) | 21.05 (2.44) | 1.54 (0.30) | 22.27 (3.79) | 1.45 (0.34) | 21.57 (3.36) |
pool-SB | 0.12 (0.07) | 0.28 (0.18) | 7.93 (0.30) | 8.12 (1.14) | 37.30 (1.30) | 17.08 (1.10) | 0.14 (0.08) | 0.90 (0.62) | 2.82 (0.16) | 8.81 (1.39) | 12.66 (0.48) | 17.28 (1.31) |
pool-MCP | 1.17 (0.37) | 7.87 (2.84) | 9.18 (0.58) | 16.71 (3.39) | 38.85 (1.38) | 30.73 (2.65) | 0.52 (0.17) | 9.49 (3.52) | 3.36 (0.21) | 20.16 (3.74) | 13.10 (0.45) | 31.13 (2.64) |
Table 14.
Simulation: summary statistics on identification. In each cell, mean (sd). d = 100 and ρ = 0.2.
σ2 = 1 | σ2 = 3 | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||
Complete | Half | None | Complete | Half | None | |||||||
| ||||||||||||
TP | FP | TP | FP | TP | FP | TP | FP | TP | FP | TP | FP | |
nonzero coef ~ unif(0.2,1) | ||||||||||||
New | 16.8 (3.1) | 0.03 (0.2) | 16.6 (1.0) | 1.7 (1.1) | 16.6 (1.2) | 1.6 (1.3) | 11.8 (2.5) | 0.4 (0.6) | 11.5 (2.3) | 0.7 (1.1) | 11.7 (2.7) | 0.9 (0.9) |
New+ | 17.0 (3.2) | 0.0 (0) | 16.7 (1.0) | 1.6 (1.1) | 16.6 (1.2) | 1.6 (1.3) | 12.9 (2.9) | 0.2 (0.5) | 11.2 (3.1) | 0.5 (0.7) | 11.8 (2.7) | 0.9 (0.9) |
alg1-SB | 16.4 (1.4) | 2.1 (1.2) | 16.0 (1.3) | 2.1 (1.3) | 16.4 (1.2) | 2.0 (1.4) | 11.3 (2.0) | 1.0 (1.3) | 11.0 (2.1) | 1.0 (1.0) | 11.7 (2.2) | 1.0 (1.3) |
sgroup-MCP | 16.7 (1.2) | 1.5 (2.0) | 15.6 (1.3) | 2.5 (2.9) | 14.8 (1.6) | 3.7 (3.4) | 13.2 (2.3) | 2.3 (2.4) | 11.8 (2.5) | 3.0 (2.8) | 10.6 (2.5) | 3.9 (3.2) |
group-MCPT | 16.7 (1.1) | 0.2 (1.0) | 16.1 (1.4) | 15.9 (5.7) | 14.6 (1.9) | 4.6 (2.8) | 11.8 (2.1) | 0.5 (1.0) | 11.2 (2.6) | 1.9 (1.8) | 9.6 (2.7) | 21.1 (7.9) |
indiv-SB | 14.7 (1.5) | 0.2 (0.4) | 14.2 (1.7) | 0.3 (0.6) | 14.4 (1.5) | 0.3 (0.6) | 10.5 (1.8) | 0.4 (0.6) | 9.4 (2.4) | 0.6 (0.6) | 10.6 (2.2) | 0.7 (0.8) |
indiv-MCP | 16.3 (1.2) | 10.3 (4.7) | 16.4 (1.3) | 10.9 (4.6) | 16.5 (1.2) | 11.3 (4.8) | 13.3 (2.0) | 11.6 (5.8) | 13.0 (2.0) | 12.3 (5.9) | 12.9 (2.2) | 12.8 (5.9) |
pool-SB | 17.6 (1.3) | 0.0 (0) | 10.7 (1.8) | 4.5 (2.0) | 3.0 (1.5) | 6.1 (3.0) | 15.8 (2.5) | 0.0 (0) | 9.0 (2.3) | 2.6 (1.6) | 1.6 (1.1) | 3.6 (2.3) |
pool-MCP | 18.0 (0) | 10.1 (11.2) | 14.6 (1.4) | 27.1 (14.4) | 10.0 (1.8) | 37.0 (12.4) | 17.5 (1.6) | 16.0 (14.1) | 12.8 (2.2) | 25.9 (14.0) | 7.8 (2.5) | 28.8 (14.2) |
| ||||||||||||
nonzero coef = 1 | ||||||||||||
New | 18.0 (0) | 0.2 (0.5) | 18.0 (0) | 2.5 (1.7) | 18.0 (0) | 3.5 (1.5) | 18.0 (0) | 0.8 (1.1) | 17.9 (0.3) | 1.5 (1.4) | 17.9 (0.4) | 1.4 (1.0) |
New+ | 18.0 (0) | 0.2 (0.8) | 18.0 (0) | 1.1 (1.7) | 18.0 (0) | 1.5 (1.5) | 18.0 (0) | 0.4 (0.7) | 17.9 (0.3) | 1.5 (1.4) | 17.9 (0.4) | 1.4 (1.0) |
alg1-SB | 18.0 (0) | 3.4 (1.7) | 18.0 (0) | 3.5 (2.3) | 18.0 (0) | 3.6 (1.5) | 17.9 (0.4) | 1.4 (1.2) | 17.8 (0.5) | 1.6 (1.3) | 17.9 (0.4) | 1.8 (1.1) |
sgroup-MCP | 18.0 (0) | 0.4 (1.1) | 18.0 (0) | 0.6 (1.0) | 18.0 (0) | 1.0 (1.8) | 18.0 (0.2) | 1.1 (1.6) | 17.8 (0.5) | 2.7 (2.3) | 17.7 (0.6) | 5.2 (3.2) |
group-MCPT | 18.0 (0) | 0.0 (0) | 18.0 (0.1) | 0.0 (0) | 18.0 (0) | 0.0 (0) | 18.0 (0) | 0.1 (0.6) | 17.5 (0.8) | 1.7 (2.0) | 17.3 (1.0) | 2.6 (2.3) |
indiv-SB | 18.0 (0) | 0.1 (0.3) | 18.0 (0) | 0.0 (0) | 18.0 (0) | 0.0 (0.2) | 17.2 (0.8) | 1.2 (0.4) | 16.6 (1.4) | 1.3 (0.5) | 17.0 (1.0) | 1.3 (0.5) |
indiv-MCP | 18.0 (0) | 3.3 (2.8) | 18.0 (0) | 4.3 (3.7) | 18.0 (0) | 4.2 (4.2) | 18.0 (0.2) | 14.8 (5.9) | 17.9 (0.4) | 14.1 (5.9) | 17.9 (0.4) | 14.5 (5.5) |
pool-SB | 18.0 (0) | 0.0 (0) | 12.1 (1.4) | 6.3 (2.8) | 3.7 (1.7) | 7.4 (3.4) | 18.0 (0) | 0.0 (0) | 11.3 (1.2) | 4.7 (2.4) | 2.5 (1.4) | 5.0 (3.0) |
pool-MCP | 18.0 (0) | 3.6 (8.0) | 16.4 (1.3) | 32.5 (13.5) | 12.0 (1.9) | 40.7 (14.7) | 18.0 (0) | 6.1 (8.7) | 15.4 (1.3) | 29.7 (11.3) | 10.6 (1.9) | 36.1 (16.3) |
Table 15.
Simulation: summary statistics on identification. In each cell, mean (sd). d = 100 and ρ = 0.5.
σ2 = 1 | σ2 = 3 | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||
Complete | Half | None | Complete | Half | None | |||||||
| ||||||||||||
TP | FP | TP | FP | TP | FP | TP | FP | TP | FP | TP | FP | |
nonzero coef ~ unif(0.2,1) | ||||||||||||
New | 17.7 (0.8) | 0.1 (0.3) | 17.0 (1.0) | 2.2 (1.5) | 17.3 (0.8) | 1.7 (1.3) | 15.1 (1.5) | 0.5 (0.9) | 13.7 (1.5) | 1.3 (1.2) | 14.7 (1.6) | 1.3 (0.8) |
New+ | 17.8 (0.8) | 0.0 (0) | 16.9 (1.0) | 1.9 (1.3) | 17.3 (0.8) | 1.9 (1.3) | 15.3 (1.6) | 0.3 (0.6) | 13.8 (1.5) | 1.3 (1.2) | 14.7 (1.6) | 1.3 (0.9) |
alg1-SB | 17.3 (0.7) | 2.0 (1.3) | 17.2 (0.9) | 1.9 (1.2) | 16.8 (0.9) | 1.9 (1.5) | 14.2 (1.8) | 1.1 (1.1) | 14.0 (1.7) | 1.3 (1.1) | 14.4 (1.7) | 1.5 (0.9) |
sgroup-MCP | 15.0 (1.8) | 1.5 (2.1) | 14.3 (1.7) | 2.5 (2.3) | 13.0 (2.0) | 3.4 (2.9) | 11.0 (2.1) | 2.3 (2.8) | 10.2 (2.0) | 3.2 (3.4) | 9.3 (1.9) | 3.5 (3.5) |
group-MCPT | 17.7 (0.7) | 1.1 (2.9) | 16.2 (1.2) | 17.8 (7.9) | 15.1 (1.7) | 33.2 (8.0) | 15.1 (2.4) | 1.1 (2.2) | 13.0 (2.4) | 12.3 (5.9) | 10.3 (2.5) | 10.5 (5.2) |
indiv-SB | 16.1 (1.2) | 0.3 (0.7) | 15.6 (1.2) | 0.2 (0.5) | 16.1 (1.2) | 0.2 (0.6) | 13.5 (1.6) | 1.4 (0.9) | 12.3 (1.4) | 1.5 (0.9) | 13.3 (1.4) | 1.5 (0.7) |
indiv-MCP | 14.8 (1.5) | 10.7 (5.5) | 14.9 (1.6) | 9.8 (4.8) | 14.6 (1.7) | 11.4 (5.4) | 11.0 (1.5) | 9.1 (5.2) | 11.1 (1.6) | 8.7 (4.8) | 11.0 (1.6) | 9.8 (5.1) |
pool-SB | 18.0 (0) | 0.0 (0) | 12.7 (1.6) | 8.4 (2.3) | 6.2 (1.5) | 12.5 (3.0) | 17.2 (1.6) | 0.0 (0) | 11.1 (1.5) | 6.1 (2.5) | 4.2 (1.4) | 8.5 (2.8) |
pool-MCP | 18.0 (0) | 8.2 (7.7) | 13.2 (1.4) | 22.3 (11.2) | 8.2 (1.1) | 29.6 (13.6) | 16.4 (2.3) | 15.0 (9.6) | 11.5 (2.3) | 20.0 (13.0) | 6.9 (1.4) | 23.5 (10.8) |
| ||||||||||||
nonzero coef = 1 | ||||||||||||
New | 18.0 (0) | 0.0 (0) | 18.0 (0) | 2.4 (1.7) | 18.0 (0) | 2.3 (1.3) | 18.0 (0) | 0.6 (0.9) | 18.0 (0.2) | 1.5 (1.0) | 18.0 (0) | 1.6 (1.3) |
New+ | 18.0 (0) | 0.0 (0) | 18.0 (0) | 1.1 (1.7) | 18.0 (0) | 1.4 (1.3) | 18.0 (0) | 0.2 (0.5) | 18.0 (0.2) | 1.4 (1.0) | 18.0 (0) | 1.6 (1.3) |
alg1-SB | 18.0 (0) | 2.3 (1.6) | 18.0 (0) | 2.6 (1.7) | 18.0 (0) | 2.5 (1.3) | 18.0 (0) | 1.6 (1.3) | 17.9 (0.3) | 1.8 (1.1) | 18.0 (0) | 1.6 (1.1) |
sgroup-MCP | 18.0 (0) | 0.3 (0.9) | 18.0 (0.1) | 0.6 (1.1) | 18.0 (0.2) | 1.4 (1.7) | 17.5 (0.9) | 1.9 (2.4) | 16.6 (1.6) | 3.9 (3.5) | 15.6 (2.0) | 5.4 (3.5) |
group-MCPT | 18.0 (0) | 0.0 (0) | 18.0 (0) | 0.0 (0.1) | 18.0 (0.1) | 0.6 (0.7) | 18.0 (0) | 1.0 (3.2) | 17.4 (0.9) | 6.9 (4.4) | 16.9 (1.7) | 14.4 (5.4) |
indiv-SB | 18.0 (0) | 0.0 (0.2) | 18.0 (0) | 0.0 (0) | 18.0 (0) | 0.0 (0.2) | 17.7 (0.5) | 1.2 (0.5) | 17.5 (0.6) | 1.1 (0.3) | 17.6 (0.5) | 1.1 (0.3) |
indiv-MCP | 18.0 (0) | 4.9 (3.4) | 18.0 (0) | 4.0 (2.9) | 18.0 (0) | 4.9 (3.5) | 16.9 (1.1) | 14.5 (5.2) | 16.8 (1.2) | 14.1 (4.2) | 16.8 (1.1) | 15.5 (4.5) |
pool-SB | 18.0 (0) | 0.0 (0) | 14.5 (1.1) | 11.1 (2.2) | 7.1 (1.5) | 14.2 (3.1) | 18.0 (0) | 0.0 (0) | 13.2 (1.2) | 8.4 (2.3) | 6.3 (1.5) | 12.5 (2.9) |
pool-MCP | 18.0 (0) | 1.8 (5.7) | 14.7 (0.8) | 28.9 (14.5) | 8.8 (1.1) | 32.0 (14.0) | 18.0 (0) | 2.8 (6.1) | 13.9 (0.8) | 29.1 (13.6) | 8.1 (1.1) | 29.8 (13.1) |
Table 16.
Simulation: summary statistics on identification. In each cell, mean (sd). d = 100 and ρ = 0.8.
σ2 = 1 | σ2 = 3 | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||
Complete | Half | None | Complete | Half | None | |||||||
| ||||||||||||
TP | FP | TP | FP | TP | FP | TP | FP | TP | FP | TP | FP | |
nonzero coef ~ unif(0.2,1) | ||||||||||||
New | 17.7 (0.7) | 0.7 (0.8) | 16.8 (1.3) | 3.5 (1.5) | 16.8 (0.9) | 2.2 (1.5) | 15.8 (1.9) | 0.5 (0.7) | 14.3 (1.2) | 2.7 (1.2) | 14.6 (1.5) | 1.6 (1.4) |
New+ | 17.7 (0.6) | 0.5 (0.9) | 16.8 (1.3) | 3.2 (1.5) | 16.8 (0.9) | 2.3 (1.4) | 16.1 (1.8) | 0.3 (0.6) | 14.3 (1.2) | 2.5 (1.2) | 14.6 (1.5) | 1.7 (1.3) |
alg1-SB | 17.0 (1.0) | 2.2 (1.2) | 16.7 (1.0) | 3.7 (1.7) | 16.9 (0.8) | 2.4 (1.4) | 14.6 (1.8) | 2.1 (1.5) | 14.2 (1.7) | 2.6 (1.3) | 15.0 (1.8) | 1.4 (1.1) |
sgroup-MCP | 12.7 (2.4) | 1.9 (2.7) | 11.8 (2.0) | 2.5 (2.5) | 10.6 (1.9) | 2.9 (2.8) | 8.5 (1.8) | 1.6 (2.3) | 7.7 (1.6) | 2.1 (1.9) | 7.0 (1.4) | 2.8 (2.9) |
group-MCPT | 17.7 (0.8) | 2.3 (4.0) | 16.2 (1.4) | 17.5 (6.1) | 15.4 (1.8) | 34.2 (8.6) | 15.1 (2.9) | 2.4 (4.8) | 12.6 (2.6) | 11.6 (7.1) | 10.6 (2.6) | 23.3 (8.0) |
indiv-SB | 16.6 (1.1) | 1.5 (0.6) | 16.2 (1.4) | 2.2 (1.1) | 16.4 (1.3) | 1.7 (0.9) | 14.5 (1.6) | 1.7 (0.8) | 13.3 (2.0) | 1.5 (1.3) | 13.7 (1.8) | 1.9 (1.0) |
indiv-MCP | 10.1 (1.0) | 7.3 (3.3) | 10.6 (1.3) | 8.0 (3.9) | 9.9 (1.0) | 7.0 (3.6) | 7.8 (1.1) | 7.6 (3.6) | 7.5 (1.2) | 7.3 (4.1) | 7.4 (1.1) | 8.5 (4.1) |
pool-SB | 18.0 (0) | 0.6 (1.5) | 14.7 (1.3) | 12.2 (2.1) | 8.8 (1.4) | 17.7 (2.8) | 17.0 (1.4) | 0.7 (1.5) | 13.2 (2.1) | 10.5 (2.4) | 7.9 (1.5) | 15.9 (3.0) |
pool-MCP | 14.9 (3.0) | 11.9 (10.6) | 9.6 (2.1) | 15.7 (7.4) | 5.0 (1.0) | 22.7 (8.9) | 10.9 (2.2) | 9.0 (7.6) | 8.4 (1.5) | 16.4 (8.6) | 4.5 (1.1) | 19.0 (8.7) |
| ||||||||||||
nonzero coef = 1 | ||||||||||||
New | 18.0 (0) | 0.6 (1.0) | 18.0 (0) | 3.8 (1.9) | 18.0 (0) | 2.3 (1.5) | 18.0 (0) | 0.9 (1.1) | 17.7 (0.5) | 3.6 (1.7) | 17.8 (0.4) | 1.9 (1.5) |
New+ | 18.0 (0) | 0.4 (0.9) | 18.0 (0) | 2.8 (2.0) | 18.0 (0) | 1.2 (1.4) | 17.9 (0.6) | 0.3 (0.6) | 17.7 (0.5) | 2.4 (1.6) | 17.8 (0.4) | 1.0 (1.6) |
alg1-SB | 18.0 (0) | 1.8 (1.2) | 18.0 (0) | 3.5 (1.2) | 18.0 (0) | 2.4 (1.5) | 17.7 (0.5) | 1.8 (1.2) | 17.8 (0.6) | 3.4 (1.9) | 17.8 (0.4) | 2.3 (1.3) |
sgroup-MCP | 17.5 (0.9) | 1.1 (1.7) | 16.8 (1.2) | 2.1 (2.1) | 15.9 (1.8) | 3.0 (2.4) | 13.3 (2.7) | 2.1 (2.4) | 11.9 (2.0) | 2.4 (2.6) | 11.0 (1.8) | 3.0 (2.9) |
group-MCPT | 18.0 (0) | 0.2 (0.7) | 18.0 (0.2) | 12.7 (3.2) | 18.0 (0.2) | 19.7 (4.8) | 17.9 (0.7) | 2.5 (4.1) | 16.9 (1.2) | 17.9 (5.8) | 16.3 (1.7) | 37.2 (8.0) |
indiv-SB | 18.0 (0) | 1.4 (0.6) | 18.0 (0) | 2.0 (1.0) | 18.0 (0) | 1.5 (0.8) | 17.6 (0.6) | 0.5 (0.7) | 17.5 (0.7) | 1.2 (1.0) | 17.5 (0.6) | 0.6 (0.9) |
indiv-MCP | 12.7 (1.2) | 4.0 (2.4) | 14.3 (1.3) | 5.7 (2.8) | 12.5 (1.4) | 4.3 (2.5) | 10.3 (1.0) | 7.9 (3.5) | 10.5 (1.2) | 8.2 (4.0) | 10.0 (1.3) | 8.1 (4.0) |
pool-SB | 18.0 (0) | 0.3 (1.2) | 16.2 (1.1) | 14.6 (2.3) | 9.4 (1.4) | 18.8 (2.9) | 18.0 (0) | 0.5 (1.4) | 15.5 (1.0) | 13.3 (2.1) | 9.4 (1.8) | 18.7 (3.6) |
pool-MCP | 18.0 (0) | 2.3 (5.3) | 11.6 (1.8) | 18.3 (10.3) | 5.4 (1.1) | 22.1 (9.9) | 16.9 (1.7) | 13.0 (9.2) | 10.8 (1.8) | 20.8 (10.9) | 5.1 (1.1) | 22.4 (9.2) |
Table 17.
Simulation: summary statistics on identification. In each cell, mean (sd). d = 500 and ρ = 0.2.
σ2 = 1 | σ2 = 3 | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||
Complete | Half | None | Complete | Half | None | |||||||
| ||||||||||||
TP | FP | TP | FP | TP | FP | TP | FP | TP | FP | TP | FP | |
nonzero coef ~ unif(0.2,1) | ||||||||||||
New | 16.4 (3.2) | 0.0 (0.2) | 15.1 (1.7) | 2.5 (1.5) | 15.6 (1.8) | 1.8 (1.8) | 9.0 (2.4) | 0.5 (0.9) | 8.6 (2.2) | 0.7 (0.7) | 9.3 (2.1) | 0.7 (0.9) |
New+ | 16.7 (3.2) | 0.0 (0) | 15.0 (1.7) | 1.3 (1.4) | 15.6 (1.8) | 1.9 (1.8) | 10.0 (2.6) | 0.5 (1.0) | 8.6 (2.5) | 0.7 (0.9) | 9.3 (2.1) | 0.7 (0.9) |
alg1-SB | 14.8 (1.7) | 2.5 (1.3) | 14.9 (1.6) | 2.4 (1.5) | 15.3 (1.6) | 2.2 (1.5) | 8.6 (2.3) | 0.7 (1.2) | 8.5 (2.2) | 0.9 (1.0) | 9.0 (2.5) | 0.8 (1.0) |
sgroup-MCP | 16.1 (1.5) | 1.2 (1.4) | 15.0 (1.8) | 2.1 (2.3) | 14.0 (2.1) | 3.1 (2.8) | 11.9 (2.5) | 2.2 (2.6) | 10.2 (2.1) | 2.9 (2.8) | 9.1 (2.1) | 3.6 (3.0) |
group-MCPT | 16.3 (1.5) | 0.5 (2.0) | 15.0 (1.6) | 2.1 (2.5) | 13.8 (2.4) | 27.9 (35.9) | 9.0 (2.0) | 0.3 (0.7) | 8.6 (2.7) | 1.1 (1.9) | 9.3 (4.0) | 6.8 (5.5) |
indiv-SB | 14.4 (1.8) | 0.6 (0.7) | 13.7 (1.7) | 0.3 (0.6) | 14.0 (1.5) | 0.4 (0.6) | 8.1 (2.4) | 0.4 (0.6) | 7.7 (2.6) | 0.3 (0.5) | 8.4 (2.1) | 0.3 (0.6) |
indiv-MCP | 15.5 (1.5) | 17.4 (7.0) | 15.3 (1.6) | 16.8 (6.9) | 15.4 (1.2) | 15.3 (7.1) | 11.8 (1.9) | 15.0 (7.3) | 11.3 (1.8) | 17.2 (6.1) | 12.0 (2.0) | 15.9 (6.3) |
pool-SB | 17.8 (0.8) | 0.0 (0) | 10.4 (1.6) | 3.4 (2.1) | 1.5 (0.6) | 2.9 (1.3) | 16.5 (1.9) | 0.0 (0) | 8.1 (2.0) | 1.0 (1.0) | 1.0 (0.3) | 2.2 (0.5) |
pool-MCP | 17.9 (0.6) | 14.2 (20.6) | 13.6 (1.5) | 30.9 (15.4) | 8.2 (2.1) | 41.8 (18.8) | 17.5 (1.1) | 24.2 (23.4) | 11.4 (2.1) | 28.2 (20.3) | 5.2 (3.0) | 32.1 (27.1) |
| ||||||||||||
nonzero coef = 1 | ||||||||||||
New | 18.0 (0) | 1.8 (1.7) | 18.0 (0) | 1.6 (1.8) | 18.0 (0) | 1.9 (1.6) | 17.6 (0.7) | 0.3 (0.6) | 16.8 (1.2) | 1.1 (1.0) | 17.4 (0.9) | 1.1 (1.3) |
New+ | 18.0 (0) | 0.4 (0.8) | 18.0 (0) | 1.6 (1.8) | 18.0 (0) | 2.1 (1.5) | 17.8 (0.4) | 0.1 (0.3) | 16.9 (1.0) | 0.9 (1.0) | 17.4 (0.9) | 1.1 (1.3) |
alg1-SB | 18.0 (0) | 4.1 (2.0) | 18.0 (0) | 4.1 (2.0) | 18.0 (0) | 4.6 (1.6) | 16.9 (1.5) | 0.9 (1.2) | 16.5 (1.6) | 1.0 (0.9) | 17.3 (1.0) | 1.4 (1.0) |
sgroup-MCP | 18.0 (0) | 0.5 (1.9) | 18.0 (0) | 0.5 (1.0) | 18.0 (0) | 0.8 (1.2) | 17.9 (0.3) | 1.3 (1.7) | 17.4 (0.9) | 2.9 (2.7) | 17.2 (1.1) | 5.0 (3.8) |
group-MCPT | 18.0 (0) | 0.0 (0) | 18.0 (0.1) | 0.0 (0.1) | 18.0 (0.1) | 0.0 (0) | 17.6 (0.7) | 0.0 (0.2) | 16.8 (1.2) | 1.0 (1.4) | 17.0 (1.8) | 110.8 (123.8) |
indiv-SB | 17.9 (0.4) | 1.1 (0.4) | 17.9 (0.4) | 1.1 (0.3) | 17.9 (0.4) | 1.1 (0.3) | 16.3 (1.5) | 0.4 (0.6) | 15.9 (1.3) | 0.3 (0.5) | 16.4 (1.5) | 0.5 (0.6) |
indiv-MCP | 18.0 (0) | 8.1 (4.6) | 18.0 (0) | 9.4 (5.0) | 18.0 (0) | 9.2 (6.2) | 17.7 (0.6) | 21.2 (6.3) | 17.8 (0.6) | 21.5 (6.6) | 17.7 (0.7) | 20.6 (6.5) |
pool-SB | 18.0 (0) | 0.0 (0) | 11.6 (1.1) | 5.3 (2.3) | 1.7 (1.0) | 3.4 (2.0) | 18.0 (0) | 0.0 (0) | 10.5 (0.8) | 2.9 (1.6) | 1.2 (0.4) | 2.5 (0.9) |
pool-MCP | 18.0 (0) | 9.3 (18.7) | 15.6 (1.3) | 39.0 (23.1) | 10.2 (2.6) | 49.6 (22.3) | 18.0 (0) | 11.1 (18.2) | 14.0 (1.3) | 35.2 (21.2) | 8.6 (3.1) | 44.1 (21.8) |
Table 18.
Simulation: summary statistics on identification. In each cell, mean (sd). d = 500 and ρ = 0.5.
σ2 = 1 | σ2 = 3 | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||
Complete | Half | None | Complete | Half | None | |||||||
| ||||||||||||
TP | FP | TP | FP | TP | FP | TP | FP | TP | FP | TP | FP | |
nonzero coef ~ unif(0.2,1) | ||||||||||||
New | 17.1 (1.3) | 0.0 (0.2) | 16.6 (1.1) | 2.0 (1.4) | 17.0 (1.0) | 2.3 (2.0) | 13.9 (2.0) | 0.9 (1.3) | 12.5 (2.4) | 1.3 (1.1) | 13.8 (1.8) | 1.3 (1.1) |
New+ | 17.2 (1.3) | 0.0 (0) | 16.2 (2.7) | 1.7 (1.6) | 17.0 (1.0) | 2.3 (1.9) | 13.7 (2.4) | 0.3 (0.6) | 12.7 (2.2) | 1.2 (1.0) | 13.9 (1.8) | 2.2 (2.0) |
alg1-SB | 17.0 (0.8) | 2.6 (1.7) | 16.1 (1.4) | 2.4 (1.4) | 16.7 (1.2) | 3.0 (1.4) | 13.4 (2.2) | 1.8 (1.1) | 12.3 (2.4) | 1.4 (1.4) | 13.0 (1.5) | 1.7 (1.6) |
sgroup-MCP | 14.4 (1.7) | 1.3 (1.6) | 13.5 (1.8) | 2.1 (2.0) | 12.2 (2.0) | 2.7 (2.4) | 9.8 (1.9) | 1.5 (2.0) | 8.9 (1.8) | 2.3 (2.3) | 8.2 (1.5) | 2.8 (2.4) |
group-MCPT | 17.1 (1.1) | 0.9 (2.4) | 15.6 (1.8) | 17.5 (12.6) | 14.3 (2.3) | 20.4 (13.3) | 13.7 (2.7) | 1.4 (3.8) | 11.4 (2.6) | 11.5 (14.3) | 8.6 (3.0) | 30.4 (63.1) |
indiv-SB | 16.1 (1.2) | 1.4 (0.5) | 16.0 (0.9) | 1.3 (0.6) | 16.0 (1.2) | 1.4 (0.6) | 13.0 (2.0) | 1.5 (0.6) | 11.6 (2.3) | 1.6 (0.9) | 13.0 (1.8) | 1.5 (0.6) |
indiv-MCP | 13.2 (1.2) | 15.3 (5.8) | 13.7 (1.5) | 16.1 (5.6) | 13.2 (1.5) | 16.7 (6.1) | 9.7 (1.2) | 13.9 (6.9) | 9.4 (1.3) | 14.6 (6.2) | 9.5 (1.7) | 13.2 (6.1) |
pool-SB | 17.9 (0.6) | 0.1 (0.6) | 12.7 (1.5) | 8.5 (2.7) | 4.4 (1.7) | 9.0 (3.6) | 16.8 (2.0) | 0.1 (0.6) | 10.6 (2.2) | 6.2 (2.3) | 2.9 (1.7) | 5.9 (3.2) |
pool-MCP | 17.7 (1.2) | 15.0 (12.4) | 12.2 (1.6) | 32.5 (19.6) | 7.2 (1.2) | 33.1 (14.6) | 15.4 (3.0) | 25.2 (17.9) | 9.9 (2.0) | 21.1 (13.8) | 6.3 (1.5) | 30.9 (16.6) |
| ||||||||||||
nonzero coef = 1 | ||||||||||||
New | 18.0 (0) | 0.2 (0.4) | 18.0 (0) | 3.1 (1.4) | 18.0 (0) | 2.8 (1.6) | 18.0 (0) | 0.1 (0.4) | 17.9 (0.3) | 1.5 (1.2) | 18.0 (0) | 1.6 (1.2) |
New+ | 18.0 (0) | 0.1 (0.6) | 18.0 (0) | 1.9 (1.3) | 18.0 (0) | 2.9 (1.7) | 18.0 (0) | 0.0 (0) | 17.4 (2.2) | 1.2 (1.3) | 18.0 (0) | 1.6 (1.2) |
alg1-SB | 18.0 (0) | 2.3 (1.6) | 18.0 (0) | 3.5 (1.5) | 18.0 (0) | 3.2 (1.4) | 17.9 (0.3) | 2.0 (1.3) | 17.9 (0.3) | 2.4 (1.3) | 17.9 (0.3) | 1.9 (1.4) |
sgroup-MCP | 18.0 (0.1) | 0.3 (0.7) | 18.0 (0.2) | 0.6 (1.1) | 17.9 (0.4) | 1.5 (2.2) | 16.8 (1.3) | 2.2 (1.9) | 15.9 (1.6) | 3.5 (2.7) | 14.0 (2.5) | 5.0 (4.3) |
group-MCPT | 18.0 (0) | 0.0 (0.2) | 18.0 (0) | 0.0 (0.2) | 18.0 (0) | 0.8 (1.1) | 18.0 (0) | 0.5 (1.5) | 16.8 (1.3) | 8.3 (7.4) | 16.2 (2.5) | 31.5 (21.0) |
indiv-SB | 18.0 (0) | 1.1 (0.3) | 18.0 (0) | 1.1 (0.4) | 18.0 (0) | 1.1 (0.3) | 17.7 (0.5) | 1.3 (0.6) | 17.5 (0.6) | 1.4 (0.6) | 17.8 (0.5) | 1.4 (0.6) |
indiv-MCP | 18.0 (0) | 11.2 (4.3) | 17.9 (0.6) | 10.0 (5.2) | 18.0 (0) | 11.7 (5.7) | 14.5 (1.7) | 21.4 (9.1) | 15.3 (1.3) | 21.9 (6.8) | 14.4 (1.0) | 20.2 (7.5) |
pool-SB | 18.0 (0) | 0.0 (0) | 14.6 (1.3) | 11.3 (2.5) | 5.9 (2.0) | 11.8 (3.9) | 18.0 (0) | 0.0 (0) | 13.8 (1.3) | 9.6 (2.6) | 4.6 (2.0) | 9.4 (3.9) |
pool-MCP | 18.0 (0) | 4.5 (8.8) | 13.8 (0.9) | 28.5 (14.4) | 7.8 (1.3) | 37.0 (17.6) | 18.0 (0) | 12.8 (16.1) | 12.9 (1.1) | 32.7 (14.5) | 7.4 (1.4) | 33.5 (19.1) |
Table 19.
Simulation: summary statistics on identification. In each cell, mean (sd). d = 500 and ρ = 0.8.
σ2 = 1 | σ2 = 3 | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||
Complete | Half | None | Complete | Half | None | |||||||
| ||||||||||||
TP | FP | TP | FP | TP | FP | TP | FP | TP | FP | TP | FP | |
nonzero coef ~ unif(0.2,1) | ||||||||||||
New | 17.3 (0.8) | 1.2 (1.1) | 16.8 (1.0) | 3.6 (1.3) | 16.8 (1.0) | 2.6 (1.6) | 15.4 (1.9) | 0.7 (0.7) | 14.0 (1.8) | 2.9 (1.5) | 15.0 (1.5) | 1.9 (1.6) |
New+ | 17.2 (0.8) | 0.5 (0.9) | 16.8 (1.0) | 2.6 (1.4) | 16.8 (1.0) | 1.6 (1.5) | 15.3 (2.4) | 0.3 (0.6) | 14.1 (1.8) | 1.7 (1.4) | 14.9 (1.5) | 0.9 (1.2) |
alg1-SB | 16.6 (1.0) | 2.4 (1.3) | 17.0 (1.0) | 3.8 (1.9) | 17.1 (0.8) | 2.4 (1.3) | 14.4 (1.3) | 2.0 (1.4) | 13.8 (1.5) | 3.0 (1.7) | 14.3 (2.0) | 2.1 (1.2) |
sgroup-MCP | 11.1 (1.9) | 1.4 (1.7) | 10.3 (1.7) | 2.0 (1.9) | 9.2 (1.6) | 2.2 (2.0) | 7.3 (1.3) | 1.3 (1.8) | 7.0 (1.3) | 2.3 (2.4) | 6.3 (1.3) | 2.3 (2.8) |
group-MCPT | 17.2 (1.5) | 1.8 (5.2) | 15.6 (1.6) | 18.2 (9.4) | 14.6 (2.2) | 34.5 (10.8) | 14.0 (3.1) | 2.0 (4.9) | 11.3 (2.6) | 9.8 (7.7) | 8.8 (2.3) | 19.4 (8.7) |
indiv-SB | 16.5 (1.2) | 0.9 (0.8) | 16.4 (1.0) | 1.7 (1.1) | 16.7 (1.0) | 1.0 (0.9) | 14.2 (1.5) | 0.9 (0.9) | 13.6 (1.6) | 1.9 (1.1) | 14.1 (1.6) | 1.2 (1.0) |
indiv-MCP | 8.8 (0.9) | 11.0 (6.2) | 8.5 (1.0) | 10.5 (5.8) | 8.7 (0.8) | 12.0 (6.3) | 6.3 (1.2) | 10.7 (5.5) | 6.3 (0.8) | 9.7 (5.5) | 6.2 (1.1) | 11.1 (6.1) |
pool-SB | 17.9 (0.6) | 0.4 (1.0) | 14.2 (1.5) | 11.3 (2.3) | 8.5 (1.5) | 17.0 (3.1) | 17.3 (1.5) | 0.3 (0.9) | 12.7 (2.0) | 8.8 (2.5) | 7.0 (1.7) | 14.3 (3.3) |
pool-MCP | 11.5 (1.9) | 9.4 (8.2) | 8.2 (1.8) | 24.9 (14.4) | 4.3 (0.7) | 26.6 (18.8) | 9.6 (1.2) | 16.8 (16.3) | 6.9 (1.9) | 19.9 (13.8) | 3.9 (0.8) | 23.4 (17.8) |
| ||||||||||||
nonzero coef = 1 | ||||||||||||
New | 18.0 (0) | 0.1 (0.3) | 18.0 (0) | 4.3 (1.8) | 18.0 (0) | 2.5 (1.2) | 18.0 (0.2) | 0.8 (1.0) | 17.8 (0.4) | 1.6 (1.4) | 17.8 (0.5) | 1.0 (1.4) |
New+ | 18.0 (0) | 0.0 (0) | 18.0 (0) | 2.0 (1.8) | 18.0 (0) | 1.4 (1.1) | 18.0 (0) | 0.1 (0.4) | 17.8 (0.4) | 1.3 (1.4) | 17.8 (0.5) | 1.0 (1.4) |
alg1-SB | 18.0 (0) | 1.9 (1.2) | 18.0 (0) | 4.4 (1.8) | 18.0 (0) | 3.2 (1.4) | 17.6 (0.6) | 2.3 (1.4) | 17.7 (0.5) | 3.9 (1.8) | 17.3 (0.8) | 2.1 (1.1) |
sgroup-MCP | 16.8 (1.6) | 1.1 (1.6) | 15.8 (2.0) | 1.9 (2.0) | 14.1 (2.1) | 2.9 (2.6) | 11.1 (2.2) | 1.5 (2.2) | 10.5 (1.5) | 2.1 (2.6) | 9.2 (1.3) | 2.1 (2.2) |
group-MCPT | 18.0 (0) | 0.4 (1.2) | 17.8 (0.5) | 6.0 (2.9) | 17.8 (0.5) | 37.0 (9.4) | 17.9 (0.4) | 3.8 (9.3) | 16.0 (1.6) | 17.3 (7.4) | 15.1 (2.3) | 32.9 (13.6) |
indiv-SB | 18.0 (0) | 0.8 (0.7) | 18.0 (0) | 1.5 (1.1) | 18.0 (0) | 0.9 (0.8) | 17.5 (0.7) | 0.8 (0.8) | 17.2 (0.7) | 1.6 (1.1) | 17.4 (0.8) | 1.1 (1.1) |
indiv-MCP | 11.0 (1.1) | 9.9 (4.7) | 12.3 (1.5) | 13.0 (5.4) | 11.2 (0.9) | 10.2 (4.2) | 9.0 (0.6) | 11.4 (5.7) | 9.0 (1.1) | 13.8 (6.2) | 8.8 (0.8) | 12.9 (6.4) |
pool-SB | 18.0 (0) | 0.1 (0.6) | 15.7 (1.2) | 13.7 (2.6) | 9.4 (1.7) | 19.0 (3.7) | 18.0 (0) | 0.3 (0.9) | 14.8 (1.4) | 11.8 (2.6) | 8.5 (1.8) | 17.2 (3.7) |
pool-MCP | 13.8 (2.0) | 0.4 (1.0) | 9.9 (1.6) | 31.0 (17.9) | 4.7 (0.6) | 27.3 (15.7) | 12.8 (2.2) | 12.6 (8.1) | 9.0 (1.9) | 29.5 (21.4) | 4.4 (1.0) | 27.8 (18.7) |
Table 20.
Simulation: summary statistics on identification. In each cell, mean (sd). d = 1,000 and ρ = 0.5.
σ2 = 1 | σ2 = 3 | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||
Complete | Half | None | Complete | Half | None | |||||||
| ||||||||||||
TP | FP | TP | FP | TP | FP | TP | FP | TP | FP | TP | FP | |
nonzero coef ~ unif(0.2,1) | ||||||||||||
New | 17.3 (1.4) | 0.0 (0.2) | 16.4 (1.3) | 2.4 (1.4) | 16.8 (1.0) | 2.4 (1.4) | 13.2 (1.9) | 0.5 (0.7) | 11.5 (2.1) | 1.3 (1.6) | 12.9 (2.2) | 1.3 (1.0) |
New+ | 17.5 (1.4) | 0.0 (0) | 16.3 (1.5) | 1.9 (1.3) | 16.8 (1.0) | 2.4 (1.4) | 13.4 (2.4) | 0.3 (0.6) | 11.4 (2.3) | 1.4 (1.7) | 12.9 (2.2) | 1.3 (1.0) |
alg1-SB | 16.3 (1.3) | 2.5 (1.6) | 15.8 (1.5) | 2.5 (1.4) | 16.8 (1.0) | 3.3 (2.0) | 12.9 (2.3) | 1.2 (1.3) | 11.5 (2.5) | 1.8 (1.2) | 12.5 (1.8) | 1.6 (1.3) |
sgroup-MCP | 13.9 (1.9) | 1.4 (1.9) | 13.1 (1.7) | 2.3 (2.8) | 11.9 (2.0) | 3.5 (3.4) | 9.4 (2.0) | 2.0 (2.8) | 8.4 (1.8) | 2.1 (3.1) | 7.5 (1.4) | 2.2 (2.5) |
group-MCPT | 17.3 (1.3) | 3.2 (10.4) | 15.8 (1.7) | 23.6 (18.4) | 14.9 (2.2) | 45.5 (29.1) | 13.2 (2.7) | 2.3 (4.7) | 11.4 (2.8) | 10.6 (14.6) | 9.7 (4.2) | 65.2 (115.0) |
indiv-SB | 14.8 (1.4) | 0.7 (0.9) | 15.2 (1.6) | 0.9 (0.8) | 15.8 (1.5) | 1.9 (0.9) | 12.1 (2.1) | 0.7 (0.7) | 10.9 (2.4) | 0.8 (0.7) | 12.4 (2.1) | 1.6 (0.7) |
indiv-MCP | 12.5 (1.4) | 16.5 (7.1) | 12.9 (1.6) | 14.6 (6.4) | 12.8 (1.7) | 16.5 (7.5) | 9.3 (1.6) | 13.6 (8.0) | 9.5 (1.5) | 13.6 (6.8) | 9.5 (1.3) | 14.0 (6.4) |
pool-SB | 17.8 (0.8) | 0.1 (0.6) | 12.6 (1.7) | 8.3 (2.6) | 3.9 (1.3) | 7.8 (2.6) | 16.8 (1.5) | 0.1 (0.6) | 10.3 (2.0) | 4.7 (2.5) | 1.9 (1.1) | 3.9 (2.1) |
pool-MCP | 17.4 (1.2) | 24.4 (19.9) | 12.0 (1.7) | 31.8 (21.8) | 6.3 (1.2) | 38.1 (20.7) | 15.1 (2.6) | 37.9 (27.4) | 10.5 (1.9) | 31.9 (21.1) | 5.0 (1.3) | 30.5 (19.0) |
| ||||||||||||
nonzero coef = 1 | ||||||||||||
New | 18.0 (0) | 0.0 (0) | 18.0 (0) | 3.0 (1.5) | 18.0 (0) | 2.9 (1.4) | 18.0 (0.2) | 0.5 (0.6) | 17.9 (0.4) | 1.6 (1.0) | 17.9 (0.3) | 2.2 (1.2) |
New+ | 18.0 (0) | 0.0 (0) | 18.0 (0) | 2.0 (1.6) | 18.0 (0) | 2.9 (1.4) | 18.0 (0.2) | 0.1 (0.3) | 17.8 (0.4) | 1.5 (1.1) | 17.9 (0.3) | 2.2 (1.2) |
alg1-SB | 18.0 (0) | 3.1 (1.5) | 18.0 (0) | 3.7 (1.5) | 18.0 (0) | 3.2 (1.4) | 17.9 (0.3) | 1.7 (1.0) | 17.7 (0.7) | 2.2 (1.5) | 17.9 (0.3) | 2.2 (1.2) |
sgroup-MCP | 18.0 (0.1) | 0.2 (0.5) | 18.0 (0.2) | 0.8 (1.2) | 17.9 (0.3) | 1.4 (1.7) | 16.2 (1.8) | 2.2 (2.2) | 14.8 (1.9) | 3.6 (3.3) | 13.4 (2.6) | 5.8 (4.6) |
group-MCPT | 18.0 (0) | 0.0 (0.1) | 18.0 (0) | 0.4 (0.8) | 18.0 (0) | 0.6 (1.1) | 18.0 (0.2) | 0.9 (3.1) | 16.5 (1.5) | 22.2 (22.2) | 16.6 (2.2) | 48.9 (30.0) |
indiv-SB | 18.0 (0) | 1.3 (0.5) | 18.0 (0) | 1.4 (0.6) | 18.0 (0) | 1.4 (0.6) | 17.6 (0.6) | 1.7 (0.8) | 17.2 (0.9) | 2.0 (1.0) | 17.6 (0.6) | 1.7 (0.8) |
indiv-MCP | 18.0 (0) | 13.9 (5.4) | 18.0 (0) | 13.7 (6.4) | 18.0 (0) | 14.7 (6.0) | 13.1 (1.6) | 16.9 (7.2) | 13.9 (1.4) | 20.0 (9.7) | 13.5 (1.6) | 19.3 (8.4) |
pool-SB | 18.0 (0) | 0.0 (0) | 14.6 (1.3) | 11.4 (2.6) | 4.8 (1.4) | 9.7 (2.7) | 18.0 (0) | 0.0 (0) | 13.5 (1.1) | 9.0 (2.3) | 3.6 (1.5) | 7.1 (2.9) |
pool-MCP | 18.0 (0) | 7.9 (15.3) | 14.0 (0.7) | 35.7 (23.8) | 7.3 (1.4) | 36.3 (20.0) | 18.0 (0) | 15.7 (15.8) | 12.4 (1.7) | 36.9 (23.5) | 6.9 (1.2) | 37.0 (20.6) |
Table 21.
Simulation: summary statistics on identification. In each cell, mean (sd). d = 1,000 and ρ = 0.8.
σ2 = 1 | σ2 = 3 | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|
| ||||||||||||
Complete | Half | None | Complete | Half | None | |||||||
| ||||||||||||
TP | FP | TP | FP | TP | FP | TP | FP | TP | FP | TP | FP | |
nonzero coef ~ unif(0.2,1) | ||||||||||||
New | 17.5 (0.6) | 1.2 (1.4) | 16.9 (1.0) | 2.5 (2.1) | 16.6 (0.9) | 1.7 (1.9) | 15.3 (2.0) | 0.5 (0.7) | 14.2 (1.7) | 2.0 (1.9) | 14.7 (1.2) | 1.3 (1.1) |
New+ | 17.5 (0.6) | 0.4 (0.8) | 16.9 (1.0) | 2.8 (2.2) | 16.6 (0.9) | 1.7 (1.9) | 15.3 (2.0) | 0.1 (0.4) | 14.2 (1.7) | 1.5 (1.8) | 14.7 (1.2) | 1.3 (1.1) |
alg1-SB | 16.6 (1.1) | 2.4 (1.5) | 16.4 (1.1) | 4.2 (1.9) | 16.5 (1.3) | 2.7 (1.3) | 13.5 (1.6) | 2.1 (1.1) | 13.5 (1.8) | 2.6 (1.5) | 14.1 (1.7) | 2.6 (1.6) |
sgroup-MCP | 10.7 (2.1) | 1.3 (1.8) | 9.9 (1.6) | 2.1 (2.0) | 8.8 (1.5) | 2.3 (2.3) | 7.0 (1.2) | 0.9 (1.3) | 6.6 (1.2) | 2.5 (2.8) | 5.9 (1.0) | 2.1 (2.2) |
group-MCPT | 17.5 (1.5) | 4.4 (11.1) | 15.4 (1.7) | 19.6 (12.9) | 14.4 (2.5) | 36.7 (13.6) | 13.4 (3.3) | 2.2 (6.2) | 10.7 (2.6) | 10.3 (11.1) | 8.0 (2.4) | 15.3 (9.6) |
indiv-SB | 16.5 (1.0) | 1.2 (1.2) | 16.0 (1.4) | 2.0 (1.2) | 16.7 (0.9) | 1.6 (1.1) | 13.9 (1.4) | 1.1 (1.2) | 13.2 (1.6) | 1.9 (1.3) | 13.7 (1.2) | 1.3 (1.1) |
indiv-MCP | 8.4 (1.1) | 12.4 (4.7) | 8.1 (0.9) | 12.4 (7.5) | 8.0 (1.1) | 12.6 (6.3) | 6.1 (1.2) | 11.9 (6.5) | 6.1 (0.8) | 11.0 (6.1) | 5.9 (0.9) | 13.5 (7.5) |
pool-SB | 17.9 (0.6) | 0.8 (1.6) | 14.8 (1.6) | 12.5 (2.6) | 8.8 (1.7) | 18.0 (3.2) | 17.2 (1.4) | 0.8 (1.6) | 12.5 (2.2) | 9.6 (2.7) | 7.6 (1.4) | 15.5 (2.6) |
pool-MCP | 10.9 (2.0) | 10.9 (13.0) | 7.8 (1.7) | 26.2 (18.7) | 4.1 (0.7) | 29.8 (23.2) | 8.8 (1.8) | 17.3 (15.2) | 6.5 (1.6) | 23.8 (18.0) | 3.9 (0.7) | 21.5 (16.1) |
| ||||||||||||
nonzero coef = 1 | ||||||||||||
New | 18.0 (0) | 0.4 (0.6) | 18.0 (0) | 2.4 (1.3) | 18.0 (0) | 1.7 (1.5) | 17.9 (0.3) | 0.8 (1.2) | 17.6 (0.6) | 3.3 (1.8) | 17.7 (0.7) | 1.3 (1.4) |
New+ | 18.0 (0) | 0.0 (0) | 18.0 (0) | 2.2 (1.2) | 18.0 (0) | 1.7 (1.6) | 17.9 (0.3) | 0.2 (0.5) | 17.6 (0.6) | 1.8 (1.6) | 17.7 (0.7) | 1.3 (1.4) |
alg1-SB | 18.0 (0) | 2.6 (1.6) | 18.0 (0) | 5.0 (1.7) | 18.0 (0) | 2.5 (1.4) | 17.6 (0.7) | 2.2 (1.1) | 17.6 (0.8) | 3.4 (1.3) | 17.4 (0.9) | 2.3 (1.4) |
sgroup-MCP | 16.5 (1.7) | 1.2 (1.6) | 15.0 (1.8) | 2.0 (1.7) | 13.3 (2.4) | 3.0 (2.7) | 10.5 (1.7) | 1.2 (1.8) | 9.9 (1.5) | 1.9 (1.9) | 8.8 (1.3) | 2.3 (2.4) |
group-MCPT | 18.0 (0) | 0.7 (2.1) | 17.9 (0.5) | 21.2 (8.2) | 17.9 (0.4) | 16.8 (5.9) | 17.8 (1.2) | 4.4 (8.3) | 16.0 (1.7) | 20.5 (11.8) | 15.1 (2.6) | 35.4 (14.5) |
indiv-SB | 18.0 (0) | 1.1 (1.1) | 18.0 (0) | 1.8 (1.4) | 18.0 (0) | 1.4 (1.1) | 17.6 (0.6) | 1.1 (1.0) | 17.3 (0.9) | 1.8 (1.1) | 17.5 (0.8) | 1.4 (1.1) |
indiv-MCP | 10.8 (1.1) | 11.0 (5.4) | 10.5 (1.6) | 13.4 (6.5) | 10.5 (0.9) | 12.3 (6.4) | 8.7 (0.7) | 15.0 (6.1) | 8.1 (1.1) | 14.4 (5.5) | 8.7 (0.7) | 13.9 (7.2) |
pool-SB | 18.0 (0) | 0.7 (1.5) | 16.1 (1.0) | 14.6 (2.2) | 9.6 (1.1) | 19.4 (2.2) | 18.0 (0) | 0.7 (1.5) | 15.2 (1.0) | 12.7 (2.0) | 9.2 (1.4) | 18.7 (2.8) |
pool-MCP | 13.0 (1.6) | 1.6 (2.8) | 9.3 (1.7) | 32.4 (18.4) | 4.3 (0.8) | 32.7 (24.9) | 12.3 (1.8) | 16.6 (16.2) | 8.0 (1.3) | 29.8 (19.5) | 4.2 (0.6) | 30.8 (22.2) |
Acknowledgments
We thank the associate editor and reviewer for their careful review and insightful comments, which have led to a significant improvement of the manuscript. This study was partly supported by CA142774 and CA016359 from NIH; VA Cooperative Studies Program of the Department of Veterans Affairs, Office of Research and Development; Duke-NUS Graduate Medical School WBS: R-913-200-098-263; National Social Science Foundation of China (13&ZD148, 13CTJ001); and National Natural Science Foundation of China (71471152, 71201139, 71301162).
Appendix
A. Estimation under the AFT model
For survival data, we consider the AFT (accelerated failure time) model. We note that some alternative models, especially the Cox model, have been more popular for “classic” low-dimensional data. With high-dimensional data, the simple form, low computational cost, and lucid interpretation of the AFT model make it especially attractive. This model has been adopted in multiple genetic and genomic studies. Denote T as the logarithm of failure time. The AFT model assumes that

T = α + β′X + ε,

where α is the unknown intercept, X is the length-d vector of covariates, β is the vector of unknown regression coefficients, and ε is the random error. Under right censoring, denote C as the logarithm of censoring time. We observe (Y = min(T, C), δ = I(T ≤ C), X). Assume n i.i.d. observations.
When the distribution of ε is known, the parametric likelihood function can be easily constructed. Here we consider the more flexible case where this distribution is unknown. The weighted least squares estimator first proposed by Stute [26] is adopted, as it has statistical properties comparable to, but computational cost lower than, alternatives such as the Buckley–James and rank-based approaches.
Let F̂ be the Kaplan–Meier estimator of the distribution function F of T. It can be written as F̂(y) = Σi ωi I(Y(i) ≤ y), where the ωi’s can be computed as

ω1 = δ(1)/n, ωi = (δ(i)/(n − i + 1)) ∏j=1,…,i−1 ((n − j)/(n − j + 1))^δ(j), i = 2, …, n.

Y(1) ≤ · · · ≤ Y(n) are the order statistics of the Yi’s, and δ(1), …, δ(n) are the associated censoring indicators. Denote X(i) as the covariate vector associated with (Y(i), δ(i)).

The weighted least squares estimator minimizes (1/2) Σi ωi (Y(i) − α − β′X(i))². We center X(i) and Y(i) using their ωi-weighted means, respectively. Define

X̃(i) = √ωi (X(i) − Σj ωj X(j)/Σj ωj) and Ỹ(i) = √ωi (Y(i) − Σj ωj Y(j)/Σj ωj).

With the weighted centered values, the intercept is zero. The weighted least squares objective function can be rewritten as

R(β) = (1/2) Σi (Ỹ(i) − β′X̃(i))².
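For concreteness, the weight computation and the weighted centering can be sketched in a few lines of code. The following is a minimal Python illustration; the function names (stute_weights, weighted_centered) are ours, introduced only for this sketch.

```python
import numpy as np

def stute_weights(y, delta):
    """Kaplan-Meier (Stute) weights for the weighted least squares AFT
    estimator; y = observed log-times, delta = event indicators (1 = event).
    Returns the sort order and the weights aligned with that order."""
    n = len(y)
    order = np.argsort(y, kind="stable")
    d = delta[order].astype(float)
    w = np.empty(n)
    prod = 1.0  # running value of prod_{j<i} ((n-j)/(n-j+1))^{delta_(j)}
    for i in range(n):  # 0-based index; rank i+1 in the formulas above
        w[i] = prod * d[i] / (n - i)
        prod *= ((n - i - 1) / (n - i)) ** d[i]
    return order, w

def weighted_centered(y, delta, X):
    """omega-weighted centering so the intercept drops out of the
    weighted least squares objective."""
    order, w = stute_weights(y, delta)
    ys, Xs = y[order], X[order]
    ybar = np.sum(w * ys) / np.sum(w)
    xbar = np.sum(w[:, None] * Xs, axis=0) / np.sum(w)
    return np.sqrt(w) * (ys - ybar), np.sqrt(w)[:, None] * (Xs - xbar)
```

The estimator is then an ordinary least squares solution on the transformed data, e.g., y_t, X_t = weighted_centered(y, delta, X) followed by np.linalg.lstsq(X_t, y_t, rcond=None)[0].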
B. Boosting with tree-based weak learners
In the algorithm described in the main text, linear weak learners are considered. Boosting is a flexible tool and can also accommodate nonlinear weak learners; as an example, we consider tree-based weak learners. Boosting survival trees has been investigated in multiple published studies. Here we apply the proposed penalty, which promotes similarity in model sparsity structure, to boosting survival trees. Following the literature, we use simple weak learners in which each boosting step considers a single variable for growing the tree. A popular choice is the stump, a tree learner with two terminal nodes [9]. We allow a user-specified maximum depth for growing a tree based on a single variable. The detailed algorithm is as follows.
Algorithm 2 (tree): Sparse boosting for integrative analysis with tree-based weak learners
Step 1: Initialization.
Initialize k = 0 and S[0] = 0M×d. For m = 1, …, M, initialize the fit f̂m[0] = 0 and the residual um[0] = Ỹm.
Step 2: Fit and update. k = k + 1. For m = 1, …, M:
Compute ŝ = argmins {‖(Inm − Hm,s)um[k−1]‖² + pen(Hm,s) + pens(S[k−1],s)}. Here Hm,s um[k−1] denotes the fitted tree model based on Xm,s with a pre-specified depth, and the residual sum of squares is ‖(Inm − Hm,s)um[k−1]‖², where Inm is the nm × nm identity matrix and Hm,s is an nm × nm symmetric matrix whose (i, j)th element equals one over the number of subjects in the same terminal node as the ith subject if the ith and jth subjects are in the same terminal node, and zero otherwise. The matrix S[k−1],s takes the same values as S[k−1] except that the (m, s)th element is 1. To calculate pen(·), trace(Hm,s) is used as the degrees of freedom [9]. pens(·) is calculated in the same way as in Algorithm 1.
Update. Set the (m, ŝ)th element of S[k−1] to 1 and obtain S[k]. Let f̂m[k] = f̂m[k−1] + ν Hm,ŝ um[k−1] and um[k] = um[k−1] − ν Hm,ŝ um[k−1], where ν is the step size as in Algorithm 1.
Step 3: Iteration. Repeat Step 2 K times, where K is a large number.
Step 4: Selection of optimal stopping. At iteration k (= 1, …, K), compute the stopping criterion F(k). Select the optimal number of iterations as k̂ = argmin1≤k≤K F(k).
When using tree-based weak learners, unlike in the main text, there is no well-defined regression coefficient β. To accommodate this difference, modifications are made to Step 2: specifically, Hm,s is introduced for calculating pen(·), and S[k] is introduced to record selection for calculating pens(·).
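To make the role of Hm,s concrete, the sketch below constructs the hat matrix of a stump grown on a single covariate, with the split chosen by exhaustive search over observed values. This is an illustrative simplification under our own assumptions, not the authors' implementation.

```python
import numpy as np

def stump_hat_matrix(x, u):
    """Hat matrix H of a two-terminal-node regression tree (stump) fit
    to the residual u using the single covariate x: H[i, j] = 1/|node|
    if subjects i and j share a terminal node, and 0 otherwise."""
    n = len(x)
    best_rss, best_left = np.inf, None
    for c in np.unique(x)[:-1]:  # candidate split points
        left = x <= c
        rss = (np.sum((u[left] - u[left].mean()) ** 2)
               + np.sum((u[~left] - u[~left].mean()) ** 2))
        if rss < best_rss:
            best_rss, best_left = rss, left
    if best_left is None:        # x is constant: a single terminal node
        return np.full((n, n), 1.0 / n)
    H = np.zeros((n, n))
    for node in (best_left, ~best_left):
        idx = np.where(node)[0]
        H[np.ix_(idx, idx)] = 1.0 / len(idx)
    return H  # trace(H) = number of terminal nodes = 2
```

Because a regression tree is a linear smoother, the fitted values equal H u, and trace(H) counts the terminal nodes, which serves as the degrees of freedom entering pen(·).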
We conduct simulation to assess the performance of the proposed strategy with tree-based weak learners. Specifically, we generate M = 3 independent datasets. In each dataset, the sample size is nm = 100, and the dimension is d = 1,000. Event times are generated from the model T = α + β′X2 + ε, where X is the length-d vector of covariates following a multivariate normal distribution, and β is the vector of regression coefficients. Here we consider normal errors with zero mean and unit variance. The specifications for the multivariate normal distribution, β, and the other settings are the same as in Section 3.
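For illustration, one such dataset could be generated as in the following sketch. The AR(1)-type correlation, the particular nonzero coefficients, and the censoring mechanism here are placeholder assumptions made only for this sketch; the exact specifications are those of Section 3.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, rho = 100, 1000, 0.5       # one of rho = 0.2, 0.5, 0.8
# AR(1)-type correlation among covariates (assumed form for illustration)
idx = np.arange(d)
L = np.linalg.cholesky(rho ** np.abs(idx[:, None] - idx[None, :]))
X = rng.standard_normal((n, d)) @ L.T
beta = np.zeros(d)
beta[:6] = 1.0                   # illustrative nonzero coefficients
T = 1.0 + (X ** 2) @ beta + rng.standard_normal(n)  # log event times
C = rng.normal(np.quantile(T, 0.75), 1.0, size=n)   # random censoring
y, delta = np.minimum(T, C), (T <= C).astype(int)
```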
For comparison, we also consider two alternative methods: (a) alg1-SB, which applies Algorithm 1 with tree-based weak learners, and (b) indiv-SB, which applies sparse boosting with tree-based weak learners to each dataset separately. The summary identification results are shown in Table 22 (Appendix). The overall observed patterns are similar to those with linear weak learners: the proposed strategy has obvious advantages under the complete overlapping scenario and competitive performance under the half and none overlapping scenarios.
C. Boosting under the Cox model
For censored survival data, the most popular model is the Cox model. With high-dimensional data, however, it has a higher computational cost than the AFT model, making it less attractive. Below we describe applying the proposed strategy under the Cox model.
The Cox model assumes that

λ(T|X) = λ0(T) exp(β′X),

where λ0(T) is the baseline hazard function. Consider dataset m with nm i.i.d. observations (Ym,i, δm,i, Xm,i) for i = 1, …, nm. Here Ym,i denotes the observed time, δm,i denotes the event indicator, and Xm,i denotes the covariates. Denote fm(X) = β′mX. The log partial likelihood function is

log PLm(fm) = Σi=1,…,nm δm,i [fm(Xm,i) − log{Σj: Ym,j ≥ Ym,i exp(fm(Xm,j))}].

Consider the loss function Rm(fm) = −log PLm(fm).
Algorithm 2 (Cox): Sparse boosting for integrative analysis under the Cox model
Step 1: Initialization. The same as in Algorithms 1 and 2.
Step 2: Fit and update. k = k + 1. For m = 1, …, M:
Obtain the working response um,i = δm,i − exp(f̂m[k−1](Xm,i)) Σj: δm,j=1, Ym,j ≤ Ym,i [1/Σl: Ym,l ≥ Ym,j exp(f̂m[k−1](Xm,l))], the negative gradient of Rm evaluated at the current fit f̂m[k−1].
For each j = 1, …, d, fit a linear regression of the working response um on Xm,j and obtain the coefficient γ̂j. Select ŝ that minimizes the residual sum of squares plus pen(·) and pens(·). The pen(·) and pens(·) are defined in the same way as in Algorithm 1.
Update f̂m[k] = f̂m[k−1] + ν γ̂ŝ Xm,ŝ, where ν is the step size, and set the (m, ŝ)th element of S[k−1] to 1 to obtain S[k].
Step 3: Iteration. Repeat Step 2 K times, where K is a large number.
Step 4: Selection of optimal stopping. At iteration k (= 1, …, K), compute the stopping criterion F(k). Select the optimal number of iterations as k̂ = argmin1≤k≤K F(k).
The above algorithm is very similar to that for the AFT model. To accommodate the Cox model, we follow Ridgeway [25] and construct a working response variable.
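A direct, unoptimized transcription of this working response (the function name is ours, introduced for this sketch) is as follows.

```python
import numpy as np

def cox_working_response(y, delta, f):
    """Negative gradient of the negative log partial likelihood at the
    current fit f (a martingale-type residual): for each subject i,
    u_i = delta_i - exp(f_i) * sum over events j with y_j <= y_i of
          1 / sum_{l: y_l >= y_j} exp(f_l)."""
    ef = np.exp(f)
    u = np.empty(len(y))
    for i in range(len(y)):
        cum_hazard = sum(1.0 / ef[y >= y[j]].sum()
                         for j in range(len(y))
                         if delta[j] == 1 and y[j] <= y[i])
        u[i] = delta[i] - ef[i] * cum_hazard
    return u
```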
We also analyze the breast cancer data under the Cox model. The identification and estimation results are provided in Table 23 (Appendix). Different sets of genes are identified under the AFT and Cox models. With high-dimensional data, model diagnostics and specifying model forms are very challenging and have not been carefully investigated. We leave the selection between AFT and Cox models to future research. Under the Cox model, the overlapping percentages (which are calculated in the same manner as under the AFT model) are 0.35 (Alt.1), 0.33 (alg1-SB), 0.43 (New), 0.46 (New+), and 0.33 (indiv-SB), respectively. Prediction performance is assessed in the same way as under the AFT model. The average logrank statistics are 3.76 (Alt.1), 3.53 (alg1-SB), 5.03 (New), 5.03 (New+), 3.58 (indiv-SB), and 4.04 (pool-SB), respectively. The proposed approach leads to improved prediction.
D. Additional tables and figures
Figure 2.
Plots of mean TP and FP for simulations with d = 100, ρ = 0.2, and σ2 = 1. The circles stand for TP and black squares stand for FP.
Figure 3.
Plots of mean TP and FP for simulations with d = 100, ρ = 0.5, and σ2 = 1. The circles stand for TP and black squares stand for FP.
Figure 4.
Plots of mean TP and FP for simulations with d = 100, ρ = 0.8, and σ2 = 1. The circles stand for TP and black squares stand for FP.
Figure 5.
Plots of mean TP and FP for simulations with d = 500, ρ = 0.2, and σ2 = 1. The circles stand for TP and black squares stand for FP.
Figure 6.
Plots of mean TP and FP for simulations with d = 500, ρ = 0.5, and σ2 = 1. The circles stand for TP and black squares stand for FP.
Figure 7.
Plots of mean TP and FP for simulations with d = 500, ρ = 0.8, and σ2 = 1. The circles stand for TP and black squares stand for FP.
Figure 8.
Plots of mean TP and FP for simulations with d = 1,000, ρ = 0.5, and σ2 = 1. The circles stand for TP and black squares stand for FP.
Figure 9.
Plots of mean TP and FP for simulations with d = 1,000, ρ = 0.8, and σ2 = 1. The circles stand for TP and black squares stand for FP.
Figure 10.
Plots of mean TP and FP for simulations with d = 100, ρ = 0.2, and σ2 = 3. The circles stand for TP and black squares stand for FP.
Figure 11.
Plots of mean TP and FP for simulations with d = 100, ρ = 0.5, and σ2 = 3. The circles stand for TP and black squares stand for FP.
Figure 12.
Plots of mean TP and FP for simulations with d = 100, ρ = 0.8, and σ2 = 3. The circles stand for TP and black squares stand for FP.
Figure 13.
Plots of mean TP and FP for simulations with d = 500, ρ = 0.2, and σ2 = 3. The circles stand for TP and black squares stand for FP.
Figure 14.
Plots of mean TP and FP for simulations with d = 500, ρ = 0.5, and σ2 = 3. The circles stand for TP and black squares stand for FP.
Figure 15.
Plots of mean TP and FP for simulations with d = 500, ρ = 0.8, and σ2 = 3. The circles stand for TP and black squares stand for FP.
Figure 16.
Plots of mean TP and FP for simulations with d = 1,000, ρ = 0.2, and σ2 = 3. The circles stand for TP and black squares stand for FP.
Figure 17.
Plots of mean TP and FP for simulations with d = 1,000, ρ = 0.5, and σ2 = 3. The circles stand for TP and black squares stand for FP.
Figure 18.
Plots of mean TP and FP for simulations with d = 1,000, ρ = 0.8, and σ2 = 3. The circles stand for TP and black squares stand for FP.
Figure 19.
Four simulation scenarios in which datasets have different sparsity structures. The grey cells correspond to covariates with nonzero coefficients.
Table 22.
Simulation with trees as weak learners: summary statistics on identification. In each cell, mean (sd). d = 1,000.
Complete | Half | None | ||||
---|---|---|---|---|---|---|
| ||||||
TP | FP | TP | FP | TP | FP | |
nonzero coef ~ unif(0.2,1) | ||||||
ρ = 0.2 | ||||||
New | 6.1 (2.3) | 3.1 (3.0) | 5.3 (2.1) | 5.2 (3.4) | 4.8 (2.2) | 5.2 (2.7) |
New+ | 6.0 (2.0) | 2.5 (2.7) | 5.2 (2.2) | 4.7 (3.5) | 4.8 (2.1) | 5.2 (2.8) |
alg1-SB | 4.9 (1.5) | 5.6 (2.9) | 5.1 (2.1) | 6.0 (2.8) | 4.9 (2.3) | 5.2 (2.8) |
indiv-SB | 4.8 (1.4) | 4.3 (1.9) | 5.3 (2.0) | 4.7 (2.1) | 5.2 (1.9) | 4.7 (2.1) |
ρ = 0.5 | ||||||
New | 8.4 (2.8) | 1.2 (1.7) | 7.2 (2.0) | 2.8 (2.0) | 7.0 (1.8) | 4.1 (1.7) |
New+ | 8.8 (3.5) | 0.9 (1.5) | 7.1 (1.7) | 2.6 (2.2) | 7.0 (1.9) | 4.0 (1.7) |
alg1-SB | 7.3 (1.9) | 3.4 (2.4) | 6.7 (1.8) | 4.3 (2.1) | 7.0 (1.8) | 4.1 (1.6) |
indiv-SB | 7.2 (1.7) | 2.6 (1.4) | 7.1 (1.9) | 3.6 (1.9) | 6.8 (2.0) | 2.7 (1.6) |
ρ = 0.8 | ||||||
New | 12.1 (3.2) | 0.9 (1.6) | 11.2 (2.3) | 1.9 (1.3) | 12.9 (1.6) | 2.3 (1.9) |
New+ | 12.4 (3.2) | 0.5 (0.8) | 10.8 (3.0) | 1.5 (1.3) | 13.0 (1.8) | 2.4 (2.1) |
alg1-SB | 13.1 (2.0) | 1.9 (1.6) | 11.3 (2.2) | 2.2 (1.3) | 12.8 (1.8) | 2.2 (1.7) |
indiv-SB | 12.8 (2.0) | 1.4 (1.5) | 10.8 (1.9) | 1.3 (1.0) | 12.8 (1.8) | 1.5 (1.4) |
| ||||||
nonzero coef = 1 | ||||||
ρ = 0.2 | ||||||
New | 7.4 (2.7) | 4.4 (3.5) | 6.5 (1.7) | 5.1 (3.2) | 6.1 (2.4) | 8.2 (2.7) |
New+ | 6.9 (2.7) | 3.9 (3.4) | 6.3 (1.7) | 4.9 (3.3) | 5.9 (2.4) | 8.2 (2.8) |
alg1-SB | 6.1 (1.8) | 8.8 (3.5) | 5.8 (1.5) | 8.1 (3.1) | 6.1 (2.4) | 8.3 (2.8) |
indiv-SB | 6.1 (1.9) | 7.5 (3.0) | 6.0 (1.6) | 7.4 (2.3) | 6.1 (2.3) | 7.5 (3.2) |
ρ = 0.5 | ||||||
New | 11.3 (3.0) | 1.7 (1.6) | 9.0 (2.1) | 4.0 (2.1) | 8.7 (2.1) | 4.7 (1.7) |
New+ | 11.0 (3.6) | 1.3 (1.7) | 9.0 (2.4) | 4.3 (2.9) | 8.7 (2.1) | 4.8 (1.8) |
alg1-SB | 9.2 (2.0) | 4.8 (2.2) | 8.8 (2.3) | 5.3 (2.1) | 8.7 (2.1) | 4.8 (1.8) |
indiv-SB | 9.2 (2.1) | 4.4 (2.0) | 8.3 (2.0) | 4.2 (2.1) | 8.6 (2.4) | 3.9 (1.6) |
ρ = 0.8 | ||||||
New | 15.4 (2.0) | 1.7 (2.3) | 14.2 (1.6) | 2.7 (1.7) | 15.2 (1.4) | 3.2 (1.9) |
New+ | 15.1 (3.0) | 0.9 (1.8) | 13.9 (2.7) | 2.3 (1.9) | 15.3 (1.5) | 3.2 (1.8) |
alg1-SB | 15.4 (1.5) | 3.2 (2.6) | 14.2 (1.5) | 3.1 (2.1) | 15.3 (1.5) | 3.1 (1.8) |
indiv-SB | 15.0 (1.8) | 2.5 (2.4) | 13.4 (2.0) | 2.0 (1.7) | 15.0 (1.6) | 2.7 (2.5) |
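Table 22 reports the simulation in which trees serve as the weak learners in boosting. For orientation, the sketch below shows generic L2-boosting with depth-1 trees (stumps): at each iteration a stump is fitted to the current residuals and a shrunken step is taken. This is only a minimal illustration of the weak-learner choice, under toy data of our own; it is not the proposed sparse boosting algorithm, which additionally selects weak learners via a BIC/HDBIC-type criterion and penalizes dissimilar sparsity structures across datasets.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Generic L2-boosting with depth-1 trees (stumps) as weak learners.
# Toy data: n = 100 observations, d = 20 covariates, two true signals.
rng = np.random.default_rng(1)
n, d = 100, 20
X = rng.normal(size=(n, d))
y = X[:, 0] - 0.8 * X[:, 1] + rng.normal(scale=0.5, size=n)

nu = 0.1           # step size (shrinkage)
n_iter = 200       # number of boosting iterations
F = np.zeros(n)    # current fitted values
for _ in range(n_iter):
    residual = y - F                                  # negative gradient of squared loss
    stump = DecisionTreeRegressor(max_depth=1).fit(X, residual)
    F += nu * stump.predict(X)                        # shrunken update

print("training MSE:", float(np.mean((y - F) ** 2)))
```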
Table 23. Analysis of the breast cancer datasets: identified genes and estimates under the Cox model.

| UniGene | Alt.1 | alg1-SB | New | New+ | indiv-SB | pool-SB |
|---|---|---|---|---|---|---|
| **Dataset 1** | | | | | | |
| Hs.646 | | | | | | −0.187 |
| Hs.19413 | | | 0.227 | 0.292 | | |
| Hs.19699 | | | | | | −0.135 |
| Hs.25351 | −0.179 | −0.287 | −0.393 | −0.343 | −0.287 | |
| Hs.75149 | −0.049 | | −0.117 | −0.117 | | |
| Hs.78881 | −0.204 | −0.342 | −0.367 | −0.257 | −0.271 | |
| Hs.82548 | 0.073 | 0.305 | 0.392 | 0.308 | 0.305 | |
| Hs.95821 | | | | | | 0.211 |
| Hs.154443 | | | | | | 0.057 |
| Hs.154797 | | | | | | −0.093 |
| Hs.274382 | −0.061 | | −0.030 | −0.030 | | |
| Hs.288319 | 0.025 | | | | | |
| Hs.431584 | −0.160 | −0.518 | −0.560 | −0.419 | −0.428 | |
| **Dataset 2** | | | | | | |
| Hs.646 | −0.034 | | | | | −0.187 |
| Hs.2421 | −0.032 | | | | | |
| Hs.15303 | −0.109 | −0.536 | −0.544 | −0.489 | −0.296 | |
| Hs.19699 | | | | | | −0.135 |
| Hs.25351 | | | 0.011 | 0.011 | | |
| Hs.82548 | −0.032 | | −0.015 | −0.015 | | |
| Hs.89506 | 0.032 | | | | | |
| Hs.95821 | 0.053 | | 0.255 | 0.278 | | 0.211 |
| Hs.154443 | 0.088 | | | | | 0.057 |
| Hs.154797 | | | −0.013 | −0.013 | | −0.093 |
| Hs.407372 | 0.087 | | | | | |
| **Dataset 3** | | | | | | |
| Hs.646 | | | | | | −0.187 |
| Hs.1578 | 0.047 | | 0.070 | 0.070 | | |
| Hs.14541 | 0.067 | | | | | |
| Hs.19699 | | | | | | −0.135 |
| Hs.75617 | 0.022 | | | | | |
| Hs.89399 | −0.049 | | | | | |
| Hs.95821 | | | | | | 0.211 |
| Hs.154443 | | | | | | 0.057 |
| Hs.154797 | −0.185 | −0.556 | −0.635 | −0.448 | −0.565 | −0.093 |
| Hs.177584 | 0.163 | 0.421 | 0.466 | 0.315 | 0.442 | |
| Hs.206770 | 0.085 | 0.509 | 0.598 | 0.401 | 0.528 | |
| Hs.301094 | −0.021 | | | | | |
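A natural way to summarize what Table 23 displays, namely how similar the identified marker sets are across the three datasets, is a pairwise set-overlap measure such as the Jaccard index. The sketch below computes it for illustrative per-dataset gene sets; the measure and the example memberships are our own illustration, not a quantity reported in the paper.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two selected-gene sets:
    |intersection| / |union|."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

# Illustrative per-dataset selections using UniGene IDs from Table 23;
# the memberships below are hypothetical, for demonstration only.
selected = {
    "Dataset 1": {"Hs.25351", "Hs.78881", "Hs.82548", "Hs.431584"},
    "Dataset 2": {"Hs.15303", "Hs.25351", "Hs.82548", "Hs.95821"},
    "Dataset 3": {"Hs.154797", "Hs.177584", "Hs.206770"},
}

pairs = list(combinations(selected, 2))
for a, b in pairs:
    print(f"{a} vs {b}: Jaccard = {jaccard(selected[a], selected[b]):.3f}")
avg = sum(jaccard(selected[a], selected[b]) for a, b in pairs) / len(pairs)
print(f"average pairwise Jaccard similarity: {avg:.3f}")
```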
References
- 1. Guerra R, Goldstein DR. Meta-Analysis and Combining Information in Genetics and Genomics. Chapman and Hall/CRC; 2009.
- 2. Ritz J, Demidenko E, Spiegelman D. Multivariate meta-analysis for data consortia, individual patient meta-analysis, and pooling projects. Journal of Statistical Planning and Inference. 2008;138:1919–1933.
- 3. Stukel T, Demidenko E, Dykes J, Karagas M. Two-stage methods for the analysis of pooled data. Statistics in Medicine. 2001;20:2115–2130. doi:10.1002/sim.852.
- 4. Huang Y, Huang J, Shia BC, Ma S. Identification of cancer genomic markers via integrative sparse boosting. Biostatistics. 2012;13:509–522. doi:10.1093/biostatistics/kxr033.
- 5. Liu J, Huang J, Ma S. Integrative analysis of cancer diagnosis studies with composite penalization. Scandinavian Journal of Statistics. 2014;41(1):87–103. doi:10.1111/j.1467-9469.2012.00816.x.
- 6. Liu J, Huang J, Xie Y, Ma S. Sparse group penalized integrative analysis of multiple cancer prognosis datasets. Genetics Research. 2013;95:68–77. doi:10.1017/S0016672313000086.
- 7. Ma S, Huang J, Wei F, Xie Y, Fang K. Integrative analysis of multiple cancer prognosis studies with gene expression measurements. Statistics in Medicine. 2011;30:3361–3371. doi:10.1002/sim.4337.
- 8. Rhodes DR, Yu J, Shanker K, Deshpande N, Varambally R, Ghosh D, et al. Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. PNAS. 2004;101(25):9309–9314. doi:10.1073/pnas.0401994101.
- 9. Bühlmann P, Yu B. Sparse boosting. Journal of Machine Learning Research. 2006;7:1001–1024.
- 10. Zhang J, Ramadge PJ. Sparse boosting. In: IEEE International Conference on Acoustics, Speech and Signal Processing; 2009.
- 11. Bühlmann P, Hothorn T. Boosting algorithms: regularization, prediction and model fitting (with discussion). Statistical Science. 2007;22:477–505.
- 12. Tan PN, Steinbach M, Kumar V. Introduction to Data Mining. Addison-Wesley; 2005.
- 13. Ma S, Huang Y, Huang J, Fang K. Gene network-based cancer prognosis analysis with sparse boosting. Genetics Research. 2012;94:205–221. doi:10.1017/S0016672312000419.
- 14. Ing C, Lai T. A stepwise regression method and consistent model selection for high-dimensional sparse linear models. Statistica Sinica. 2011;21:1473–1513.
- 15. An H, Huang D, Yao Q, Zhang C. Stepwise searching for feature variables in high-dimensional linear regression. Technical report; 2008. Available at http://stats.lse.ac.uk/q.yao/qyao.links/paper/ahyz08.pdf.
- 16. Zhang C. Nearly unbiased variable selection under minimax concave penalty. Annals of Statistics. 2010;38(2):894–942.
- 17. Sotiriou C, Wirapati P, Loi S, Harris A, Fox S, et al. Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. JNCI. 2006;98:262–272. doi:10.1093/jnci/djj052.
- 18. van’t Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002;415:530–536. doi:10.1038/415530a.
- 19. Huang E, Cheng SH, Dressman H, Pittman J, Tsou MH, Horng CF, et al. Gene expression predictors of breast cancer outcomes. Lancet. 2003;361:1590–1596. doi:10.1016/S0140-6736(03)13308-9.
- 20. Sotiriou C, Neo SY, McShane LM, Korn EL, Long PM, Jazaeri A, et al. Breast cancer classification and prognosis based on gene expression profiles from a population-based study. PNAS. 2003;100:10393–10398. doi:10.1073/pnas.1732912100.
- 21. Kang DD, Sibille E, Kaminski N, Tseng GC. MetaQC: objective quality control and inclusion/exclusion criteria for genomic meta-analysis. Nucleic Acids Research. 2011;40:1–14. doi:10.1093/nar/gkr1071.
- 22. Tseng GC, Ghosh D, Feingold E. Comprehensive literature review and statistical considerations for microarray meta-analysis. Nucleic Acids Research. 2012;40:3785–3799. doi:10.1093/nar/gkr1265.
- 23. Brenner H, Gefeller O. Variation of sensitivity, specificity, likelihood ratios and predictive values with disease prevalence. Statistics in Medicine. 1997;16:981–991. doi:10.1002/(sici)1097-0258(19970515)16:9<981::aid-sim510>3.0.co;2-n.
- 24. Li J, Fine J. Assessing the dependence of sensitivity and specificity on prevalence in meta-analysis. Biostatistics. 2011;12:710–722. doi:10.1093/biostatistics/kxr008.
- 25. Ridgeway G. Generalized Boosted Models: A guide to the gbm package. 2007. Available at http://www.saedsayad.com/docs/gbm2.pdf.
- 26. Stute W. Distributional convergence under random censorship when covariates are present. Scandinavian Journal of Statistics. 1996;23:461–471.