Published in final edited form as: Biom J. 2016 Nov 21;59(2):358–376. doi: 10.1002/bimj.201600052

Analyzing Large Datasets with Bootstrap Penalization

Kuangnan Fang 1, Shuangge Ma 1,2
PMCID: PMC5577005  NIHMSID: NIHMS899731  PMID: 27870109

Abstract

Data with a large p (number of covariates) and/or a large n (sample size) are now commonly encountered. For many problems, regularization, especially penalization, is adopted for estimation and variable selection. The straightforward application of penalization to large datasets demands a “big computer” with high computational power. To improve computational feasibility, we develop bootstrap penalization, which dissects a big penalized estimation problem into a set of small ones that can be executed in a highly parallel manner, each demanding only a “small computer”. The proposed approach takes different strategies for data with different characteristics. For data with a large p but a small to moderate n, covariates are first clustered into relatively homogeneous blocks. The proposed approach then consists of two sequential steps: in each step and for each bootstrap sample, we select blocks of covariates and run penalization, and the results from multiple bootstrap samples are pooled to generate the final estimate. For data with a large n but a small to moderate p, we bootstrap a small number of subjects, apply penalized estimation, and then conduct a weighted average over multiple bootstrap samples. For data with a large p and a large n, the natural marriage of the previous two methods is applied. Numerical studies, including simulations and data analysis, show that the proposed approach has computational and numerical advantages over the straightforward application of penalization. An R package has been developed to implement the proposed methods.

Keywords: large datasets, computational feasibility, penalization, bootstrap

1 Introduction

Data with a large p (number of covariates) and/or a large n (sample size) are now commonly encountered in many fields. See Jacobs (2009), Schadt et al. (2010), Fan et al. (2014) and many others in the literature. In the analysis of large datasets, regularization techniques, especially penalization, have been commonly applied for estimation and variable selection. With large datasets, the straightforward application of penalization (or another regularization technique) demands a “big computer” with a high input/output capacity and superior computational power (Richtarik and Takac, 2012; Cevher et al., 2014). In practice, researchers may not have such a big computer but multiple “small computers” with a moderate computational capacity. In this study, the goal is to develop a novel approach which can improve the computational feasibility of penalization for large datasets. With data getting bigger and bigger, the studied problem is important and timely.

First consider data with a large p but a small to moderate n. When the dimensionality is not too high, penalization has been directly applied (Tibshirani, 1996; Zou and Hastie, 2005; Zou, 2006; Zhang, 2010; Fan and Li, 2001). With ultrahigh-dimensional covariates, some studies have applied marginal screening to improve computational feasibility (Li et al., 2005; Fan and Song, 2010). The marginal screening approach develops the innovative strategy of dissecting one big joint analysis problem into a large set of small ones. However, to achieve consistency, it assumes weak correlations among covariates, which is often violated in practice. A study more relevant to the present one is Wang et al. (2010), which develops the random Lasso approach, partly motivated by random forests (Breiman, 2001). To improve the computational feasibility of Lasso penalization, the random Lasso dissects a big penalization problem into a set of small ones. More specifically, it draws bootstrap samples from the original data, randomly selects candidate variables, analyzes the resulting data in a highly parallel manner, and then assembles the estimates via averaging. It also has the advantage that the number of selected important variables is not limited by the sample size, and it has been observed to have competitive prediction performance. However, as we have found in numerical studies (see details below), when there are moderate to high correlations among covariates, the random Lasso estimates can be biased, and this bias leads to inferior selection.

Second, consider data with a large n but a small to moderate p. In terms of statistical theory, this type of data is the simplest. In computation, intuitively, we can dissect a big dataset into pieces. As long as the pieces are not too small, satisfactory results may be expected from each piece, and the final estimates can be obtained via averaging. This dissecting step shares a certain similarity with the “m out of n” bootstrap (Bickel et al., 1997; Politis et al., 1999). For some statistics (e.g., the sample mean), this approach leads to exactly the same results as analyzing the whole dataset directly. However, as will be discussed in the next section, not all pieces are equal: some pieces may be “unlucky”, have unsatisfactory estimates, and should be down-weighted. More importantly, simple averaging is insufficient when regularization is applied for variable selection.

In the literature, research on data with both a large p and a large n is relatively limited. In principle, the methods developed under the previous two scenarios can be “combined”. However, this has not been carefully examined. In addition, as the existing methods for analyzing the previous two types of data have limitations, naturally, there is a need for more effective methods for data with both a large p and a large n.

Consider penalized analysis of large datasets. The goal is not to develop a new penalization method but rather to improve computational feasibility with the assistance of bootstrapping samples and variables. The proposed approach is related to some existing ones but advances beyond them in multiple ways. It is related to the bootstrap technique in randomly selecting covariates/samples, but it uses this as a way of generating smaller and computationally more feasible datasets, not for inference. It is also related to the random Lasso and other sample averaging methods but differs from them in bootstrapping blocks of covariates, adopting weights for samples, and introducing stability to improve estimation/selection. Numerical studies demonstrate its superiority over the existing alternatives. This study is also more comprehensive in that it considers three types of large data. It is warranted given the growing prevalence of large data and the persistent limitations of computing power.

2 Data and model settings

Assume n iid subjects {(X1, y1), ⋯, (Xn, yn)}. For subject i, yi is the response variable, and Xi = (xi1, ⋯, xip)T is the p-dimensional vector of covariates. Consider the linear model

$$y_i = \alpha + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i, \qquad (1)$$

where α is the intercept, β = (β1, ⋯, βp)T ∈ ℝp is the parameter vector to be estimated, and εi is the error term with mean 0 and variance σ2. For comparability, assume that the covariates have been standardized. When β is sparse or can be approximated by a sparse vector, the Lasso penalization has been extensively applied for regularized estimation and variable selection. The Lasso estimate is defined as

$$\hat{\beta} = \arg\min_{\alpha,\beta}\Big\{\sum_{i=1}^{n}\Big(y_i - \alpha - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j|\Big\}, \qquad (2)$$

where λ > 0 is a data-dependent tuning parameter. In the straightforward application of Lasso penalization, covariates corresponding to the nonzero components of β̂ are identified as associated with the response.
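
For concreteness, a minimal R sketch of this straightforward Lasso fit is given below, using the glmnet package with λ chosen by cross validation. The simulated data are purely illustrative, and glmnet scales the least squares term by 1/(2n), which only rescales λ relative to (2).

```r
# Straightforward Lasso fit of model (1) with criterion (2), lambda by cross validation.
library(glmnet)

set.seed(1)
n <- 100; p <- 200
X <- matrix(rnorm(n * p), n, p)                 # illustrative covariates
beta_true <- c(rep(2, 5), rep(0, p - 5))        # sparse true coefficients
y <- as.vector(X %*% beta_true + rnorm(n))

cv_fit <- cv.glmnet(X, y, alpha = 1)            # alpha = 1 gives the Lasso penalty
beta_hat <- as.matrix(coef(cv_fit, s = "lambda.min"))[-1, 1]  # drop the intercept
selected <- which(beta_hat != 0)                # covariates identified as associated with y
```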

It is noted that the linear regression model and Lasso penalization have been adopted for their popularity and simplicity of description. With only minor modifications, the proposed approach is also applicable to other data and model settings, where a negative log-likelihood function can take the place of the sum of squared residuals. The Lasso penalty can also be replaced by the SCAD, MCP, bridge, and other penalties. The proposed strategy can also potentially be coupled with regularization techniques other than penalization.

3 Methods

With large datasets, our strategy is to bootstrap smaller subsets, analyze in a parallel manner, and then assemble the results. As partly shown in Figure 1 (where the shaded areas represent data analyzed in one bootstrap run), the approach varies for different data scenarios. In what follows, we first consider data with a large p but a small to moderate n in Section 3.1 and then data with a large n but a small to moderate p in Section 3.2. The developments in these two subsections are naturally “combined” to analyze data with both a large p and a large n in Section 3.3.

Figure 1. Analysis scheme for data with (a) a large p and a small to moderate n (left), (b) a large n and a small to moderate p (middle), and (c) a large p and a large n (right). The shaded areas represent data analyzed in one bootstrap run.

3.1 Bootstrap penalization for data with a large p

With such data, the key is to reduce p to a more manageable level. The proposed method is realized in three steps.

Step 1

Cluster the p covariates into K non-overlapping blocks.

With a specific clustering approach and a candidate number of blocks k, denote (C1, …, Ck) as the index sets of the resulting blocks, where $C_i \cap C_j = \emptyset$ for $i \neq j$ and $\sum_j |C_j| = p$. The optimal number of blocks K is chosen by maximizing the Dunn index (Dunn, 1974; Handl et al., 2005), defined as $\min_{j \neq l}\big(\min_{u \in C_j, v \in C_l}\mathrm{dist}(u, v)\big) \big/ \max_{j}\mathrm{diam}(C_j)$, where dist(u, v) is the distance between covariates u and v, and diam(Cj) is the maximum distance between any two covariates in Cj. With the Dunn index, we generate non-overlapping blocks with the highest degree of compactness and separation.

As will be discussed in more detail, the goal of clustering is to generate covariate blocks that are only weakly correlated with each other. When covariates are continuously distributed and properly normalized, we take K-means clustering as the default. Note that the proposed approach does not assume the existence of a specific clustering structure. It is also not the intent to recover the “true” clustering structure. Rather, the goal is to “break correlations”, which can be easily checked in data analysis. We acknowledge that there are a large number of clustering techniques. K-means is used in our numerical study for its computational simplicity. In principle, other clustering techniques can also be used, as long as the resulting clusters have weak correlations (in our numerical study, we find that this is satisfied).
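
A minimal sketch of Step 1, under the description above, is given below: the (standardized) covariates are clustered with K-means, and the number of blocks is chosen over an illustrative candidate grid by the Dunn index, computed here directly from the between-covariate distance matrix. The objects X and y are carried over from the earlier sketch.

```r
# Step 1: cluster the p covariates into K blocks and choose K by the Dunn index.
dunn_index <- function(d, cluster) {
  d <- as.matrix(d)
  K <- max(cluster)
  diam <- 0; sep <- Inf
  for (j in 1:K) {
    in_j <- which(cluster == j)
    diam <- max(diam, max(d[in_j, in_j]))              # largest within-block distance
    for (l in setdiff(1:K, j)) {
      in_l <- which(cluster == l)
      sep <- min(sep, min(d[in_j, in_l]))              # smallest between-block distance
    }
  }
  sep / diam
}

X_t <- t(scale(X))                               # covariates as rows
d <- dist(X_t)                                   # Euclidean distance between standardized covariates
K_grid <- 2:20                                   # candidate numbers of blocks (illustration)
fits <- lapply(K_grid, function(K) kmeans(X_t, centers = K, nstart = 10))
dunn_vals <- sapply(fits, function(f) dunn_index(d, f$cluster))
blocks <- fits[[which.max(dunn_vals)]]$cluster   # block membership of each covariate
K <- max(blocks)
```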

Step 2

Generate an importance measure for each block.

  1. Draw B2 bootstrap samples – each with sample size n – by sampling with replacement from the original data.

  2. For b2 = 1 to B2:
    1. Select k1 candidate blocks at random from the K blocks. Denote $E_1^{b_2} \subset \{1, 2, \ldots, K\}$ as the index set of the blocks selected for the b2th bootstrap sample and $H_1^{b_2}$ as the corresponding index set of selected covariates.
    2. Apply Lasso to the bootstrap sample and generate the estimate by minimizing
      $$\sum_{i=1}^{n}\Big(y_i^{b_2} - \alpha - \sum_{j \in H_1^{b_2}} \beta_j x_{ij}^{b_2}\Big)^2 + \lambda \sum_{j \in H_1^{b_2}} |\beta_j|.$$
      Here we use the superscript “b2” to denote the b2th bootstrap sample. Note that only the coefficients of covariates in $H_1^{b_2}$ are estimated. Set $\hat{\beta}_j = 0$ for $j \in \{1, \ldots, p\} \setminus H_1^{b_2}$. Denote the resulting estimate as $\hat{\beta}_j^{(b_2)}$.
  3. Compute the importance measure of block k(= 1, …, K) as
    $$I_k = \sum_{j \in C_k}\Big(B_2^{-1}\sum_{b_2=1}^{B_2}\hat{\beta}_j^{(b_2)}\Big)^2.$$

This step generates the importance measures for the K blocks. Each bootstrap dataset contains the same number of subjects as the original dataset but a smaller number of covariates when k1 < K. For a single bootstrap dataset, the computational burden can be considerably lower than that of the original dataset if $|H_1^{b_2}| \ll p$. Although the procedure needs to be repeated B2 times, as it can be executed in a highly parallel manner using multiple small computers, the resulting computer time can be significantly lower than that of straightforward penalization on a big computer. This is especially desirable when a high-performance big computer is not available. In “ordinary” penalized analysis, the relative importance of a covariate is measured by the magnitude of its estimate (especially zero versus nonzero). Here, with the blocking structure, we compute the block importance measures via averaging. We have also considered other ways of averaging (for example, $I_k = \sum_{j \in C_k}\big|B_2^{-1}\sum_{b_2=1}^{B_2}\hat{\beta}_j^{(b_2)}\big|$) and obtained similar numerical results.
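
A minimal sketch of Step 2 follows. B2 and k1 are set to illustrative values, blocks and K come from the Step 1 sketch, and the loop is written serially although in practice it would be distributed across small computers (e.g., with parallel::parLapply or the foreach package).

```r
# Step 2: block importance via bootstrap samples and Lasso on randomly chosen blocks.
library(glmnet)

B2 <- 200; k1 <- min(10, K)                    # illustrative settings
beta_sum <- rep(0, ncol(X))                    # running sum of bootstrap estimates

for (b2 in 1:B2) {                             # run in parallel in practice
  rows <- sample(nrow(X), replace = TRUE)      # bootstrap subjects (size n, with replacement)
  sel_blocks <- sample(K, k1)                  # k1 candidate blocks chosen at random
  cols <- which(blocks %in% sel_blocks)        # covariates in the selected blocks
  # (assumes the selected blocks contain at least two covariates in total)
  fit <- cv.glmnet(X[rows, cols, drop = FALSE], y[rows], alpha = 1)
  b_hat <- rep(0, ncol(X))                     # estimates are zero outside the selected blocks
  b_hat[cols] <- as.matrix(coef(fit, s = "lambda.min"))[-1, 1]
  beta_sum <- beta_sum + b_hat
}

beta_bar <- beta_sum / B2                      # average estimate over bootstrap runs
importance <- tapply(beta_bar^2, blocks, sum)  # I_k: sum of squared averages within block k
```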

Step 3

Generate the final selection and estimation results.

  1. Draw B3 bootstrap samples – each with sample size n – by sampling with replacement from the original data.

  2. For b3 = 1 to B3:
    1. Select k2 candidate blocks from the K blocks with selection probabilities proportional to the importance measures $I_k$ obtained in Step 2. Denote $E_2^{b_3} \subset \{1, 2, \ldots, K\}$ as the index set of the blocks selected for the b3th bootstrap sample and $H_2^{b_3}$ as the corresponding index set of selected covariates.
    2. Apply Lasso to the bootstrap sample and generate the estimate by minimizing
      $$\sum_{i=1}^{n}\Big(y_i^{b_3} - \alpha - \sum_{j \in H_2^{b_3}} \beta_j x_{ij}^{b_3}\Big)^2 + \lambda \sum_{j \in H_2^{b_3}} |\beta_j|.$$
      Here we use the superscript “b3” to denote the b3th bootstrap sample. Set $\hat{\beta}_j = 0$ for $j \in \{1, \ldots, p\} \setminus H_2^{b_3}$. Denote the resulting estimate as $\hat{\beta}_j^{(b_3)}$.
  3. Conduct stability-based selection. Specifically, set $\hat{\beta}_j = 0$ if $B_3^{-1}\sum_{b_3=1}^{B_3} I(\hat{\beta}_j^{(b_3)} \neq 0) < \pi_1$ for j = 1, ⋯, p.

  4. For j = 1, ⋯, p, if $\hat{\beta}_j \neq 0$ from step 3, calculate the final estimate as $\hat{\beta}_j = B_3^{-1}\sum_{b_3=1}^{B_3}\hat{\beta}_j^{(b_3)}$.

This step shares a similar strategy with Step 2. A major difference is that the blocks are not bootstrapped completely at random: as important blocks are “more interesting”, they are assigned higher probabilities of being selected. In the literature, most covariate importance measures have been defined based on estimation. For example, in Wang et al. (2010), covariate j is defined as important if |β̂j| > tn, and tn is set as 1/n in the numerical study. This choice of cutoff is somewhat subjective. In addition, when the estimation is not sufficiently stable (which is likely to happen when the sample size is small), the selection can be influenced by one or a small number of “outliers” among the estimates. Our proposal has been motivated by stability selection (Meinshausen and Buhlmann, 2010). We set a coefficient estimate to zero if the proportion of bootstrap samples with a nonzero estimate falls below π1. Here π1 is chosen using a BIC-type criterion as $\hat{\pi}_1^{BIC} = \arg\min_{\pi_1}\big(n\log(n^{-1}\|y - \hat{y}\|^2) + \log(n)\,|\hat{A}(\pi_1)|\big)$, where |Â(π1)| is the number of nonzero estimates with cutoff π1. The proposed variable selection criterion is data-driven and more appropriate than estimation-based criteria. It also has the advantage that the number of selected covariates is not limited by the sample size.
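
A minimal sketch of Step 3 is given below, reusing the objects from the Step 2 sketch. The intercept in the BIC-type criterion is approximated by the mean of y (reasonable when the covariates are standardized), which is an assumption about a detail the description leaves open.

```r
# Step 3: importance-weighted block bootstrap, stability-based selection, BIC cutoff.
B3 <- 200; k2 <- min(10, K)                       # illustrative settings
est <- matrix(0, B3, ncol(X))                     # one row of estimates per bootstrap run

for (b3 in 1:B3) {                                # run in parallel in practice
  rows <- sample(nrow(X), replace = TRUE)
  sel_blocks <- sample(K, k2, prob = importance + 1e-8)   # selection prob proportional to I_k
  cols <- which(blocks %in% sel_blocks)
  fit <- cv.glmnet(X[rows, cols, drop = FALSE], y[rows], alpha = 1)
  est[b3, cols] <- as.matrix(coef(fit, s = "lambda.min"))[-1, 1]
}

freq <- colMeans(est != 0)                        # selection frequency of each covariate
beta_avg <- colMeans(est)

bic_cutoff <- function(pi1) {                     # BIC-type criterion for the cutoff pi_1
  b <- ifelse(freq >= pi1, beta_avg, 0)
  yhat <- mean(y) + as.vector(X %*% b)            # intercept approximated by mean(y) (assumption)
  length(y) * log(mean((y - yhat)^2)) + log(length(y)) * sum(b != 0)
}
pi_grid <- seq(0.1, 0.9, by = 0.05)
pi1 <- pi_grid[which.min(sapply(pi_grid, bic_cutoff))]
beta_final <- ifelse(freq >= pi1, beta_avg, 0)    # stability-selected final estimate
```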

Intuitively, larger values of $|H_1^{b_2}|$ and $|H_2^{b_3}|$ lead to estimates “closer” to those using the original data, but at the same time, a higher computational burden. In practice, they can be selected as reasonably small values that minimize data-driven criteria (for example, cross validation-based). The values of B2 and B3 do not play an important role as long as they are not too small. λ can be selected using cross validation.

Remarks

In both Steps 2 and 3, penalization is applied to only a subset of covariates. If this subset does not include all of the important covariates, then, in general, the estimate is biased. This bias is caused by bootstrapping (for now, we ignore the possible bias caused by Lasso penalization). It diminishes if the important covariates in the selected subset are orthogonal to the important covariates not selected. This follows a spirit similar to that of sure independence screening. However, correlations generally exist among covariates. Motivated by this, the proposed approach advances beyond existing methods such as the random Lasso in that blocks of covariates are generated and used as the basic units for bootstrapping. The goal of the blocks is to achieve weak correlations between the covariates selected in a bootstrap run and those not selected. As the proposed approach can better accommodate correlations, it is expected that the estimates so generated are less biased than those based on individual covariates. As will be demonstrated in the numerical study, this can lead to improved selection and estimation. It is noted that under the extreme scenario where there are no (or very weak) correlations among covariates, a block may contain just a single covariate, and the proposed approach is then similar to the random Lasso.
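
The bias mechanism can be made explicit with a simple least squares heuristic (the penalty is ignored here purely for illustration). Let A index the covariates in the bootstrapped blocks and B the remaining covariates. Regressing y on X_A alone gives

$$E\big[\hat{\beta}_A \mid X\big] = \beta_A + (X_A^T X_A)^{-1} X_A^T X_B\,\beta_B,$$

so the omitted-variable term vanishes when $X_A^T X_B = 0$, and bootstrapping weakly correlated blocks keeps $X_A^T X_B$, and hence this bias, small.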

3.2 Bootstrap penalization for data with a large n

For data with a large n, we resort to the “m out of n” bootstrap to dissect large data into small pieces and improve computational feasibility. The proposed method consists of the following steps.

  1. Generate S bootstrap samples, each of size m (≪ n), by sampling without replacement from the original data. With a large number of small computers and parallel computing, a small m and a large S can lead to significantly reduced computer time. Denote the subject index set of the sth bootstrap sample as Is.

  2. For s = 1 to S:
    1. Apply Lasso to subjects in Is, and obtain the estimate as
      $$(\hat{\alpha}^{(s)}, \hat{\beta}^{(s)}) = \arg\min_{\alpha,\beta} \sum_{i \in I_s}\Big(y_i - \alpha - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j|.$$
    2. Compute the prediction mean squared error as $\mathrm{PMSE}_s = (y_{(I \setminus I_s)} - \hat{\alpha}^{(s)} - X_{(I \setminus I_s)}\hat{\beta}^{(s)})^T (y_{(I \setminus I_s)} - \hat{\alpha}^{(s)} - X_{(I \setminus I_s)}\hat{\beta}^{(s)}) / |I \setminus I_s|$, where $y_{(I \setminus I_s)}$ and $X_{(I \setminus I_s)}$ denote the response vector and design matrix for the subjects not in $I_s$.
  3. Conduct stability-based selection. That is, set $\hat{\beta}_j = 0$ if $S^{-1}\sum_{s=1}^{S} I(\hat{\beta}_j^{(s)} \neq 0) < \pi_2$ for j = 1, …, p.

  4. If $\hat{\beta}_j \neq 0$ from the previous step, then calculate the final estimate as $\hat{\beta}_j = \sum_{s=1}^{S} w_s \hat{\beta}_j^{(s)}$. The weight $w_s$ is computed as $w_s = \max(\mathrm{PMSE}_{0s} - \mathrm{PMSE}_s, 0)\big/\sum_{s=1}^{S}\max(\mathrm{PMSE}_{0s} - \mathrm{PMSE}_s, 0)$, where $\mathrm{PMSE}_{0s} = (y_{(I \setminus I_s)} - \bar{y}_{(I \setminus I_s)})^T(y_{(I \setminus I_s)} - \bar{y}_{(I \setminus I_s)})/|I \setminus I_s|$ is the prediction mean squared error of the null model without covariates.

When it is difficult to manipulate and analyze data with a large sample size, we use the bootstrap to generate multiple small datasets, each of which is computationally more feasible. The overall computer time can be significantly reduced with parallel computing. The final estimate is taken as a weighted average over bootstrap runs. In Step 3, π2 is chosen as $\hat{\pi}_2^{BIC} = \arg\min_{\pi_2}\big(n\log(n^{-1}\|y - \hat{y}\|^2) + \log(n)\,|\hat{A}(\pi_2)|\big)$, in the same manner as in the previous section. In Step 4, when averaging over bootstrap estimates, we assign larger weights to models with better prediction performance on independent subjects. The max function ensures that the weights are non-negative. The weights may be less necessary when the bootstrap samples are sufficiently large and homogeneous. However, they become more important when m and S are not too large and when bootstrapping in both the covariate and sample dimensions is needed, as in the next section.
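
A minimal sketch of this procedure is given below; S, m, and the fixed cutoff for π2 are illustrative choices (π2 is chosen by the BIC-type criterion above in the proposed method), and X and y are assumed to hold the large-n data.

```r
# Large-n procedure: S subsamples of size m, Lasso on each, PMSE-weighted averaging.
library(glmnet)

S <- 50; m <- floor(nrow(X) / 10)                  # illustrative settings
n <- nrow(X); p <- ncol(X)
est <- matrix(0, S, p)
pmse <- pmse0 <- numeric(S)

for (s in 1:S) {                                   # run in parallel in practice
  idx <- sample(n, m)                              # subsample of size m, without replacement
  fit <- cv.glmnet(X[idx, ], y[idx], alpha = 1)
  cf <- as.matrix(coef(fit, s = "lambda.min"))
  est[s, ] <- cf[-1, 1]
  out <- setdiff(1:n, idx)                         # subjects held out of this subsample
  pred <- cf[1, 1] + as.vector(X[out, ] %*% est[s, ])
  pmse[s]  <- mean((y[out] - pred)^2)              # PMSE of the fitted model
  pmse0[s] <- mean((y[out] - mean(y[out]))^2)      # PMSE of the null model
}

freq <- colMeans(est != 0)
w <- pmax(pmse0 - pmse, 0); w <- w / sum(w)        # non-negative, normalized weights
pi2 <- 0.5                                         # illustrative cutoff; a BIC-type choice in the paper
beta_final <- ifelse(freq >= pi2, colSums(w * est), 0)
```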

3.3 Bootstrap penalization for data with a large p and a large n

When both the number of covariates and sample size are large, the methods developed in the previous two sections can be naturally “combined” for analysis. The proposed method consists of the following steps.

  1. Generate S bootstrap samples, each of size m (≪ n), by sampling without replacement from the original data. Denote the subject index set of the sth bootstrap sample as Is.

  2. For s = 1 to S:
    1. Apply the block bootstrap penalization method developed in Section 3.1 to the sth bootstrap sample, and obtain the estimate as β̂(s).
    2. Compute the prediction mean squared error as
      $$\mathrm{PMSE}_s = \big(y_{(I \setminus I_s)} - \hat{\alpha}^{(s)} - X_{(I \setminus I_s)}\hat{\beta}^{(s)}\big)^T\big(y_{(I \setminus I_s)} - \hat{\alpha}^{(s)} - X_{(I \setminus I_s)}\hat{\beta}^{(s)}\big)\big/\,|I \setminus I_s|.$$
  3. Conduct stability-based selection. Specifically, set $\hat{\beta}_j = 0$ if $S^{-1}\sum_{s=1}^{S} I(\hat{\beta}_j^{(s)} \neq 0) < \pi_3$ for j = 1, …, p. π3 is chosen using a BIC-type criterion as previously described.

  4. If $\hat{\beta}_j \neq 0$ from the previous step, then calculate the final estimate as $\hat{\beta}_j = \sum_{s=1}^{S} w_s \hat{\beta}_j^{(s)}$. The weight $w_s$ is computed as $w_s = \max(\mathrm{PMSE}_{0s} - \mathrm{PMSE}_s, 0)\big/\sum_{s=1}^{S}\max(\mathrm{PMSE}_{0s} - \mathrm{PMSE}_s, 0)$, where $\mathrm{PMSE}_{0s} = (y_{(I \setminus I_s)} - \bar{y}_{(I \setminus I_s)})^T(y_{(I \setminus I_s)} - \bar{y}_{(I \setminus I_s)})/|I \setminus I_s|$ is the prediction mean squared error of the null model without covariates.

Note that, different from Section 3.2, each analyzed bootstrap sample is likely to contain a different set of covariates. An “unlucky” bootstrap may leave the majority of the important covariates out of the selected blocks, which can lead to a bad model with poor prediction. The weights in Step 4, which depend on prediction performance, are especially desirable for down-weighting such bootstrap runs.
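
A compact sketch of the combined procedure is given below. block_boot_lasso() is a hypothetical wrapper around the Step 2 and Step 3 sketches of Section 3.1, assumed to return an intercept alpha and a coefficient vector beta for one subsample; the remaining settings are again illustrative.

```r
# Combined procedure: block bootstrap penalization within each subsample of size m,
# then PMSE-weighted averaging across subsamples as in Section 3.2.
S <- 50; m <- floor(nrow(X) / 10)
n <- nrow(X); p <- ncol(X)
est <- matrix(0, S, p)
pmse <- pmse0 <- numeric(S)

for (s in 1:S) {
  idx <- sample(n, m)
  fit <- block_boot_lasso(X[idx, ], y[idx], blocks)   # hypothetical wrapper (Section 3.1 sketches)
  est[s, ] <- fit$beta
  out <- setdiff(1:n, idx)
  pred <- fit$alpha + as.vector(X[out, ] %*% fit$beta)
  pmse[s]  <- mean((y[out] - pred)^2)
  pmse0[s] <- mean((y[out] - mean(y[out]))^2)
}

w <- pmax(pmse0 - pmse, 0); w <- w / sum(w)
freq <- colMeans(est != 0)
pi3 <- 0.5                                            # illustrative; chosen by BIC in the paper
beta_final <- ifelse(freq >= pi3, colSums(w * est), 0)
```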

3.4 R software development

To facilitate implementation of the proposed approach and to better guide its application, we have developed the R package bootpenal. The software manual and source code are publicly available at https://github.com/ruiqwy/bootpenal.git. As in this article, it accommodates the linear regression model with Lasso and adaptive Lasso penalization (more details on the adaptive Lasso below). Note that the proposed approach demands parallel computing, and so some adjustments of the software may be needed depending on the hardware and software configurations. As the source code is publicly available, researchers can modify it to accommodate other models and regularization techniques.

4 Simulation

We conduct simulations to examine the performance of the proposed approach and compare it with existing alternatives. Six examples are simulated. Examples 1–3 have “classic” low-dimensional covariates and small sample sizes. They are provided to show that the proposed approach is not limited to large datasets. Specifically, these three examples are very similar to those in existing studies such as Wang et al. (2010), which facilitates comparison. Examples 4–6 have a large p and a small n, a large n and a small p, and a large p and a large n, respectively, matching the methodological development in Sections 3.1–3.3.

4.1 Simulation settings

Example 1

Consider p = 8, n = 50, 100, and σ = 1, 3, 6. The eight covariates form five blocks. Each of the first three blocks contains two covariates generated from a standard multivariate normal distribution with correlation coefficient 0.9. Each of the last two blocks contains a single covariate generated from the standard normal distribution. Covariates in different blocks are independent. The true regression coefficient vector is β = (3, −2, 1.5, −1, 2, −0.8, 0, 0)^T.

Example 2

Consider p = 40, n = 100, and σ = 1, 3, 6. Covariates are generated from a multivariate normal distribution with marginal means zero. Among the first ten covariates, the pairwise correlation coefficients are all 0.9. The remaining 30 covariates are independent of each other and also independent of the first ten covariates. The first ten coefficients are nonzero. Specifically, the regression coefficient vector is β = (3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 0, …, 0)^T.

Example 3

Consider p = 40, n = 100, and σ = 1, 3, 6. Covariates are generated from a multivariate normal distribution with marginal means zero. Among covariates 1–3 and 4–6, respectively, the pairwise correlation coefficients are 0.9. Otherwise, the pairwise correlation coefficients are 0. The first six covariates are important. Specifically, the regression coefficient vector is β = (3, 3, 2, 3, 3, 2, 0, …, 0)^T.

Example 4

Consider p = 1000 and n = 100. The first ten coefficients are randomly drawn from N (3, 0.5), the next ten coefficients are randomly drawn from N (−3, 0.5), and their values are then fixed for all simulation runs. The remaining 980 coefficients are set as zero. Covariates are generated from a multivariate normal distribution with marginal means zero. The covariance matrix has the form

$$\begin{bmatrix}\Sigma_1 & \Sigma_2 & 0 & 0 & 0 & 0\\ \Sigma_2 & \Sigma_1 & 0 & 0 & 0 & 0\\ 0 & 0 & \Sigma_3 & 0 & 0 & 0\\ 0 & 0 & 0 & \Sigma_4 & 0 & 0\\ 0 & 0 & 0 & 0 & \Sigma_4 & 0\\ 0 & 0 & 0 & 0 & 0 & \Sigma_5\end{bmatrix}.$$

Σ1 is a 5×5 matrix with the diagonal elements equal to 1 and off-diagonal elements equal to 0.9. Σ2 = 0.3J, where J is a 5×5 matrix with unit elements. Σ3 is a 10×10 matrix with the diagonal elements equal to 1 and off-diagonal elements equal to 0.9. Σ4 is a 40×40 matrix with the diagonal elements equal to 1 and off-diagonal elements equal to 0.7. Σ5 has the form

$$\begin{bmatrix}\Sigma_6 & & 0\\ & \ddots & \\ 0 & & \Sigma_6\end{bmatrix}.$$

It contains 20 copies of matrix Σ6, which is a 45×45 matrix with the diagonal elements equal to 1 and off-diagonal elements equal to 0.4.

Example 5

Consider p = 8, n = 10^6, and σ = 1, 3. Covariates are generated from a multivariate normal distribution with marginal means zero. The pairwise correlation coefficient between $x_j$ and $x_k$ is $\rho(j, k) = 0.5^{|j-k|}$. The regression coefficient vector is β = (3, 0, 1.5, 0, 2, 0, 0, 0)^T.

Example 6

This example is the same as Example 4, except that n = 10^6.

Under all examples, the intercepts are set as 0, the random errors are normally distributed with mean zero and standard deviation σ, and the responses are generated from the linear regression model.

4.2 Results

When applying the proposed approach, we consider both the Lasso and adaptive Lasso (ALasso) penalties. As a variation of the Lasso, ALasso has been found to have better statistical and numerical properties. We acknowledge that many other penalties are also applicable. However, it is not our intention to compare different penalties, and we focus on Lasso and ALasso. With the two penalties, the proposed approach is referred to as B.Lasso and B.ALasso, respectively, where “B” stands for bootstrap. For the tunings: (a) we set $B_i = 200$; we find that this value does not have a big impact on the results, as long as it is not too small; (b) the tuning parameter λ in the Lasso (ALasso) penalization is selected using V-fold cross validation, which is the default in many studies; (c) when bootstrapping covariates, k1 and k2 are chosen using the approach in Wang et al. (2010); (d) when bootstrapping subjects, we experiment with a few n/m values.

For comparison, we also apply the Lasso, ALasso, and Enet (elastic net) penalties directly. Note that we have designed the simulated data to be “not too big” so that this is feasible. Another alternative we consider is the random Lasso (RLasso), which applies the bootstrap approach to individual covariates. RLasso is applied in exactly the same way as in Wang et al. (2010). We have tried to take advantage of existing R packages. Specifically, Lasso estimates are computed using the R package glmnet, and ALasso estimates are computed using msgps. When applying the proposed approach and RLasso, we adopt parallel computing with a large number of CPUs (each of which has standard configurations) available from the Yale High Performance Computing (HPC) service, and each small dataset is analyzed using a different CPU.

When evaluating the proposed approach and alternatives, we are interested in (a) computer time (seconds), (b) variable selection performance measured using FNR (false negative rate) and FDR (false discovery rate), and (c) relative model error defined as RME = (β̂ − β)T Σ(β̂ − β)/σ2, where β̂ is the estimate of β and Σ is the covariance matrix of covariates. Note that differences in computer time between Lasso and ALasso based approaches are also attributable to differences in R packages and computational algorithms.
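
For reference, the three performance measures can be computed with a short helper such as the one below, given the true and estimated coefficient vectors and the covariate covariance matrix Σ.

```r
# RME, FNR, and FDR for a single replicate.
eval_fit <- function(beta_hat, beta_true, Sigma, sigma2 = 1) {
  d <- beta_hat - beta_true
  rme <- as.numeric(t(d) %*% Sigma %*% d) / sigma2               # relative model error
  fnr <- mean(beta_hat[beta_true != 0] == 0)                     # important covariates missed
  sel <- which(beta_hat != 0)
  fdr <- if (length(sel) == 0) 0 else mean(beta_true[sel] == 0)  # unimportant covariates selected
  c(RME = rme, FNR = fnr, FDR = fdr)
}
```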

For the “classic” small data Examples 1–3, the variable selection results are shown in Tables 5, 7, and 8 (Appendix). The relative model errors are shown in Table 9 (Appendix). As a representative example, we also provide detailed estimation and selection results for Example 1 in Table 6 (Appendix). Those for the other examples are available from the authors. For these three examples, the computer time for all methods is very low and not provided. It is observed that the proposed approach has superior variable selection performance. Consider, for example, Example 2 with n = 100 and σ = 3. The FNRs are 24.37% (Lasso), 15.22% (ALasso), 23.66% (Enet), 3.61% (RLasso), 1.10% (B.Lasso), and 2.57% (B.ALasso), respectively. The FDRs are 44.71% (Lasso), 67.84% (ALasso), 41.51% (Enet), 68.40% (RLasso), 28.78% (B.Lasso), and 28.25% (B.ALasso), respectively. Table 9 shows that the proposed method also provides more accurate estimation of the regression coefficients.

The results under Example 4 are shown in Table 1. With parallel computing, the computer time of RLasso is the lowest. The computer time of B.Lasso falls between that of Lasso and ALasso. Overall, the proposed approach is computationally feasible. All methods have low FNRs. The proposed approach outperforms the alternatives with much lower FDRs. B.Lasso has the lowest RME, although when accounting for variation, the difference may not be significant.

Table 1.

Simulation results under Example 4: mean (sd) over 100 replicates. Time: computer time (seconds) for 100 replicates. For RLasso, B.Lasso, and B.ALasso, parallel computing is applied. RME is 1000×average relative model error. FNR and FDR: %.

σ = 1 σ = 3

Time RME FNR FDR RME FNR FDR
Lasso 51.28 726.29 0.00 26.46 803.12 0.24 40.50
- (125.80) (0.00) (7.81) (110.00) (1.05) (16.70)
ALasso 80.46 995.28 1.04 88.14 1010.01 4.48 89.10
- (126.66) (2.18) (0.54) (135.34) (4.03) (0.39)
Enet 50.79 796.34 0.00 28.11 864.33 0.00 40.70
- (125.87) (0.00) (9.16) (112.04) (0.00) (15.70)
RLasso 38.58 814.91 0.00 46.17 796.01 0.00 47.70
- (127.62) (0.00) (14.43) (244.52) (0.00) (18.90)
B.Lasso 63.45 724.50 0.93 0.00 778.21 2.4 0.46
- (122.43) (3.58) (0.00) (155.63) (4.9) (1.75)
B.ALasso 99.73 914.66 1.13 1.21 904.03 2.68 6.18
- (135.11) (2.23) (2.61) (134.82) (1.25) (4.76)

The results under Example 5 are shown in Table 2. All methods have very satisfactory variable selection and estimation results, as “large n, small p” poses the simplest case. The proposed approach can reduce computer time with parallel computing, especially when n/m is large. Specifically, with 100 replicates, Lasso and Enet take about 2,041 and 2,179 seconds, respectively. The computer time of RLasso is similar. With n/m = 10, 50 and 100, B.Lasso takes 1,054, 238, and 126 seconds, respectively.

Table 2.

Simulation results under Example 5: mean (sd) over 100 replicates. Time: computer time (seconds) for 100 replicates. For RLasso, B.Lasso, and B.ALasso, parallel computing is applied. RME is 1000×average relative model error. FNR and FDR: %.

σ = 1 σ = 3

Time RME FNR FDR RME FNR FDR
Lasso 2040.59 1.23 0.00 0.00 1.40 0.00 0.00
- (0.07) (0.00) (0.00) (0.02) (0.00) (0.00)
ALasso 16985.34 0.01 0.00 4.130 0.01 0.00 4.34
- (0.04) (0.00) (1.38) (0.01) (0.00) (1.37)
Enet 2179.40 1.40 0.00 5.38 1.50 0.00 3.99
- (0.08) (0.00) (0.65) (0.02) (0.00) (1.36)
RLasso 2051.63 1.22 0.00 0.00 1.39 0.00 0.00
- (0.08) (0.00) (0.00) (0.04) (0.00) (0.00)
B.Lasso(n/m = 10) 1053.67 1.23 0.00 0.00 1.39 0.00 0.00
- (0.07) (0.00) (0.00) (0.05) (0.00) (0.00)
B.Lasso(n/m = 50) 237.74 1.28 0.00 0.00 1.41 0.00 0.00
- (0.07) (0.00) (0.00) (0.04) (0.00) (0.00)
B.Lasso(n/m = 100) 125.76 1.29 0.00 0.00 1.43 0.00 0.00
- (0.09) (0.00) (0.00) (0.07) (0.00) (0.00)
B.ALasso(n/m = 10) 8540.32 0.01 0.00 1.70 0.01 0.00 3.52
- (0.01) (0.00) (1.68) (0.03) (0.00) (1.56)
B.ALasso(n/m = 50) 1788.64 0.06 0.00 0.70 0.07 0.00 1.85
- (0.01) (0.00) (1.23) (0.04) (0.00) (1.49)
B.ALasso(n/m = 100) 896.37 0.06 0.00 0.56 0.08 0.00 0.82
- (0.03) (0.00) (0.47) (0.05) (0.00) (0.60)

The results under Example 6, which is the most difficult scenario, are presented in Table 3. With a large sample size, all methods have satisfactory variable selection results. The proposed approach leads to smaller RMEs, although the differences between different methods are not significant. With parallel computing, it can significantly reduce computer time. Specifically, with 100 replicates, Lasso and Enet take 126,758 and 127,049 seconds, respectively. The computer time of ALasso is much higher. RLasso takes 104,550 seconds. With n/m = 10 and 100, respectively, B.ALasso takes 60,794 and 6,064 seconds.

Table 3.

Simulation results under Example 6: mean (sd) over 100 replicates. Time: computer time (seconds) for 100 replicates. For RLasso, B.Lasso, and B.ALasso, parallel computing is applied. RME is 1000×average relative model error. FNR and FDR: %.

σ = 1 σ = 3

Time RME FNR FDR RME FNR FDR
Lasso 126758.13 12.73 0.00 1.16 15.94 0.00 1.31
- (1.66) (0.00) (0.58) (2.17) (0.00) (0.37)
ALasso 379411.76 11.88 0.00 26.52 14.29 0.00 28.22
- (1.34) (0.00) (2.72) (1.88) (0.00) (2.98)
Enet 127049.21 12.03 0.00 4.52 13.83 0.00 6.94
- (1.36) (0.00) (0.71) (1.75) (0.00) (1.12)
RLasso 104549.65 12.21 0.00 0.00 14.99 0.00 0.66
- (1.48) (0.00) (0.00) (0.04) (0.00) (0.34)
B.Lasso(n/m = 10) 60793.56 9.43 0.00 0.00 11.71 0.00 0.00
- (1.47) (0.00) (0.00) (1.67) (0.00) (0.00)
B.Lasso(n/m = 100) 6064.22 11.25 0.00 0.00 12.86 0.00 0.00
- (1.62) (0.00) (0.00) (1.71) (0.00) (0.00)
B.ALasso(n/m = 10) 181163.23 8.79 0.00 0.00 10.84 0.00 0.00
- (1.33) (0.00) (0.00) (1.64) (0.00) (0.00)
B.ALasso(n/m = 100) 18161.72 10.28 0.00 0.00 12.40 0.00 0.62
- (1.77) (0.00) (0.00) (1.64) (0.00) (0.13)

5 Data analysis

5.1 Data descriptions

Mice gene expression data

This study was first reported in Scheetz et al. (2006) and later analyzed in several follow-up studies. In this study, F1 mice were intercrossed, and 120 twelve-week-old male offspring were selected for tissue harvesting from the eyes and for microarray analysis. The microarray chips used for profiling contain 31,042 probes. The intensity values are normalized using the robust multi-chip averaging (RMA) method to generate summary expression values. To generate more reliable results, we conduct screening. First, probe sets are removed if they are not expressed in the eye or lack sufficient variation. The definition of “expressed” is based on the empirical distribution of the RMA normalized values. More specifically, for a probe set to be considered expressed, the maximum expression value observed for that probe is required to be greater than the 25th percentile of the entire set of RMA expression values. For a probe to be considered “sufficiently variable”, it needs to have at least two-fold variation in expression level. A total of 18,976 probes pass this unsupervised screening. The analysis goal, as suggested in Scheetz et al. (2006), is to identify genes correlated with gene TRIM32, which causes Bardet-Biedl syndrome – a genetically heterogeneous disease of multiple organ systems, including the retina. Following published studies (Huang et al., 2008), we further conduct the following supervised screening. We compute the correlation coefficient of each gene expression with that of TRIM32 and select the top 1,000 genes with the largest absolute correlations. The selected gene expressions are then normalized to have zero means and unit variances.
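
A minimal sketch of the supervised screening step is given below; expr (the 120 × 18,976 matrix of RMA values passing the unsupervised screening) and trim32 (the TRIM32 expression vector) are assumed objects.

```r
# Supervised screening: keep the 1,000 probes most correlated with TRIM32, then standardize.
cors <- cor(expr, trim32)                             # correlation of each probe with TRIM32
keep <- order(abs(cors), decreasing = TRUE)[1:1000]   # top 1,000 probes by |correlation|
X <- scale(expr[, keep])                              # zero mean, unit variance
y <- as.vector(trim32)
```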

Diabetes data

This dataset has been analyzed in Efron et al. (2004) and others and is available in the R package lars. Detailed descriptions, which are available in multiple published studies, are omitted here. Briefly, this dataset has n = 442 and p = 10 (n ≫ p) and is an example of a “classic” dataset. The response variable is a quantitative measure of disease progression one year after baseline, and the covariates include age (X1), sex (X2), body mass index (X3), average blood pressure (X4), and six blood serum measurements (X5–X10).

CHFS data

With China’s fast economic growth, household income has increased significantly during the past decades. It is of significant interest to study the patterns and determinants of household consumption in China. The analyzed dataset has been generated from the China Household Finance Survey (CHFS), which has been conducted by the Survey and Research Center of China Household Finance at Southwestern University of Finance and Economics in China. This survey is the only nationally representative survey in China that has detailed information on household finance and assets, including financial assets, business assets, housing, household income and expenditures, and other household assets. Detailed information on the dataset and the survey study is available at http://www.chfsdata.org as well as in publications such as Gan et al. (2013). We remove records with missing measurements and also conduct other data preprocessing. The analyzed dataset contains 7,825 records and 987 variables. The response variable of interest is the logarithm of household consumption expenditure (including expenditure on food, clothing, housing renovation, heating, and household durables). More information on the 986 covariates is available from the authors.

5.2 Analysis results

The three datasets are analyzed using the methods described in Sections 3.1–3.3, respectively. We apply the proposed B.Lasso and B.ALasso as well as the alternative methods previously described. When applying the proposed methods to the CHFS data, we set m = n/10. We observe computational feasibility and reductions in computer time for the proposed approach comparable to those in the simulation (details omitted). We further examine the estimation results and show a summary in Table 4. For the diabetes dataset, which has a small number of covariates, we provide the estimation results of all methods in Table 11 (Appendix). For the mice gene expression dataset and the CHFS dataset, we provide the B.Lasso estimation results. Results using the other methods are available from the authors. Table 4 suggests that the proposed approach identifies fewer covariates, which is usually preferred in practice. For example, for the CHFS dataset, B.Lasso and B.ALasso identify 57 and 50 covariates, respectively, compared to 66 (Lasso), 58 (ALasso), 60 (Enet), and 59 (RLasso). Comparing the estimation results (see, for example, Table 11; more available from the authors) suggests that different methods lead to different nonzero estimates. We have manually examined the selected variables and their estimates (“directions” as well as magnitudes) and found that the analysis results are reasonably meaningful. To avoid distraction, we omit further discussion. As there is a lack of an objective measure of the “degree of meaningfulness”, we conduct a cross-validation based prediction evaluation, which can provide indirect support for the estimation and identification results. Specifically, a dataset is split into a training set and a testing set with sizes in a 2:1 ratio. A model is generated using the training data and used to make predictions for the testing data. With 100 random splits, we compute the average prediction MSE (PMSE) and show the results in Table 4. The proposed B.Lasso and B.ALasso have slightly better prediction performance.
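
A minimal sketch of this prediction evaluation is given below; fit_fun is a hypothetical placeholder for any of the compared methods and is assumed to return an intercept alpha and a coefficient vector beta.

```r
# Average prediction MSE over 100 random 2:1 training/testing splits.
pmse_eval <- function(X, y, fit_fun, n_splits = 100) {
  n <- nrow(X)
  mses <- replicate(n_splits, {
    train <- sample(n, round(2 * n / 3))               # 2:1 training/testing split
    fit <- fit_fun(X[train, ], y[train])
    pred <- fit$alpha + as.vector(X[-train, ] %*% fit$beta)
    mean((y[-train] - pred)^2)
  })
  mean(mses)
}
```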

Table 4.

Summary data analysis results. cov: number of selected covariates.

cov PMSE
Mice gene expression data
Lasso 62 0.000297
ALasso 49 0.000274
Enet 51 0.000283
RLasso 49 0.000275
B.Lasso 47 0.000251
B.ALasso 40 0.000218

Diabetes data
Lasso 8 0.538
ALasso 8 0.529
Enet 10 0.540
RLasso 8 0.532
B.Lasso 8 0.510
B.ALasso 7 0.498

CHFS data
Lasso 66 0.99
ALasso 58 0.94
Enet 60 0.95
RLasso 59 0.93
B.Lasso 57 0.82
B.ALasso 50 0.74

6 Discussion

Data collected in more recent studies are getting bigger and bigger. Statistical analyses that can be easily conducted with small datasets become problematic with big datasets. In this study, we consider regularized estimation, especially penalization, which has been commonly conducted in the recent literature. We have developed a bootstrap penalization approach. The development has been motivated by the fact that in practice we often have access to a large number of small computers as opposed to a big one. The proposed approach has been shown to significantly improve computational feasibility and reduce computer time with the assistance of parallel computing. In addition, it has several desirable “byproducts”. Specifically, the number of selected important covariates is not limited by the sample size, which is a limitation of ordinary penalization. In addition, by analyzing a smaller number of covariates and samples, the stability of the estimates can be improved. Simulation provides convincing evidence of the computational superiority of the proposed approach. In addition, it is also shown to have superior variable selection and estimation results. Although we do not have a rigorous proof, we conjecture that the superiority in selection and estimation may be explained by the lack of computational stability when directly applying penalization to big datasets. It is noted that a similar advantage has also been observed in the RLasso study. In data analysis, the computational advantage of the proposed approach is also observed. In addition, it identifies a smaller number of important covariates (which can lead to more stable models) with better prediction performance.

The proposed approach may not always be applicable. Specifically, when bootstrapping covariates, it requires that the covariates can be separated into weakly correlated blocks. This can be infeasible (consider, for example, the compound symmetry correlation structure). The proposed approach involves more steps, and hence more tunings, than the straightforward application of penalization. However, it should be noted that some parameters, for example B2 and B3 in Section 3.1, have a negligible impact on the analysis results. Others, such as K, k1, and k2 in Section 3.1, can be determined well in a data-dependent manner. In addition, it should be noted that these parameters are determined sequentially as opposed to simultaneously, and thus the computational cost is affordable. We have described the proposed approach using linear regression and Lasso-based penalties. The Lasso penalty can be replaced by many other penalties and potentially other regularization methods. The linear model can also be replaced by other models, for example generalized linear models, mixed effects models, and nonlinear models. Such extensions demand additional numerical studies and are postponed to future research. In this article, we have focused on methodological development and numerical study. The theoretical aspects have not been examined and are expected to be challenging. Some recent studies have examined the bias and variance problems associated with the bootstrap; see, for example, Janitza et al. (2016). A closer examination suggests that such results may not be directly applicable. However, it may be of interest to examine our proposed method along the lines of Janitza et al. (2016). Another potentially interesting theoretical aspect is that, when the covariate blocks are not completely uncorrelated, the penalized estimates from the bootstrap runs are expected to be biased. The degree of bias is difficult to quantify. The final estimates are weighted averages with data-dependent weights. Sophisticated probability and empirical process techniques will be needed to examine the asymptotic properties.

Acknowledgments

We thank the editor and reviewers for careful review and insightful comments, which have led to a significant improvement of the article. This study was supported by the NIH (CA204120), National Natural Science Foundation of China (71471152, 71201139, 71303200, 71301162), and National Social Science Foundation of China (13&ZD148, 13CTJ001).

Appendix

Table 5.

Variable selection results under Example 1: mean (sd) percentage over 100 replicates.

(n = 50, σ = 1) (n = 50, σ = 3) (n = 50, σ = 6)

FNR FDR FNR FDR FNR FDR

Lasso 2.17 23.82 18.32 16.96 37.17 8.43
(5.89) (3.36) (16.74) (10.41) (14.9) (10.19)
ALasso 2.07 21.75 18.19 17.86 31.90 12.17
(5.56) (6.23) (13.39) (8.73) (11.32) (9.38)
Enet 1.25 24.46 16.06 19.17 34.68 9.61
(4.37) (2.35) (15.55) (9.30) (16.30) (10.37)
RLasso 1.14 16.00 2.14 21.17 2.82 21.93
(3.89) (8.34) (5.12) (5.81) (5.94) (5.56)
B.Lasso 0.00 0.00 3.25 0.00 21.22 0.00
(0.00) (0.00) (8.67) (0.00) (18.92) (0.00)
B.ALasso 0.00 0.00 1.39 0.00 12.96 0.00
(0.00) (0.00) (4.56) (0.00) (14.44) (0.00)

(n = 100, σ = 1) (n = 100, σ = 3) (n = 100, σ = 6)

FNR FDR FNR FDR FNR FDR

Lasso 0.00 21.86 8.78 21.14 28.08 12.71
(0.00) (5.25) (11.12) 1(6.95) (16.31) (11.33)
ALasso 0.00 21.75 9.46 20.86 25.86 14.61
(0.00) 21.75 (9.75) (6.75) (11.96) (8.89)
Enet 0.00 24.25 6.47 22.86 27.32 12.96
(0.00) (2.75) (9.92) (5.42) (15.73) (10.84)
RLasso 0.14 14.82 0.28 22.03 1.71 23.28
(0.14) (9.01) (2.01) (5.51) (4.66) (3.95)
B.Lasso 0.00 0.00 0.00 0.00 9.51 0.00
(0.00) (0.00) (0.00) (0.00) (14.87) (0.00)
B.ALasso 0.00 0.00 0.00 0.00 7.32 0.00
(0.00) (0.00) (0.00) (0.00) (11.90) (0.00)

Table 6.

Coefficient estimation and variable selection under Example 1 (n = 100, σ = 1): summarized over 100 replicates.

β1 β2 β3 β4 β5 β6 β7 β8

True value 3 −2 1.5 −1 2 −0.8 0 0
Lasso
est. 2.952 −1.948 1.474 −0.977 1.947 −0.751 −0.005 0.008
sd (0.224) (0.223) (0.209) (0.210) (0.234) (0.223) (0.093) (0.093)
% of selected 100 100 100 100 100 100 98 90
ALasso
est. 2.886 −1.883 1.403 −0.907 1.883 −0.685 −0.005 0.009
sd (0.239) (0.231) (0.233) (0.236) (0.235) (0.223) (0.090) (0.090)
% of selected 100 100 100 100 100 100 88 83
Enet
est. 2.948 −1.945 1.481 −0.985 1.955 −0.758 −0.005 0.009
sd (0.226) (0.222) (0.210) (0.211) (0.232) (0.220) (0.094) (0.094)
% of selected 100 100 100 100 100 100 99 94
RLasso
est. 2.911 −1.905 1.355 −0.854 1.719 −0.498 −0.001 0.003
sd (0.237) (0.231) (0.252) (0.265) (0.276) (0.281) (0.045) (0.050)
% of selected 100 100 100 100 100 100 55 57
B.Lasso
est. 2.960 −1.955 1.486 −0.989 1.956 −0.761 0.000 0.000
sd (0.222) (0.218) (0.214) (0.214) (0.231) (0.219) (0.000) (0.000)
% of selected 100 100 100 100 100 100 0 0
B.ALasso
est. 2.945 −1.941 1.466 −0.970 1.933 −0.737 0.000 0.000
sd (0.223) (0.222) (0.220) (0.220) (0.231) (0.219) (0.000) (0.000)
% of selected 100 100 100 100 100 100 0 0

Table 7.

Variable selection results under Example 2: mean (sd) percentage over 100 replicates.

σ = 1 σ = 3 σ = 6

FNR FDR FNR FDR FNR FDR
Lasso 0.00 71.41 24.37 44.71 40.54 39.98
(0.00) (2.06) (15.74) (28.29) (22.74) (20.81)
ALasso 0.00 71.71 15.22 67.84 33.07 8.51
(0.00) (1.99) (10.21) (4.04) (61.85) (6.86)
Enet 0.00 72.91 23.66 41.51 35.20 52.71
(0.00) (1.81) (13.68) (27.81) (5.07) (19.56)
RLasso 0.45 62.59 3.61 68.40 14.30 70.98
(1.99) (3.66) (1.79) (1.96) (3.52) (1.75)
B.Lasso 0.34 26.72 1.10 28.78 19.07 25.46
(2.08) (9.45) (4.51) (7.10) (13.97) (7.78)
B.ALasso 0.00 34.54 2.57 28.25 12.72 28.15
(0.00) (5.66) (5.39) (6.83) (9.54) (6.76)

Table 8.

Variable selection results under Example 3: mean (sd) percentage over 100 replicates.

σ = 1 σ = 3 σ = 6

FNR FDR FNR FDR FNR FDR
Lasso 0.39 79.11 22.70 45.32 27.00 42.68
(2.86) (8.55) (7.33) (23.42) (7.50) (23.00)
ALasso 0.00 80.17 14.74 76.6 24.37 73.25
(0.00) (1.78) (10.50) (4.55) (7.66) (4.82)
Enet 0.39 81.53 22.9 53.12 21.76 49.20
(2.86) (2.74) (5.73) (16.09) (6.54) (18.74)
RLasso 0.00 73.7 1.41 80.46 15.70 82.01
(0.00) (5.17) (1.42) (1.52) (2.81) (1.20)
B.Lasso 0.64 16.31 3.21 20.22 17.37 22.01
(3.77) (12.38) (7.93) (11.15) (11.57) (11.11)
B.ALasso 0.00 23.77 2.21 20.31 15.28 22.69
(0.00) (11.28) (5.69) (11.67) (11.35) (11.50)

Table 9.

Relative model errors (× 1000) under Example 1–3: mean (sd) over 100 replicates.

OLS Lasso ALasso Enet RLasso B.Lasso B.ALasso
Example 1
n = 50, p = 8 σ = 1 162.7 179.4 189.6 173.7 194.2 132.2 138.9
(61.4) (73.5) (74.6) (67.3) (69.8) (60.3) (61.9)
σ = 3 164.4 190.1 170.6 181.8 143.4 128.1 125.1
(77.6) (78.1) (73.9) (75.3) (59.9) (65.9) (64.4)
σ = 6 161.1 123.1 108.9 118.2 99.3 108.3 91.0
(71.8) (76.0) (45.6) (43.5) (49.3) (58.3) (45.1)
n = 100, p = 8 σ = 1 69.5 70.1 77.2 70.2 78.7 53.6 55.9
(33.1) (33.2) (36.3) (33.2) (48.0) (30.3) (31.8)
σ = 3 76.5 87.9 84.1 85.7 74.2 58.9 61.2
(33.7) (40.8) (37.3) (39.9) (32.9) (27.8) (29.7)
σ = 6 75.7 76.4 64.5 73.2 56.7 54.0 49.7
(41.1) (37.7) (29.8) (33.7) (31.7) (31.3) (27.7)

Example 2
n = 100, p = 40 σ = 1 396.7 363.9 394.8 378.1 161.6 142.3 103.2
(79.3) (80.6) (89.1) (82.1) (26.3) (26.2) (39.8)
σ = 3 412.5 406.9 359.2 423.6 143.3 129.9 130.7
(52.3) (67.7) (57.8) (66.1) (37.4) (35.9) (36.2)
σ = 6 393.9 176.7 224.5 173.6 144.7 160.5 131.8
(81.1) (43.7) (59.1) (47.4) (33.6) (35.7) (36.1)

Example 3
n = 100, p = 40 σ = 1 403.1 340.4 343.1 362.4 147.2 121.1 99.1
(93.5) (110.7) (90.6) (104.7) (60.1) (63.1) (45.6)
σ = 3 407.1 246.4 261.2 235.1 94.2 76.6 81.7
(71.6) (65.1) (50.2) (46.3) (32.0) (43.4) (34.0)
σ = 6 401.2 146.2 179.8 138.3 79.1 73.4 69.1
(49.0) (43.4) (50.9) (46.5) (24.1) (28.2) (28.5)

Table 10.

Analysis of the mice gene expression data: probes identified using B.Lasso and their estimates.

probe coefficient probe coefficient probe coefficient
1369407_at 0.001087 1369913_at 0.001335 1371052_at 0.000691
1371151_at −0.000816 1371257_at −0.001045 1374302_at 0.001462
1375230_at 0.001735 1377501_at 0.001539 1378367_at 0.000702
1378392_at 0.001229 1378962_at 0.000942 1379506_at −0.003850
1383468_at 0.000856 1383577_at 0.000326 1387060_at 0.001826
1387902_at 0.000576 1390636_at 0.000974 1393751_at 0.000537
1377950_at 0.001009 1378003_at 0.000855 1378493_at 0.000877
1379023_at 0.001678 1379607_at −0.000807 1379748_at 0.000687
1379830_at 0.000195 1380332_at 0.000709 1381646_at 0.002420
1382263_at 0.001464 1382452_at 0.003584 1382633_at 0.007533
1383841_at 0.003024 1384049_at 0.001230 1385043_at 0.001226
1386061_at 0.001682 1386794_at 0.001871 1391654_at 0.001045
1392605_at −0.001035 1392692_at 0.000774 1393499_at 0.000559
1393684_at 0.001139 1393735_at 0.001442 1393746_at 0.001188
1393817_at 0.004705 1394459_at 0.001127 1394689_at 0.000228
1395237_at 0.001247 1395415_at 0.000967 1395716_at 0.001844
1395772_at 0.000639 1396257_at 0.000548 1396775_at 0.001194
1397489_at 0.000867 1398568_at 0.001297 1398594_at 0.000837
1398736_at 0.000917

Table 11.

Analysis of the diabetes data: estimated regression coefficients. “-” represents variables not selected.

variable Lasso ALasso Enet RLasso B.Lasso B.ALasso
x1 −0.035 − 0.001 −0.049 - - -
x2 −0.186 −0.170 −0.201 −0.039 −0.149 −0.164
x3 0.251 0.269 0.254 0.299 0.243 0.264
x4 0.198 0.183 0.211 0.190 0.167 0.182
x5 −0.086 −0.475 −0.468 - −0.334 −0.332
x6 - 0.318 0.297 0.032 0.203 0.202
x7 −0.179 - −0.023 −0.087 −0.068 -
x8 - - 0.044 0.009 - -
x9 0.379 0.551 0.514 0.436 0.498 0.500
x10 0.084 0.066 0.099 0.027 0.050 0.051

Table 12.

Analysis of the CHFS data: variables identified using B.Lasso and their estimates.

Variable coefficient Variable coefficient
household size 0.012 years of education of household head 0.029
if house is rented* −0.112 having more than one housing property* 0.026
value of house 0.113 number of bedrooms 0.062
distance from home to the nearest downtown area −0.065 size of house 0.045
if living in the city* 0.078 if have full ownership* 0.059
having loan from bank* −0.221 amount of unpaid loan −0.074
interest rate of loan −0.017 if borrow besides from bank* −0.076
amount of financial compensation for relocation 0.169 land expropriated* 0.045
having financial products* 0.049 transferred income 0.013
amount of financial compensation for land expropriated 0.121 welfare subsidized* −0.140
owning a car* 0.175 buying car using bank loan* −0.047
having vehicle insurance* 0.096 vehicle rental revenue 0.032
having health insurance* 0.063 five guarantee subsidized* −0.264
consumption willingness 0.023 CPI forecast −0.044
having credit card(s)* 0.127 shopping payment by debit card* 0.055
paying credit card bill on time* 0.110 amount of rent received 0.016
having other loan beyond car loan and mortgage* −0.211 income in last year 0.597
current market value of stocks owned 0.146 having non-RMB assets* 0.009
leasing out a vehicle* 0.065 amount lent to others 0.186
making money from stock market* 0.063 number of funds held 0.019
expenditure higher than normal* −0.119 monthly loan payment −0.057
having financial derivatives* 0.027 income growing faster than CPI* 0.061
expenditure growing faster than CPI* 0.089 having stock account* 0.132
having foreign currency deposit* 0.118 having deposit in saving account* 0.189
having deposit in checking account* 0.165 time invested in funds 0.054
* Binary variable; “no” is the baseline.

References

  • 1. Breiman L. Random forests. Machine Learning. 2001;45:5–32.
  • 2. Bickel PJ, Götze F, van Zwet WR. Resampling fewer than n observations: gains, losses, and remedies for losses. Statistica Sinica. 1997;7:1–31.
  • 3. Cevher V, Becker S, Schmidt M. Convex optimization for big data: scalable, randomized, and parallel algorithms for big data analytics. IEEE Signal Processing Magazine. 2014;31:32–43.
  • 4. Dunn JC. Well separated clusters and fuzzy partitions. Journal of Cybernetics. 1974;4:95–104.
  • 5. Efron B, Hastie T, Johnstone I, Tibshirani R. Least angle regression. The Annals of Statistics. 2004;32:407–499.
  • 6. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96:1348–1360.
  • 7. Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2008;70:849–911. doi: 10.1111/j.1467-9868.2008.00674.x.
  • 8. Fan J, Song R. Sure independence screening in generalized linear models with NP-dimensionality. The Annals of Statistics. 2010;38:3567–3604.
  • 9. Fan J, Han F, Liu H. Challenges of Big Data analysis. National Science Review. 2014;1:293–314. doi: 10.1093/nsr/nwt032.
  • 10. Gan L, Yin Z, Jia N, Xu S, Ma S, Zheng L. Data You Need to Know about China: Research Report of China Household Finance Survey 2012. Springer; 2013.
  • 11. Handl J, Knowles J, Kell DB. Computational cluster validation in post-genomic data analysis. Bioinformatics. 2005;21:3201–3212. doi: 10.1093/bioinformatics/bti517.
  • 12. Huang J, Ma S, Zhang CH. Adaptive Lasso for sparse high-dimensional regression models. Statistica Sinica. 2008;18:1603–1618.
  • 13. Jacobs A. The pathologies of big data. Communications of the ACM. 2009;52:36–44.
  • 14. Janitza S, Binder H, Boulesteix AL. Pitfalls of hypothesis tests and model selection on bootstrap samples: causes and consequences in biometrical applications. Biometrical Journal. 2016;58(3):447–473. doi: 10.1002/bimj.201400246.
  • 15. Kleiner A, Talwalkar A, Sarkar P, Jordan M. A scalable bootstrap for massive data. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2014;76:795–816.
  • 16. Li L, Dennis Cook R, Nachtsheim CJ. Model free variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2005;67:285–299.
  • 17. Meinshausen N, Bühlmann P. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2010;72:417–473.
  • 18. Politis D, Romano J, Wolf M. Subsampling. Springer; 1999.
  • 19. Richtarik P, Takac M. Parallel coordinate descent methods for big data optimization. arXiv preprint arXiv:1212.0873. 2012.
  • 20. Schadt EE, Linderman MD, Sorenson J, Lee L, Nolan GP. Computational solutions to large-scale data management and analysis. Nature Reviews Genetics. 2010;11:647–657. doi: 10.1038/nrg2857.
  • 21. Scheetz TE, Kim KYA, Swiderski RE, Philp AR, Braun TA, Knudtson KL, Dorrance AM, DiBona GF, Huang J, Casavant TL, Sheffield VC, Stone EM. Regulation of gene expression in the mammalian eye and its relevance to eye disease. Proceedings of the National Academy of Sciences. 2006;103:14429–14434. doi: 10.1073/pnas.0602562103.
  • 22. Tibshirani R. Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society: Series B. 1996;58:267–288.
  • 23. Wang S, Nan B, Rosset S, Zhu J. Random Lasso. The Annals of Applied Statistics. 2010;5:468–485. doi: 10.1214/10-AOAS377.
  • 24. Zhang CH. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics. 2010;38:894–942.
  • 25. Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2005;67:301–320.
  • 26. Zou H. The adaptive lasso and its oracle properties. Journal of the American Statistical Association. 2006;101:1418–1429.
