PLoS One. 2022 Dec 1;17(12):e0278570. doi: 10.1371/journal.pone.0278570

Hi-LASSO: High-performance Python and Apache Spark packages for feature selection with high-dimensional data

Jongkwon Jo 1, Seungha Jung 1, Joongyang Park 1, Youngsoon Kim 1,*, Mingon Kang 2,*
Editor: Sathishkumar V E
PMCID: PMC9714948  PMID: 36455001

Abstract

High-dimensional LASSO (Hi-LASSO) is a powerful feature selection tool for high-dimensional data. Our previous study showed that Hi-LASSO outperformed other state-of-the-art LASSO methods. However, the substantial cost of bootstrapping and the lack of experiments on a parametric statistical test for feature selection have impeded practical applications of Hi-LASSO. In this paper, the Python package and its Spark library are efficiently designed in a parallel manner for practical use on real-world problems, and they provide the capability of parametric statistical tests for feature selection on high-dimensional data. We demonstrate Hi-LASSO's superior performance through intensive experiments in practical settings. With these packages, Hi-LASSO can be performed efficiently and easily for feature selection. Hi-LASSO packages are publicly available at https://github.com/datax-lab/Hi-LASSO under the MIT license. The packages can be easily installed with Python PIP, and additional documentation is available at https://pypi.org/project/hi-lasso and https://pypi.org/project/Hi-LASSO-spark.

Introduction

The Least Absolute Shrinkage and Selection Operator (LASSO) and its derivatives have been widely used as powerful linear regression-based feature selection tools that identify a subset of relevant variables in model construction [1]. LASSO’s major derivatives include ElasticNet [2], Adaptive LASSO [3], Relaxed LASSO [4], and Precision LASSO [5], as well as bootstrapping-based LASSOs, such as Random LASSO [6] and Recursive Random LASSO [7]. LASSO is a popular feature selection approach for high-dimensional data in various fields, such as Internet of Things, social media, and engineering research [8, 9].

We recently proposed a high-dimensional LASSO (Hi-LASSO) that theoretically improves the predictive power and feature selection performance on High-Dimension, Low-Sample-Size (HDLSS) data [10]. Hi-LASSO (1) alleviates bias introduced by bootstrapping, (2) satisfies the global oracle property, and (3) provides a Parametric Statistical Test for Feature Selection in Bootstrap regression modeling (PSTFSboot). However, despite the outstanding feature selection performance of Hi-LASSO, the substantial cost of bootstrapping and the lack of experiments on the parametric statistical test have impeded its practical applications.

In this paper, we introduce Hi-LASSO packages implemented in Python and Apache Spark, which improve efficiency through parallelism, as scalable and practical tools for feature selection with HDLSS data. We assessed Hi-LASSO's feature selection performance with PSTFSboot, which was not thoroughly explored in the original paper due to the expensive computational cost of extensive bootstrapping. We also provide insights into optimal hyper-parameter settings based on various simulation experiments.

Materials and methods

Hi-LASSO is a linear regression-based feature selection model that achieves outstanding performance in both prediction and feature selection on high-dimensional data by theoretically improving Random LASSO. In this section, we first introduce Random LASSO and its limitations, and then present how Hi-LASSO improves upon it.

Random LASSO and its limitations

Random LASSO introduced bootstrapping for robust analysis of high-dimensional data. Random LASSO consists of two bootstrapping procedures [6]. The first procedure computes importance scores of predictors by approximating weights of variables over multiple bootstrap samples. The second procedure estimates the coefficients of the linear model on bootstrap samples weighted by the importance scores, so that predictors with higher importance scores have a higher chance of being selected. The final coefficient estimates are then computed by averaging the multiple estimates from the bootstraps. Random LASSO handles multicollinearity with different signs and can identify more non-zero variables than the sample size.
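For concreteness, the following is a schematic Python sketch of the two-procedure bootstrap scheme described above, under simplifying assumptions (a fixed LASSO penalty and equal bootstrap counts in both procedures); names are illustrative and this is not the authors' reference implementation.

import numpy as np
from sklearn.linear_model import Lasso

def random_lasso_sketch(X, y, q1, q2, B, penalty=1.0, seed=None):
    """Schematic sketch of the two bootstrap procedures of Random LASSO [6]."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # Procedure 1: importance scores from B bootstrap LASSO fits
    # on random subsets of q1 predictors.
    coef_sum = np.zeros(p)
    for _ in range(B):
        rows = rng.integers(0, n, size=n)             # bootstrap the observations
        cols = rng.choice(p, size=q1, replace=False)  # draw q1 candidate predictors
        fit = Lasso(alpha=penalty).fit(X[np.ix_(rows, cols)], y[rows])
        coef_sum[cols] += fit.coef_
    importance = np.abs(coef_sum) / B                 # unselected draws count as zero
    # Procedure 2: re-estimate coefficients, now sampling predictors
    # with probability proportional to their importance scores.
    probs = importance / importance.sum()
    coef_final = np.zeros(p)
    for _ in range(B):
        rows = rng.integers(0, n, size=n)
        cols = rng.choice(p, size=q2, replace=False, p=probs)
        fit = Lasso(alpha=penalty).fit(X[np.ix_(rows, cols)], y[rows])
        coef_final[cols] += fit.coef_
    # Averaging over all B bootstraps implicitly zeroes never-selected
    # predictors, which is the systematic bias discussed below.
    return coef_final / B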

However, several issues remain. First, Random LASSO sets the coefficients of predictors to zero even when those predictors are never selected in the bootstrapping. The unselected predictors might have been estimated with non-zero coefficients had they been selected. Therefore, it introduces a systematic bias regardless of the importance of the predictors. Moreover, the smaller the number of bootstrap iterations in Random LASSO, the greater this systematic bias, since more variables may never be selected. Note that the number of bootstrap iterations directly affects the computational cost of Random LASSO. Second, Random LASSO does not take advantage of the global oracle property. Although Random LASSO uses bootstrapping with weights proportional to the importance scores of predictors in the second procedure, the final coefficients are estimated without the weights. Random LASSO could adopt Adaptive LASSO to attain the oracle property; however, Adaptive LASSO would then take the local weights of each bootstrap in Random LASSO, where the local oracle property may vary depending on which predictors are involved in the bootstrap. Finally, Random LASSO does not provide a statistical test to identify a set of features from the multiple bootstrapping results. Random LASSO relies on a heuristic threshold, for example, the reciprocal of the sample size, without a statistical test, although the feature selection results substantially depend on this threshold.

Hi-LASSO improves Random LASSO

Hi-LASSO tackles the aforementioned limitations of Random LASSO. The contributions of Hi-LASSO are as follows. (1) Hi-LASSO rectifies the systematic bias that Random LASSO introduces by refining the computation of importance scores: the coefficients of unselected predictors are treated as missing values in the bootstrapping procedures rather than as zeros. (2) Hi-LASSO computes the importance score of a variable by averaging the absolute values of its coefficient estimates. The coefficient of a predictor may be assigned a different value or the opposite sign in linear models fitted with different subsets of predictors; in particular, multicollinearity with different signs often causes coefficient estimates of opposite signs across bootstraps. Therefore, taking the absolute value of the sum of the coefficient estimates, as Random LASSO does, may deflate the importance score. (3) Hi-LASSO provides a statistical strategy to determine the number of bootstraps. This determination is crucial to ensure performance on high-dimensional data, since some predictors may never be considered due to the nature of random sampling, no matter how important they are to the model. Thus, Hi-LASSO uses a number of bootstraps sufficient for all predictors to be taken into account. (4) Hi-LASSO takes advantage of the global oracle property by adopting Adaptive LASSO [3] in the second procedure. (5) Hi-LASSO uses parametric statistical tests for feature selection in bootstrap regression modeling (PSTFSboot). PSTFSboot allows Hi-LASSO to robustly perform feature selection from multiple bootstrapping results as a filter feature selection method, whereas most LASSO models are wrapper-based feature selection methods [9].
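To make contribution (2) concrete, let \hat\beta_j^{(b)} denote the estimate of coefficient j in the b-th bootstrap (zero if predictor j is not selected), and let B_j be the number of bootstraps in which predictor j is actually selected (unselected cases are treated as missing in Hi-LASSO, per contribution (1)). The notation is introduced here for illustration; [10] gives the exact definitions. The two importance scores are then

I_j^{\text{Random}} = \Bigl|\tfrac{1}{B}\textstyle\sum_{b=1}^{B}\hat\beta_j^{(b)}\Bigr|, \qquad I_j^{\text{Hi-LASSO}} = \tfrac{1}{B_j}\textstyle\sum_{b=1}^{B_j}\bigl|\hat\beta_j^{(b)}\bigr|.

By the triangle inequality, the absolute value of a sum is never larger than the sum of absolute values, so sign-flipping estimates under multicollinearity shrink the Random LASSO score toward zero while leaving the Hi-LASSO score intact.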

Hi-LASSO packages

We provide efficient solutions for performing Hi-LASSO on high-dimensional data as Python and Apache Spark packages. We reduce the high cost of the many independent bootstrap computations by using parallel processing, and we improve the scalability of Hi-LASSO by implementing the algorithm on the Apache Spark engine. The Python package for Hi-LASSO (https://pypi.org/project/hi-lasso) and its Apache Spark version (https://pypi.org/project/hi-lasso-spark) are available through PyPI and can be easily installed using Python PIP:

pip install hi-lasso // installation in Python

pip install hi-lasso-spark // installation in Apache Spark (Spark 3.0.0+)

Sample code and a troubleshooting guide are provided on the documentation pages (https://hi-lasso.readthedocs.io/en/latest/). Hi-LASSO includes the following hyper-parameters: ‘q1’, ‘q2’, ‘L’, ‘alpha’, ‘logistic’, ‘random_state’, and ‘n_jobs’. In Hi-LASSO, each procedure repeats a random selection of qi predictors B times, where the subscript i denotes the first or second procedure, i ∈ {1, 2}. A smaller qi requires more bootstraps, and thus more computation. On the other hand, a large qi may make coefficient estimation less accurate, because randomly selecting a large number of predictors is likely to include more multicollinearity. B is determined by L, the desired average number of times that a predictor is selected during the bootstrapping (roughly B ≈ Lp/qi, since each bootstrap selects qi of the p predictors). At the end of the second procedure, the parametric statistical test for feature selection is performed with a threshold alpha (e.g., 0.01 or 0.05). logistic indicates whether the model is a logistic regression model for binary classification or a linear regression model for regression problems. n_jobs sets the number of cores for parallel processing.
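For reference, below is a minimal usage sketch of the Python package. The hyper-parameter names follow the list above; the import path, class name, and result attributes follow the package documentation, but versions may differ, so this should be read as an illustrative sketch rather than a normative recipe.

import pandas as pd
from hi_lasso.hi_lasso import HiLasso  # import path per the package docs; verify for your version

# X: n samples x p predictors; y: response vector (hypothetical file names).
X = pd.read_csv('X_train.csv')
y = pd.read_csv('y_train.csv').values.ravel()

# q1 and q2 default to the sample size (see "Tuning the hyper-parameters").
model = HiLasso(L=30,            # desired average selection count per predictor
                alpha=0.05,      # significance threshold for PSTFSboot
                logistic=False,  # linear regression; True for binary classification
                random_state=0,
                n_jobs=4)        # number of cores for parallel processing
model.fit(X, y)

coef = model.coef_          # averaged coefficient estimates (attribute names per the docs)
p_values = model.p_values_  # PSTFSboot p-values per predictor
selected = p_values < 0.05  # features retained by the parametric test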

Results

We conducted intensive experiments to assess the performance of the Hi-LASSO packages with the parametric statistical test (PSTFSboot): (1) We performed a simulation study by generating various dimensional data, where feature selection performance and efficiency were assessed; (2) We assessed Hi-LASSO by using semi-real datasets based on TCGA cancer data and analyzed hyper-parameter settings and performance in practice; and (3) We further assessed the robustness of Hi-LASSO.

Simulation study

We assessed the feature selection performance of Hi-LASSO with PSTFSboot by comparing it with current state-of-the-art LASSO methods: LASSO, ElasticNet, Adaptive, Relaxed, Random, Recursive, and Precision. We generated synthetic data under six different scenarios, Dataset I through Dataset VI, where the numbers of predictors and samples vary and the ground truth of relevant features is known (S1 File). We measured F1-scores, where relevant variables (|β|>0) were considered positive and irrelevant variables (|β| = 0) negative in a confusion matrix. F1-scores show how accurately a model identifies the set of relevant features [11]. Note that the original Hi-LASSO paper assessed feature selection performance using F1-scores with a threshold that maximizes the Root Mean Square Error (RMSE) of the validation data, without a parametric statistical test, whereas this study conducted the experiments with a further feature selection process that statistically combines the bootstrapping results (i.e., PSTFSboot) and does not require validation data.
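As an illustration of the metric, the evaluation reduces to comparing the support of the ground-truth coefficients with the selected feature set; a minimal sketch (function name illustrative):

import numpy as np
from sklearn.metrics import f1_score

def feature_selection_f1(beta_true, selected_mask):
    # Relevant variables (|beta| > 0) are positives,
    # irrelevant variables (beta = 0) are negatives.
    y_true = (np.abs(beta_true) > 0).astype(int)    # ground-truth relevance
    y_pred = np.asarray(selected_mask).astype(int)  # model's selected features
    return f1_score(y_true, y_pred)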

We tuned the hyper-parameters of the benchmark models. We optimized the hyper-parameters of Precision and Relaxed LASSO as their original papers proposed. For Hi-LASSO, we set L to 30 and q1 and q2 to the sample size. For a fair comparison, we set the hyper-parameters q and B in Random LASSO and Recursive Random LASSO to the same values as in Hi-LASSO. For the other benchmark models, the optimal L1- or L2-norm regularization hyper-parameter (λ) was obtained by minimizing the prediction error with inner 5-fold cross-validation on the training data.

We repeated the experiments ten times with randomly generated simulation data for reproducibility. The experimental results are shown in Table 1 and Fig 1A. Overall, Hi-LASSO outperformed the benchmark models on most of the datasets, producing the highest F1-scores in all experiments except Dataset III. Moreover, as sample sizes increased, Hi-LASSO consistently showed superior performance for feature selection with high-dimensional data, with F1-scores at least 10–20% higher.

Table 1. Experimental results for feature selection performance (F1-scores) with the simulation data.

Dataset LASSO ElasticNet Adaptive Relaxed Random Recursive Precision Hi-LASSO
Dataset I (p = 100, n = 50) 0.6494±0.069 0.6837±0.082 0.6283±0.080 0.5446±0.190 0.5856±0.051 0.4118±0.084 0.6455±0.232 0.7256±0.073
Dataset II (p = 100, n = 100) 0.5582±0.179 0.5648±0.179 0.6171±0.116 0.5302±0.235 0.4956±0.063 0.4724±0.056 0.5457±0.166 0.8110±0.034
Dataset III (p = 1,000, n = 100) 0.1343±0.051 0.2730±0.203 0.1089±0.038 0.1077±0.047 0.3092±0.048 0.0827±0.038 0.2366±0.160 0.2697±0.089
Dataset IV (p = 1,000, n = 200) 0.5236±0.054 0.5494±0.073 0.4659±0.046 0.4944±0.043 0.4813±0.033 0.1446±0.080 0.6289±0.188 0.8406±0.037
Dataset V (p = 10,000, n = 200) 0.3489±0.039 0.3660±0.051 0.2122±0.063 0.2882±0.050 0.1657±0.018 0.0050±0.011 0.4003±0.306 0.7117±0.021
Dataset VI (p = 10,000, n = 400) 0.3224±0.033 0.3361±0.030 0.2497±0.070 0.2958±0.034 0.1132±0.004 0.0330±0.029 0.7320±0.085 0.8295±0.039

The highest F1-scores are highlighted in bold. p is the number of variables, and n is the sample size.

Fig 1. Experimental results. (A) Comparison of F1-scores with the simulation data, (B) improvement of efficiency of the Hi-LASSO Python package and the Spark version, (C) comparison of F1-scores with the semi-real simulation data, and (D) robustness of feature selection.

We also assessed the efficiency of the Hi-LASSO packages. Fig 1B shows the speedup of Hi-LASSO with parallel processing in Python and Spark, compared to the implementation from the original paper, using large-scale simulated data (Dataset V) on a machine with Intel Xeon Gold processors (24 cores × 2). The experimental results show that the Python and Apache Spark packages achieved speedups of 3.75 and 4.83 times, respectively, and the Spark version was approximately two times faster than the Python version. Details of the execution times are in S2 File.

Semi-real simulation study based on TCGA cancer data

We further conducted a semi-real simulation study based on cancer data, including Glioblastoma Multiforme (GBM), Low-Grade Gliomas (LGG), breast cancer (BRCA), and ovarian cancer (OV), from The Cancer Genome Atlas (TCGA) repository (https://www.cancer.gov/tcga). We downloaded the cancer genomic data from https://www.cbioportal.org. For the semi-real simulation study, we used gene expression (e.g., microarray or RNA-seq) as independent variables and generated a response variable as follows: (1) we performed correlation analysis between survival months and gene expression; (2) we selected the 20 genes with the highest correlation with survival months; (3) the regression coefficients (β^) of the 20 genes were randomly generated from the normal distribution N(μ=4, σ2=1), preserving the sign of the corresponding correlation coefficient; (4) the coefficients of the other genes were set to 0; and (5) the synthetic survival months (the response variable) were generated from the linear combination of the gene expression (X) and the ground-truth coefficients (β^), plus errors (ϵ) drawn from a normal distribution with mean zero and standard deviation equal to that of the logarithmic survival months, i.e., y=Xβ^+ϵ.
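A sketch of this generation procedure, following steps (1)–(5) above, is given below; variable names are illustrative, survival months are assumed positive, and the exact preprocessing of the TCGA expression matrices is as described in the text.

import numpy as np

def make_semi_real_response(X, survival_months, k=20, seed=None):
    # X: n x p expression matrix; survival_months: length-n positive vector.
    rng = np.random.default_rng(seed)
    n, p = X.shape
    # (1)-(2): correlation of each gene with survival months; take the top k.
    corr = np.array([np.corrcoef(X[:, j], survival_months)[0, 1] for j in range(p)])
    top = np.argsort(np.abs(corr))[-k:]
    # (3)-(4): coefficients ~ N(mu=4, sigma^2=1), signed by the correlation;
    # all remaining coefficients are zero.
    beta = np.zeros(p)
    beta[top] = np.sign(corr[top]) * rng.normal(loc=4.0, scale=1.0, size=k)
    # (5): errors with standard deviation equal to that of the log survival months.
    sigma = np.std(np.log(survival_months))
    y = X @ beta + rng.normal(0.0, sigma, size=n)
    return y, beta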

We then calculated F1-scores to evaluate the feature selection performance on the semi-real data (Table 2 and Fig 1C). The experimental results showed that Random, Recursive, and Precision produced F1-scores close to zero, whereas LASSO, ElasticNet, and Adaptive showed F1-scores between 0.0 and 0.4. Remarkably, Hi-LASSO presented F1-scores between 0.5 and 0.72. The numbers of non-zero variables identified by the benchmark methods are shown in parentheses in Table 2. Hi-LASSO consistently identified 10∼17 relevant variables, close to the 20 non-zero variables in the ground truth. In contrast, Random and Precision tended to select large numbers of variables (>1,000 non-zeros), which yielded relatively high recall but precision close to zero. This may be because Random LASSO uses the reciprocal of the sample size as a threshold for feature selection, and Precision LASSO is optimized for regression problems rather than feature selection. Recursive Random LASSO tends to introduce extreme bias in the first bootstrapping for feature selection, which makes it often fail to identify relevant features. LASSO, ElasticNet, Adaptive, and Relaxed showed unstable performance, with large variance in the numbers of non-zero variables on the semi-real data.

Table 2. Experimental results for feature selection performance (F1-scores and numbers of non-zero variables in parentheses) with the semi-simulated data based on four TCGA cancer datasets.

Averages over the experiments are shown; bold face indicates the best performance.

Dataset LASSO ElasticNet Adaptive Relaxed Random Recursive Precision Hi-LASSO
BRCA 0.0000±0.000 0.0000±0.000 0.0000±0.000 0.2380±0.050 0.0058±0.001 0.0000±0.000 0.0078±0.001 0.6165±0.067
(p = 20,212, n = 1,099) (0.1±0.3) (0.1±0.3) (1.0±0.0) (144.9±37.6) (5286.9±222.9) (139.5±16.1) (1689.4±28.0) (17.8±3.2)
GBM 0.2160±0.027 0.2107±0.032 0.2515±0.057 0.3695±0.038 0.0151±0.000 0.0395±0.003 0.0134±0.001 0.7231±0.046
(p = 12,042, n = 524) (167.8±22.9) (173.8±28.9) (143.0±43.5) (88.8±11.8) (2445.8±35.5) (82.0±8.5) (1825.2±216.9) (12.6±1.1)
LGG 0.3995±0.242 0.3815±0.230 0.3412±0.281 0.4341±0.088 0.0101±0.001 0.0000±0.000 0.0050±0.001 0.5013±0.047
(p = 20,167, n = 529) (34.5±29.2) (41.3±28.2) (43.2±49.8) (68.8±22.4) (3221.0±75.8) (80.6±10.5) (2893.0±626.5) (13.6±2.3)
OV 0.2530±0.038 0.2389±0.024 0.2985±0.064 0.5120±0.036 0.0151±0.001 0.0311±0.003 0.0273±0.005 0.7213±0.054
(p = 12,042, n = 532) (141.6±26.3) (149.2±19.0) (118.5±30.7) (58.5±5.9) (2554.9±79.4) (109.9±14.4) (1491.0±303.9) (16.3±2.4)

Tuning the hyper-parameters

The optimization of the hyper-parameters q1, q2, and L is often critical to the feature selection performance of Hi-LASSO. We investigated how the hyper-parameters affect the performance of Hi-LASSO using two simulation datasets (S3 File). We compared F1-scores while varying the values of the hyper-parameters, setting identical values for q1 and q2 (i.e., q = q1 = q2) for the sake of simplicity. We empirically found that the optimal values of q were around the sample size. In general, a larger L improved performance in the experiments; however, L > 50 did not improve performance significantly. Empirically, the optimal value of L was 30, which approximates a normal distribution by the central limit theorem when the distribution is unknown.

Robustness for feature selection

Finally, we evaluated Hi-LASSO’s robustness in feature selection by calculating the pairwise Kuncheva Index (KI) on the semi-real simulation datasets [12]. Hi-LASSO, Random LASSO, and Recursive Random LASSO are based on bootstrapping and may therefore produce different results on every execution. KI measures the overlap between two sets, where one indicates that the two sets are identical and zero indicates no overlap. Hi-LASSO showed a KI of 0.751 on average over the four semi-real cancer datasets, whereas Random and Recursive showed 0.650 and 0.733, respectively (Fig 1D). Note that Hi-LASSO also produced the highest F1-scores on the semi-real simulation data. The best F1-scores together with the highest KIs demonstrate the reliable feature selection performance of Hi-LASSO.
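For reference, a minimal sketch of the pairwise KI computation is given below, assuming, as in the classic Kuncheva definition, two selected sets of equal size k drawn from p features; the adjusted variant in [12] accommodates sets of unequal size.

def kuncheva_index(set_a, set_b, p):
    # Pairwise Kuncheva Index: KI = (r*p - k^2) / (k*(p - k)), with r = |A ∩ B|.
    # KI = 1 for identical sets; for k << p, as in these experiments,
    # disjoint sets give a value close to zero.
    a, b = set(set_a), set(set_b)
    assert len(a) == len(b), "classic KI assumes equal-size sets; see [12] otherwise"
    k, r = len(a), len(a & b)
    return (r * p - k * k) / (k * (p - k))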

Conclusion

We introduced Hi-LASSO packages in Python and Apache Spark, which improve the efficiency and scalability of feature selection with high-dimensional data. The Hi-LASSO packages can be easily installed with Python PIP and can efficiently and effectively analyze high-dimensional data. We demonstrated the outstanding feature selection performance with a parametric statistical test through intensive simulation studies, and provided insights into how to tune the hyper-parameters. Hi-LASSO is a promising feature selection tool applicable in practice to real-world data.

Supporting information

S1 File. Simulation study.

(PDF)

S2 File. Performance for efficiency.

(PDF)

S3 File. Tuning the hyper-parameters in Hi-LASSO.

(PDF)

S4 File. Robustness analysis using Kuncheva Index (KI).

(PDF)

Data Availability

All relevant data are within the paper and its Supporting information files. Hi-LASSO packages are publicly available on GitHub at https://github.com/datax-lab/Hi-LASSO under the MIT license. The packages can be easily installed by Python PIP, and additional documentation is available at https://pypi.org/project/hi-lasso and https://pypi.org/project/Hi-LASSO-spark.

Funding Statement

Y.S. Kim is supported for the work by the National Research Foundation of Korea (NRF-2021R1I1A3048029).

References

  • 1. Emmert-Streib Frank, and Matthias Dehmer. High-dimensional LASSO-based computational regression models: Regularization, shrinkage, and selection. Machine Learning and Knowledge Extraction 1.1 (2019): 359–383. doi: 10.3390/make1010021
  • 2. Zou Hui, and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67.2 (2005): 301–320. doi: 10.1111/j.1467-9868.2005.00503.x
  • 3. Zou Hui. The adaptive lasso and its oracle properties. Journal of the American Statistical Association 101.476 (2006): 1418–1429. doi: 10.1198/016214506000000735
  • 4. Meinshausen Nicolai. Relaxed lasso. Computational Statistics and Data Analysis 52.1 (2007): 374–393.
  • 5. Wang Haohan, et al. Precision Lasso: accounting for correlations and linear dependencies in high-dimensional genomic data. Bioinformatics 35.7 (2019): 1181–1187. doi: 10.1093/bioinformatics/bty750
  • 6. Wang Sijian, et al. Random lasso. The Annals of Applied Statistics 5.1 (2011): 468. doi: 10.1214/10-AOAS377
  • 7. Park Heewon, Seiya Imoto, and Satoru Miyano. Recursive random lasso (RRLasso) for identifying anti-cancer drug targets. PLoS One 10.11 (2015): e0141869. doi: 10.1371/journal.pone.0141869
  • 8. Wang Chen, and Liu. Establish algebraic data-driven constitutive models for elastic solids with a tensorial sparse symbolic regression method and a hybrid feature selection technique. Journal of the Mechanics and Physics of Solids, 2022.
  • 9. Subbiah Siva Sankari, and Jayakumar Chinnappan. Opportunities and Challenges of Feature Selection Methods for High Dimensional Data: A Review. Ingénierie des Systèmes d’Information 26.1 (2021).
  • 10. Kim Youngsoon, et al. Hi-LASSO: High-dimensional LASSO. IEEE Access 7 (2019): 44562–44573. doi: 10.1109/ACCESS.2019.2909071
  • 11. Bolón-Canedo Verónica, and Amparo Alonso-Betanzos. Ensembles for feature selection: A review and future trends. Information Fusion 52 (2019): 1–12. doi: 10.1016/j.inffus.2018.11.008
  • 12. Lustgarten Jonathan L., Vanathi Gopalakrishnan, and Shyam Visweswaran. Measuring stability of feature selection in biomedical datasets. AMIA Annual Symposium Proceedings. Vol. 2009. American Medical Informatics Association, 2009.

Decision Letter 0

Sathishkumar V E

Transfer Alert

This paper was transferred from another journal. As a result, its full editorial history (including decision letters, peer reviews and author responses) may not be present.

28 Jul 2022

PONE-D-22-19015 Hi-LASSO: High-performance Python and Apache Spark packages for feature selection with high-dimensional data. PLOS ONE

Dear Dr. Kang,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Sep 11 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Sathishkumar V E

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at 

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and 

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. We note that the grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections do not match. 

When you resubmit, please ensure that you provide the correct grant numbers for the awards you received for your study in the ‘Funding Information’ section.

3. Thank you for stating the following in the Acknowledgments Section of your manuscript: 

"This research was supported by the National Research Foundation of Korea (NRF-2021R1I1A3048029). "

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form. 

Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows: 

"Y.S. Kim is supported for the work by the National Research Foundation of Korea (NRF-2021R1I1A3048029)."

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

4. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information.

5. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: No

Reviewer #2: Yes

Reviewer #3: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: Yes

Reviewer #3: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: No

Reviewer #2: Yes

Reviewer #3: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: No

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The Research Paper stands Rejected and is NOT RECOMMENDED for Publication because of the following strong reasons:

1. The overall presentation and conceptual methodology of the paper is very weak and lots of advanced papers are already published.

2. No Strong analysis and experimental results are observed in the paper.

3. No Novelty is there.

4. It is the work of simple theoretical description but even the actual research orientation is missing in the paper.

Reviewer #2: Few experiments can be repeated or justified for F1 scores. The literature study can be strengthened with more recent papers. The authors can state how the current standards are maintained; materials and methods are not cited with previous works. The authors can consider the below works for better literature:

-Y. Lu, L. Yang, S. X. Yang, Q. Hua, A. K. Sangaiah, T. Guo, and K. Yu, “An Intelligent Deterministic Scheduling Method for Ultra-Low Latency Communication in Edge Enabled Industrial Internet of Things,” IEEE Transactions on Industrial Informatics, 2022, doi: 10.1109/TII.2022.3186891.

J. Wei, Q. Zhu, Q. Li, L. Nie, Z. Shen, K. -K. R. Choo, K. Yu, “A Redactable Blockchain Framework for Secure Federated Learning in Industrial Internet-of-Things”, IEEE Internet of Things Journal, doi: 10.1109/JIOT.2022.3162499.

-Subbiah, S.S. and Chinnappan, J., 2021. Opportunities and Challenges of Feature Selection Methods for High Dimensional Data: A Review. Ingénierie des Systèmes d'Information, 26(1).

-Bolón-Canedo, V. and Alonso-Betanzos, A., 2019. Ensembles for feature selection: A review and future trends. Information Fusion, 52, pp.1-12.

-Y. He, L. Nie, T. Guo, K. Kaur, M. M. Hassan, and K. Yu," A NOMA-Enabled Framework for Relay Deployment and Network Optimization in Double-Layer Airborne Access VANETs," IEEE Transactions on Intelligent Transportation Systems, doi: 10.1109/TITS.2021.3139888.

Reviewer #3: This paper presents a new implementation of a previously published algorithm called Hi-LASSO, with parallel computations that make the algorithm more practical for use with large real-world data sets. It shows experiments on both synthetic and real data that demonstrate the algorithm's utility for feature selection in high-dimensional data sets. Comparisons to a Spark implementation are shown, with performance results indicating the scalability of the method. Finally, the work describes the model's hyperparameters and robustness. I found the paper to be relatively well written overall, with a reasonable order of its sections. This paper will be a good candidate for PLOS ONE with some or all of the revisions suggested below.

I have broken my comments into a few sections focused on the paper, figures, grammar, reproducibility, references, and the code described in the paper.

Paper comments:

- The previous paper on this algorithm used Relative Model Error, Root Mean Square Error, and F1 scores. Why are only F1 scores reported in this work? An explanation of the choice of metric would help strengthen the data.

- The hyperparameters q1, q2, L, alpha should be described in further detail. These are described a little bit in the "tuning" section. However, it would be helpful to know not only the trends in how performance is affected, but also how to choose an initial value for each. It appears there is an "auto" setting in the Python package but that automatic behavior is not described in the paper from what I could tell.

- Were hyperparameters optimized for all LASSO algorithms? How did the authors ensure that all algorithms were fairly assessed? It is surprising to see so many algorithms with F1 scores of zero in the BRCA dataset. Similarly, it is surprising to see the results in Table S5. Is there another dataset that shows a nonzero score for some of the compared algorithms?

- I find it a little hard to believe that Hi-LASSO is this much better than similar algorithms without more information about how each algorithm was run, to ensure fairness in the assessment. Are there cases where Hi-LASSO performs poorly? If so, it would be helpful to include such a case for a baseline. How does Hi-LASSO perform in lower-dimensional cases with more data where other LASSO algorithms have been used in the past? Comparisons like this would help reduce the sense that the datasets are cherry-picked for Hi-LASSO's benefit, and would help to illuminate the contrast between prior art and this algorithm's improvements for specific types of problems.

- Some of the results are a bit surprising, with several comparison methods yielding few or no positive results. This may indicate the selection of overly specific benchmark data sets, or a lack of competitive algorithms for comparison. A bit more explanation of the results in these areas would benefit the reader as well as make the work more defensible. The authors' claim of "extraordinary performance" appears to be somewhat supported by the data that is presented, but it is a little unclear whether this is due to a selective choice of benchmarks. Understanding where the algorithm fails (or performs in an "average" way) is important for readers who wish to make practical use of the package.

- The introduction or conclusions should spend more time contextualizing this algorithm. What fields should consider adopting Hi-LASSO? Genomics may be one such candidate, but other potential applications should be described.

- It would be good to summarize the contributions of each author to the work, perhaps using a standardized framework like CRediT (Contributor Roles Taxonomy).

Figure comments:

- Figure 1 is hard to read and should be higher resolution - ideally a vector graphic format like PDF or EPS. Same for supplementary figures S1, S2.

- Figure 1(B) could be replaced by a scatter plot showing weak scaling performance for the process parallel and Spark implementations from 1 core to the number of cores in the benchmarking machine. This could be for one dataset, or a geometric average of a few datasets. Weak scaling plots are far more useful to understand computational efficiency than a raw speedup chart with no clear baseline. It's not clear if the speedup is linear with the number of cores, which a weak scaling plot would help indicate.

Grammatical / typographical comments:

- Line 24: "impeded to apply Hi-LASSO for practical applications" should say "impeded practical applications of Hi-LASSO"

- Line 111: should say "desired average number of times"

- Line 156: "the Apache version" should say "the Apache Spark version." Apache Spark (or Spark for short) is the proper name of the library -- not just "Apache."

- Line 198: missing a subscript on q1

References / reproducibility comments:

- The TCGA data sets should be cited. https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga/using-tcga/citing-tcga

- In accordance with the PLOS ONE "Exceptions to sharing materials" (https://journals.plos.org/plosone/s/materials-software-and-code-sharing), the "authors should include a statement in their Materials and Methods discussing any restrictions on availability or use." It appears the TCGA data is subject to controlled access. This should be made clear to the reader, with information about how to access these controlled datasets (if possible) in order to make the results reproducible.

- The code used to generate synthetic Datasets I - IV does not appear to be included in the linked GitHub repository (I looked in the benchmark models and sample data directories). That should be included to meet PLOS ONE data sharing policies, along with a script to execute the code in the benchmark models directory for all benchmarks on the synthetic data.

- Check the capitalization of journal names and article titles in the references section. Some have unexpected lowercase letters.

- Please cite all relevant scientific software packages used in the hi_lasso software, such as NumPy and SciPy. See https://numpy.org/citing-numpy/ and https://scipy.org/citing-scipy/ for examples.

Code comments:

- Line 116 of the paper: Rather than describing both "parallel" and "n_jobs", just let "n_jobs" default to 1 (the serial case). Then only one parameter is needed, and "parallel" can be removed. A special value of "n_jobs is None" or "n_jobs == 0" could use the number of CPU cores returned by "multiprocessing.cpu_count()" for automatic parallelization across all available cores.

- The choice of the MIT license is good for future works to build on this one!

- Could the Spark and non-Spark libraries be combined, or make the Spark library use the base Python library as a dependency? The two code paths look fairly unrelated right now.

- The "simulation_data" folder on GitHub could include a README that indicates where the data came from or how it was generated.

It is my hope that the authors will consider adapting this algorithm for inclusion in a popular toolkit such as scikit-learn after publication. It seems like a helpful algorithm.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2022 Dec 1;17(12):e0278570. doi: 10.1371/journal.pone.0278570.r002

Author response to Decision Letter 0


17 Nov 2022

We attached the response letter, but the below is a text version:

Editor

[Concern #1] Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming.

Response: We carefully checked PLOS ONE’s style requirements, including file naming. We separately uploaded figures and supplementary documents.

[Concern #2] We note that the grant information you provided in the ‘Funding Information’ and ‘Financial Disclosure’ sections do not match.

Response: We correctly updated the grant information in the submission system.

[Concern #3] Thank you for stating the following in the Acknowledgments Section of your manuscript:

"This research was supported by the National Research Foundation of Korea (NRF-2021R1I1A3048029). "

We note that you have provided funding information that is not currently declared in your Funding Statement. However, funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form. Please remove any funding-related text from the manuscript and let us know how you would like to update your Funding Statement. Currently, your Funding Statement reads as follows:

"Y.S. Kim is supported for the work by the National Research Foundation of Korea (NRF-2021R1I1A3048029)."

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

Response: We appreciate it. We removed the funding sentence from the acknowledgments and added it to the online submission form.

[Concern #4] Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly.

Response: We added the supporting information at the end of the manuscript, as requested.

We also moved several sections on “package installation” from the supplementary material to the main manuscript on Page 3.

[Concern #5] Please review your reference list to ensure that it is complete and correct.

Response: We carefully checked the reference and updated all incorrect ones.

Reviewer # 1

[Concern #1] The overall presentation and conceptual methodology of the paper is very weak and lots of advanced papers are already published. No Strong analysis and experimental results are observed in the paper. No Novelty is there. It is the work of simple theoretical description but even the actual research orientation is missing in the paper.

Response: We apologize for the confusion. The manuscript is an “application note” rather than “original research”; it introduces useful Python and Apache Spark libraries of Hi-LASSO for high-dimensional feature selection. Nevertheless, the manuscript demonstrates an innovative concept, Hi-LASSO with a statistical significance test and an efficient parallel implementation of bootstrapping, and we conducted intensive experiments showing that the Python and Apache Spark libraries produce outstanding feature selection performance compared to current benchmark LASSO models.

The original paper demonstrated the capability of LASSO as both a regression model and a feature selection approach, whereas this paper mainly improves the parametric statistical tests for feature selection on high-dimensional data. Moreover, the original Hi-LASSO paper assessed feature selection performance using F1-scores with a threshold that maximizes the Root Mean Square Error (RMSE) of the validation data, without a parametric statistical test, whereas this study conducted the experiments with a further feature selection process that statistically combines the bootstrapping results (i.e., PSTFSboot) and does not require validation data. Furthermore, we introduce practical settings for tuning the hyper-parameters of Hi-LASSO with various experiments, and we show the robustness of Hi-LASSO in the experiments. Therefore, we believe that this application note will be impactful and valuable as a general feature selection tool for a number of applications.

Reviewer # 2

[Concern #1] Few experiments can be repeated or justified for f1 scores. The literature study can strengthen with more recent papers. The authors can state how the current standards are maintained, materials and methods are not cited with previous works. the authors can consider the below works for better literature

Response: We appreciate the constructive comment. Feature selection can be evaluated with various metrics. Since we used simulation data where the ground truths are known, we used the F1-score, which is a balanced measure of precision and recall. F1-scores show how accurately Hi-LASSO can select the true features in the models. We considered relevant variables (|β| > 0) as positive and irrelevant variables (|β| = 0) as negative in a confusion matrix. We cited the literature [11], as the reviewer suggested, on Page 4.

We also added the sentence below to clarify Hi-LASSO’s advantage in Page 3 using the reference [9] that the reviewer suggested, as below:

“PSTFSboot allows Hi-LASSO to robustly perform feature selection from multiple bootstrapping results, as a filter feature selection, while most LASSO models are wrapper-based feature selection [9]”

The original paper of Hi-LASSO assessed the feature selection performance using F1-scores by a threshold that maximizes the Root Mean Square Error (RMSE) of the validation data without a parametric statistical test, whereas this study conducted the experiments with further feature selection process that statistically combines bootstrapping results (i.e., using PSTFSboot), which does not require validation data.

[9] Subbiah, Siva Sankari, and Jayakumar Chinnappan. Opportunities and Challenges of Feature Selection Methods for High Dimensional Data: A Review. Ingénierie des Systèmes d’Information 26.1 (2021).

[11] Bolón-Canedo, Verónica, and Amparo Alonso-Betanzos. Ensembles for feature selection: A review and future trends. Information Fusion 52 (2019): 1–12.

Reviewer # 3

[Concern #1] This paper presents a new implementation of a previously published algorithm called Hi-LASSO, with parallel computations that make the algorithm more practical for use with large real-world data sets. It shows experiments on both synthetic and real data that demonstrate the algorithm's utility for feature selection in high-dimensional data sets. Comparisons to a Spark implementation are shown, with performance results indicating the scalability of the method. Finally, the work describes the model's hyperparameters and robustness. I found the paper to be relatively well written overall, with a reasonable order of its sections. This paper will be a good candidate for PLOS ONE with some or all of the revisions suggested below.

Response: We appreciate the nice summary of the contribution in the study.

[Concern #2] The previous paper on this algorithm used Relative Model Error, Root Mean Square Error, and F1 scores. Why are only F1 scores reported in this work? An explanation of the choice of metric would help strengthen the data.

Response: We appreciate the constructive comment and apologize for the confusion. The original paper demonstrated the capability of LASSO as both a regression model and a feature selection approach, whereas this paper mainly improves the parametric statistical tests for feature selection on high-dimensional data. F1-scores show how accurately Hi-LASSO can select the true features in the models. Note that the original Hi-LASSO paper assessed feature selection performance using F1-scores with a threshold that maximizes the Root Mean Square Error (RMSE) of the validation data, without a parametric statistical test, whereas this study conducted the experiments with a further feature selection process that statistically combines the bootstrapping results (i.e., PSTFSboot) and does not require validation data. Meanwhile, lower Relative Model Error and Root Mean Square Error were already demonstrated in the previous paper. We updated the justification for the choice of the metric on Page 4 as below:

“Note that the original paper of Hi-LASSO assessed the feature selection performance using F1-scores by a threshold that maximizes the Root Mean Square Error (RMSE) of the validation data without a parametric statistical test, whereas this study conducted the experiments with further feature selection process that statistically combines bootstrapping results (i.e., using PSTFSboot), which does not require validation data.”

[Concern #3] The hyperparameters q1, q2, L, alpha should be described in further detail. These are described a little bit in the "tuning" section. However, it would be helpful to know not only the trends in how performance is affected, but also how to choose an initial value for each. It appears there is an "auto" setting in the Python package but that automatic behavior is not described in the paper from what I could tell.

Response: We apologize for the confusion. “Tuning of hyper-parameters” was described in detail in the supplementary material, but we have now added the section to the main manuscript on Page 6, because it is important information for parameter tuning. In that section, we provide not only the trends in how performance is affected, but also how to choose initial (default) values. We also changed the “auto” setting to “default” in the Python package to avoid confusion. We describe how we chose the default values in the “Tuning the hyper-parameters” section, as below:

“The optimization of the hyper-parameters, q_1, q_2, and L, is often critical to the performance of feature selection in Hi-LASSO. We investigated how the hyper-parameters affect the performance of Hi-LASSO using the two simulation data (S3 File). We compared F1-scores by varying values of the hyper-parameters, where we set the identical values for q_1 and q_2 (i.e., q = q_1 = q_2) for the sake of simplicity. We empirically found that the optimal values of q were around the sample size. Generally, the larger L improved the performance in the experiments. However, L > 50 does not improve the performance significantly in the experiments. Empirically, the optimal value of L was 30, which approximates a normal distribution by the central limit theorem, when the distribution is unknown.”

[Concern #4] Were hyperparameters optimized for all LASSO algorithms? How did the authors ensure that all algorithms were fairly assessed? It is surprising to see so many algorithms with F1 scores of zero in the BRCA dataset. Similarly, it is surprising to see the results in Table S5. Is there another dataset that shows a nonzero score for some of the compared algorithms?

Response: We apologize for the confusion. We followed the hyper-parameter optimization strategies for Precision and Relaxed LASSO, as their original papers proposed. For Hi-LASSO, we set L to 30 and q1 and q2 to the sample size, according to the hyper-parameter tuning experiments (we moved the section from the supplementary material to the main manuscript). Random LASSO and Recursive Random LASSO do not provide instructions for hyper-parameters, so we set the same hyper-parameters as Hi-LASSO. For the other benchmark models, the optimal L1- or L2-norm regularization hyper-parameter (λ) was obtained by minimizing the prediction error with inner 5-fold cross-validation on the training data. We repeated the experiments ten times with randomly generated simulation data for reproducibility. We updated this information in the manuscript on Page 4.

Regarding the zero F1-scores on BRCA, many LASSO variants showed unstable feature selection results: most conventional LASSOs seldom identified relevant features, while Random/Precision identified too many non-zero features. Note that the datasets are semi-real data, where the ground-truth non-zero variables are known. This may be due to the hyper-parameter λ, which was optimized with inner cross-validation on the training data. The benchmark models performed relatively well on the other datasets (GBM, LGG, and OV), so we do not believe that the implementation was incorrect. In contrast, Hi-LASSO with PSTFSboot performed stable feature selection on all datasets. Table S5 was mistakenly included in the supplementary material; it is not a relevant experimental result.

[Concern #5] I find it a little hard to believe that Hi-LASSO is this much better than similar algorithms without more information about how each algorithm was run, to ensure fairness in the assessment. Are there cases where Hi-LASSO performs poorly? If so, it would be helpful to include such a case for a baseline. How does Hi-LASSO perform in lower-dimensional cases with more data where other LASSO algorithms have been used in the past? Comparisons like this would help reduce the sense that the datasets are cherry-picked for Hi-LASSO's benefit, and would help to illuminate the contrast between prior art and this algorithm's improvements for specific types of problems.

Response: We apologize for the confusion. We believe that our implementation and experiments were fair to all the benchmark models, as we elucidated the experimental settings and repeated the simulation experiments multiple times. We also clarified how we tuned the hyper-parameters for each model. Although Hi-LASSO with PSTFSboot showed the best performance in most experiments, there was a case where Random LASSO was better (Table 1).

Hi-LASSO is mainly designed for High-Dimensional, Low-Sample-Size (HDLSS) data, so we did not consider lower-dimensional data in the experiments (lower-dimensional data is outside the scope of this study). However, we expect that Hi-LASSO is still a useful model for lower-dimensional data, because we provide a statistical test for feature selection using PSTFSboot.

We did not cherry-pick datasets. The simulation setting is widely used for simulation studies of LASSO, and we considered a wide variety of settings for high-dimensional, low-sample-size data.

We considered the four semi-synthetic datasets based on TCGA because they have very large sample sizes (and are also very high-dimensional) compared to the other cancer datasets.

[Concern #6] Some of the results are a bit surprising, with several comparison methods yielding few or no positive results. This may indicate the selection of overly specific benchmark data sets, or a lack of competitive algorithms for comparison. A bit more explanation of the results in these areas would benefit the reader as well as make the work more defensible. The authors' claim of "extraordinary performance" appears to be somewhat supported by the data that is presented, but it is a little unclear whether this is due to a selective choice of benchmarks. Understanding where the algorithm fails (or performs in an "average" way) is important for readers who wish to make practical use of the package.

Response: We apologize for the confusion. We considered simulated data and semi-real datasets for the assessment. The experimental settings are conventional for simulation studies, so we do not believe that the experimental datasets are biased in favor of our model.

We considered seven benchmark LASSO models, including conventional LASSO (1996), ElasticNet (2005), Adaptive (2006), Relaxed (2007), Random (2011), Recursive (2015), and Precision (2019), which are representatives of LASSO variations.

[Concern #7] The introduction or conclusions should spend more time contextualizing this algorithm. What fields should consider adopting Hi-LASSO? Genomics may be one such candidate, but other potential applications should be described.

Response: We appreciate the constructive comment. Applications in any field involving high-dimensional data can leverage Hi-LASSO. We added the following to the main manuscript on Page 2:

“LASSO is a popular feature selection approach for high-dimensional data in various fields, such as biomedical, Internet of Things, social media, and engineering research [8, 9]”

[8] Wang, Chen, and Liu. Establish algebraic data-driven constitutive models for elastic solids with a tensorial sparse symbolic regression method and a hybrid feature selection technique. Journal of the Mechanics and Physics of Solids, 2022.

[9] Subbiah, Siva Sankari, and Jayakumar Chinnappan. Opportunities and Challenges of Feature Selection Methods for High Dimensional Data: A Review. Ingénierie des Systèmes d'Information 26.1 (2021).

[Concern #8] It would be good to summarize the contributions of each author to the work, perhaps using a standardized framework like CRediT (Contributor Roles Taxonomy).

Response: We added an "Author contributions" section on Page 7.

[Concern #9] Figure 1 is hard to read and should be higher resolution - ideally a vector graphic format like PDF or EPS. Same for supplementary figures S1, S2.

Response: We appreciate the constructive comment. We updated the figure on Page 5 to show the experimental results clearly; it is now rendered at 300 dpi.

[Concern #10] Figure 1(B) could be replaced by a scatter plot showing weak scaling performance for the process parallel and Spark implementations from 1 core to the number of cores in the benchmarking machine. This could be for one dataset, or a geometric average of a few datasets. Weak scaling plots are far more useful to understand computational efficiency than a raw speedup chart with no clear baseline. It's not clear if the speedup is linear with the number of cores, which a weak scaling plot would help indicate.

Response: We appreciate the constructive comment. We updated Figure 1B to show the speedup over varying numbers of processors (1-96) and compared the efficiency of the Spark and Python versions against the single-process baseline from the initial paper.
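For reference, a minimal sketch of such a speedup plot follows; the runtimes below are illustrative placeholders, not our measured results.

    # Minimal sketch: speedup over processor counts against a
    # single-process baseline. Runtimes are placeholders.
    import matplotlib.pyplot as plt

    cores = [1, 2, 4, 8, 16, 32, 64, 96]
    runtime = [960, 500, 260, 140, 80, 50, 35, 30]  # seconds (placeholder)
    speedup = [runtime[0] / t for t in runtime]     # T1 / Tp

    plt.plot(cores, speedup, marker="o", label="measured")
    plt.plot(cores, cores, linestyle="--", label="ideal (linear)")
    plt.xlabel("Number of processors")
    plt.ylabel("Speedup (T1 / Tp)")
    plt.legend()
    plt.show()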

[Concern #11] Grammatical / typographical comments:

- Line 24: "impeded to apply Hi-LASSO for practical applications" should say "impeded practical applications of Hi-LASSO"

- Line 111: should say "desired average number of times"

- Line 156: "the Apache version" should say "the Apache Spark version." Apache Spark (or Spark for short) is the proper name of the library -- not just "Apache."

- Line 198: missing a subscript on q1

Response: We appreciate the comments. We corrected the grammatical errors.

[Concern #12] References / reproducibility comments:

- The TCGA data sets should be cited.

Response: We appreciate the comment. We added a link to the dataset in a footnote.

[Concern #13]

- In accordance with the PLOS ONE "Exceptions to sharing materials" (https://journals.plos.org/plosone/s/materials-software-and-code-sharing), the "authors should include a statement in their Materials and Methods discussing any restrictions on availability or use." It appears the TCGA data is subject to controlled access. This should be made clear to the reader, with information about how to access these controlled datasets (if possible) in order to make the results reproducible.

Response: We appreciate the comments. We downloaded the cancer genomic data from https://www.cbioportal.org, which does not have access restrictions. We added the following statement on Page 5:

“We downloaded the cancer genomic data from https://www.cbioportal.org.”

[Concern #14] The code used to generate synthetic Datasets I - IV does not appear to be included in the linked GitHub repository (I looked in the benchmark models and sample data directories). That should be included to meet PLOS ONE data sharing policies, along with a script to execute the code in the benchmark models directory for all benchmarks on the synthetic data.

Response: We added the source code for generating the simulation data, as well as the benchmark models, at https://github.com/datax-lab/Hi-LASSO.
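For convenience, a minimal sketch of a conventional LASSO simulation setting in the same style is shown below; the exact configurations of Datasets I-IV are in the repository, and the constants here are illustrative only.

    # Minimal sketch: conventional simulation data for LASSO studies,
    # with AR(1)-correlated predictors and a sparse true coefficient
    # vector. Constants are illustrative, not those of Datasets I-IV.
    import numpy as np

    n, p, rho = 100, 500, 0.5
    rng = np.random.default_rng(42)
    # AR(1)-style correlation: corr(x_i, x_j) = rho^|i-j|
    cov = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    beta = np.zeros(p)
    beta[:15] = rng.uniform(1.0, 3.0, 15)   # 15 relevant features
    y = X @ beta + rng.normal(0.0, 1.0, n)  # Gaussian noise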

[Concern #15] Check the capitalization of journal names and article titles in the references section. Some have unexpected lowercase letters.

Response: We have checked the journal names and article titles in the references section.

[Concern #16]

- Please cite all relevant scientific software packages used in the hi_lasso software, such as NumPy and SciPy. See https://numpy.org/citing-numpy/ and https://scipy.org/citing-scipy/ for examples.

Response: We cited all relevant scientific software packages used in the hi_lasso software, such as glmnet, NumPy, and SciPy, at https://hi-lasso.readthedocs.io.

[Concern #17] Line 116 of the paper: Rather than describing both "parallel" and "n_jobs", just let "n_jobs" default to 1 (the serial case). Then only one parameter is needed, and "parallel" can be removed. A special value of "n_jobs is None" or "n_jobs == 0" could use the number of CPU cores returned by "multiprocessing.cpu_count()" for automatic parallelization across all available cores.

Response: We appreciate the constructive comment. We removed the 'parallel' parameter and revised the 'n_jobs' parameter accordingly. The code is available at:

https://github.com/datax-lab/Hi-LASSO/blob/master/hi_lasso/hi_lasso.py
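For clarity, the intended behavior of the revised parameter is sketched below; the helper name is illustrative, not necessarily the exact code in hi_lasso.py.

    # Minimal sketch of the revised parallelism control: a single
    # n_jobs parameter defaulting to 1 (serial), where None or 0
    # means "use all available CPU cores".
    import multiprocessing

    def resolve_n_jobs(n_jobs=1):
        """Map the user-facing n_jobs value to a worker count."""
        if n_jobs is None or n_jobs == 0:
            return multiprocessing.cpu_count()  # automatic parallelization
        return max(1, int(n_jobs))              # n_jobs=1 is the serial case

    print(resolve_n_jobs())      # 1 (serial by default)
    print(resolve_n_jobs(None))  # all available cores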

[Concern #18] The choice of the MIT license is good for future works to build on this one!

Response: We appreciate the comment. We have already chosen the MIT license; see: https://github.com/datax-lab/Hi-LASSO/blob/master/LICENSE.

[Concern #19] Could the Spark and non-Spark libraries be combined, or make the Spark library use the base Python library as a dependency? The two code paths look fairly unrelated right now.

Response: We apologize for the confusion. The Spark engine provides essential components and libraries for handling distributed data, and our Spark version is implemented on top of that engine; thus, the Spark and Python versions cannot be combined. Note that the Python version improves efficiency using parallel processing while providing the statistical testing strategy, whereas the Spark version targets distributed data. However, we provide the same interface for both, so that anyone can use the two libraries easily.
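To illustrate the shared-interface idea, a minimal sketch follows; MockHiLasso is a stand-in, not the actual package API, and the exact class names and signatures are in the PyPI documentation.

    # Sketch: because the Python and Spark versions expose the same
    # constructor/fit pattern, client code can swap implementations.
    # MockHiLasso is a placeholder, not the real hi_lasso API.
    import numpy as np

    class MockHiLasso:
        def __init__(self, q1='auto', q2='auto', L=30):  # assumed hyper-parameters
            self.q1, self.q2, self.L = q1, q2, L

        def fit(self, X, y):
            # placeholder estimator; the real packages run bootstrapped LASSO
            self.coef_ = np.linalg.lstsq(X, y, rcond=None)[0]
            return self

    def select_features(model_cls, X, y):
        # identical call pattern for either implementation
        return model_cls().fit(X, y).coef_

    X = np.random.randn(50, 10)
    y = np.random.randn(50)
    print(select_features(MockHiLasso, X, y).shape)  # (10,)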

[Concern #20] The "simulation_data" folder on GitHub could include a README that indicates where the data came from or how it was generated.

Response: We added a README describing how the simulation data are generated. Please check:

https://github.com/datax-lab/Hi-LASSO/blob/master/simulation_data/README.md

[Concern #21] It is my hope that the authors will consider adapting this algorithm for inclusion in a popular toolkit such as scikit-learn after publication. It seems like a helpful algorithm.

Response: We thank the reviewer for the excellent suggestion. We will pursue this in the near future.

Attachment

Submitted filename: Reviewer_Response_Final.pdf

Decision Letter 1

Sathishkumar V E

21 Nov 2022

Hi-LASSO: High-performance Python and Apache Spark packages for feature selection with high-dimensional data

PONE-D-22-19015R1

Dear Dr. Kang,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Sathishkumar V E

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The Revised paper has incorporated all the revisions as mentioned in the last review, and now the paper looks Ok in all aspects. So, the paper stands Accepted with no further revisions.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

**********

Acceptance letter

Sathishkumar V E

24 Nov 2022

PONE-D-22-19015R1

Hi-LASSO: High-performance python and apache spark packages for feature selection with high-dimensional data

Dear Dr. Kang:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Sathishkumar V E

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 File. Simulation study.

    (PDF)

    S2 File. Performance for efficiency.

    (PDF)

    S3 File. Tuning the hyper-parameters in Hi-LASSO.

    (PDF)

    S4 File. Robustness analysis using Kuncheva Index (KI).

    (PDF)

    Attachment

    Submitted filename: Reviewer_Response_Final.pdf

    Data Availability Statement

    All relevant data are within the paper and its Supporting information files. Hi-LASSO packages are publicly available on GitHub at https://github.com/datax-lab/Hi-LASSO under the MIT license. The packages can be easily installed by Python PIP, and additional documentation is available at https://pypi.org/project/hi-lasso and https://pypi.org/project/Hi-LASSO-spark.

