EasyParallel: A GUI platform for parallelization of STRUCTURE and NEWHYBRIDS analyses

Honggang Zhao; Benjamin Beck; Adam Fuller; Eric Peatman

doi:10.1371/journal.pone.0232110

PLoS One. 2020 Apr 24;15(4):e0232110. doi: 10.1371/journal.pone.0232110

Editor: Roberto Fritsche-Neto

PMCID: PMC7182190 PMID: 32330179

Abstract

STRUCTURE and NEWHYBRIDS are widely used population genetic programs for addressing questions related to genetic structure, admixture, and hybridization. These programs usually require a large number of independent runs with many iterations to provide robust data for downstream analyses, which significantly increases computation time. Programs such as Structure_threader and parallelnewhybrid were previously developed to address this problem by processing tasks in parallel on a multi-threaded processor; however, some programming knowledge (e.g., R, Bash) is required to run these programs. We developed EasyParallel as a community resource to facilitate practical and routine population structure and hybridization analyses. The multi-threaded parallelization of EasyParallel allows efficient processing of large genetic datasets, and its point-and-click GUI provides ready access to users who have little experience in script programming. Performance evaluation of EasyParallel using simulated datasets showed similar speed-up and parallel execution time when compared to Structure_threader and Parallelnewhybrid. EasyParallel is written in Python 3 and freely available on GitHub at https://github.com/hzz0024/EasyParallel.

1. Introduction

Recent advances in next-generation sequencing (NGS) technologies and the decreased cost of NGS have led to a rapid accumulation of genetic data for both model and non-model organisms [1]. To accommodate this data explosion, new tools and computation platforms were developed to perform parallelized data analyses [2,3]. However, most of these programs were compiled and executed in command-line based environments (e.g., Linux, R), which could make them less accessible and appealing to users who have little programming background. Moreover, some programs require independent runs with many iterations to provide robust data for downstream analysis, making it time-consuming when the dataset includes a large number of individuals and genetic markers.

One such example is STRUCTURE [4]. This Bayesian clustering approach uses individual genotypes and population allele frequencies to cluster individuals, under the assumptions of Hardy–Weinberg and linkage equilibrium of marker loci within populations [4]. Since its publication, STRUCTURE has been widely applied to address questions related to population structure, species or individual assignment, hybridization, and introgression [5–10]. To minimize the effect of the starting configuration, STRUCTURE needs many iterations during the burn-in process [6]. More importantly, STRUCTURE is usually run with many iterations for different genetic cluster values (K) to determine the optimal number of populations [11], which significantly increases computation time.

Another program requiring a large number of independent runs is NEWHYBRIDS [12]. Using Bayesian model-based clustering and MCMC simulation, NEWHYBRIDS computes the posterior probability that each individual falls into each of several distinct hybrid classes [12]. Although both programs were designed with graphical interfaces and cross-platform compatibility (Linux, Windows, and MacOS), the native GUIs do not facilitate multiple independent analyses. Additionally, parameters and input files must be copied and edited manually between runs, which introduces the potential for human error [13]. To increase the efficiency and speed of running these programs, strategies such as parallel processing and script programming on multiple cores/threads have been previously proposed for STRUCTURE or NEWHYBRIDS analyses [13–16]. Although these strategies are more convenient and efficient than manual runs, some knowledge of programming languages is still needed.

The program EasyParallel presented in this article is provided as a free cross-platform tool that utilizes a multi-thread parallel algorithm for processing multiple iterations of STRUCTURE and NEWHYBRIDS analyses. EasyParallel employs a user-friendly graphical user interface (GUI) and multi-core parallelization for multiple independent runs of a dataset.

2. Materials and methods

2.1 Overview

EasyParallel is freely available at https://github.com/hzz0024/EasyParallel, with installation instructions and a brief demo provided on the Documentation site. EasyParallel requires the command-line versions of the STRUCTURE and NEWHYBRIDS programs. Thus, a user must download the correct version of the target program and load its main directory (with executable files) into EasyParallel. Python is used for directory creation, data processing, parallel runs, and file writing operations. At present, EasyParallel can perform parallel replicate runs for STRUCTURE and NEWHYBRIDS on the MacOS and Windows operating systems, with all source code packaged to run directly without installation. In addition, the open-source design of EasyParallel can be extended to other compatible software that requires multiple iterations or simulations for data analysis.

2.2 Parallel scheme

To achieve parallelism, one intuitive approach is to copy the entire folder n times (where n is the number of runs) and run each copy in parallel. Instead, we use a “single executable multiple working directories” scheme: each subprocess executes the same executable file, but in a different working directory. The “multiple working directories” design is implemented with the subprocess management module (https://docs.python.org/2/library/subprocess.html) of the Python Standard Library, which can set the child's working directory before the child is executed. The benefit of our design is two-fold: 1) we execute the software n times in parallel without having to make n copies of the executable file, since all the child processes share the same executable file and each writes its outputs to an independent directory; 2) the EasyParallel platform is not confined by output constraints (e.g., NEWHYBRIDS does not allow specification of an output directory and generates outputs in the working directory instead). In our design, such constraints are addressed by the designated working directories.
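The “single executable multiple working directories” scheme can be illustrated with a minimal sketch. This is not EasyParallel's source code: the function names `run_replicate` and `run_parallel`, the `run_N` directory names, and the use of a thread pool to cap the number of concurrent runs are all assumptions made for illustration, and the Python interpreter stands in for the STRUCTURE or NEWHYBRIDS executable. The only feature taken from the text is the `cwd` argument of the `subprocess` module, which sets the child's working directory before it starts.

```python
import subprocess
import sys
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_replicate(executable, args, workdir):
    """Run one replicate of `executable` with `workdir` as its working directory."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    # cwd= sets the child's working directory before it starts, so any output
    # written to a relative path lands in this replicate's own folder.
    result = subprocess.run([str(executable), *args], cwd=workdir, check=False)
    return result.returncode


def run_parallel(executable, args, base_dir, n_runs, n_threads):
    """Launch n_runs replicates of one shared executable, n_threads at a time."""
    # Hypothetical layout: one sub-directory per replicate under base_dir.
    workdirs = [Path(base_dir) / f"run_{i + 1}" for i in range(n_runs)]
    # Assumption: a thread pool is enough here because each worker only waits
    # on an external process; it is a stand-in, not the authors' mechanism.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        codes = list(pool.map(lambda d: run_replicate(executable, args, d), workdirs))
    return workdirs, codes


# Stand-in for the real program: the Python interpreter writes a file into
# whatever directory it was started in, as NEWHYBRIDS does with its outputs.
demo_args = ["-c", "open('output.txt', 'w').write('done')"]
base = tempfile.mkdtemp()
workdirs, codes = run_parallel(sys.executable, demo_args, base, n_runs=4, n_threads=2)
```

The executable is passed as an absolute path, so changing the working directory does not affect where it is found; only the location of the outputs changes from run to run.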

2.3 User-friendly GUI

For the EasyParallel graphical user interface (GUI), we provide a progress bar and a window to show the status of parallelization (Fig 1). Because both STRUCTURE and NEWHYBRIDS require specific parameters to run, the software interface for each module was designed to support parameter modification (e.g., number of repeats and threads used for parallel execution). In addition, the user can specify the location of additional datasets and parameter input files in an intuitive and convenient manner (e.g., by dragging the mainparams and extraparams files directly onto the EasyParallel GUI for STRUCTURE analysis). If parameter files are not supplied by the user, the default parameter files taken from the target program are used.

2.4 Execution time analyses

We used two datasets, available in Pina-Martins et al. [15] and Wringe et al. [16], to evaluate the execution time and speed gain of EasyParallel in STRUCTURE and NEWHYBRIDS analyses, respectively. We used the GUI version for execution time analyses. Four laptops with various core architectures (2, 4, and 6 physical cores) and different operating systems (Windows and MacOS) were used for performance comparison: Lenovo Y510, Windows 10, 2.4 GHz Intel Core i7-4700MQ with 8 GB RAM, 4 physical cores with 8 logical threads (i7 4700MQ); Lenovo Y700, Windows 10, 2.6 GHz Intel Core i7-6700HQ with 8 GB RAM, 4 physical cores with 8 logical threads (i7 6700HQ); MacBook Pro, OS 10.14, 2.7 GHz Intel Core i5 with 16 GB RAM, 2 physical cores (MacPro i5); MacBook Pro, OS 10.14, 2.6 GHz Intel Core i7 with 16 GB RAM, 6 physical cores (MacPro i7). The test file used for STRUCTURE analysis consisted of 100 individuals and 80 single nucleotide polymorphism (SNP) loci (8,000 genotypes in total, with no missing data). This dataset was initially crafted based on data from the 1,000 Genomes Project (The 1,000 Genomes Project Consortium, 2015) and is available in the program’s repository. STRUCTURE was run using the admixture model with correlated allele frequencies and a burn-in period of 5 × 10⁴ iterations followed by 1 × 10⁶ Markov Chain Monte Carlo (MCMC) repeats. These settings were applied for values of K ranging from 1 to 4, with four independent runs for each K (a total of 16 STRUCTURE runs). For NEWHYBRIDS, eight independent analyses were run on a simulated data set with 100 loci and 200 individuals for each of the six genotype frequency classes (pure1, pure2, F1, F2, BC1, and BC2), with an initial burn-in of 500 replicates and 1,000 MCMC sweeps afterward (following the same settings as Wringe et al. [16]).

To assess the execution time obtained by parallelization in EasyParallel, we computed “speed-up” values using the equation S_p = T₁/T_p, where S_p is the speed-up obtained by distributing one analysis over p threads, T₁ is the execution time on a single thread (sequential run), and T_p is the execution time of the task on p threads [13]. We also compared the parallel performance of EasyParallel with that of two existing programs, Structure_threader and Parallelnewhybrid, using the same parameter settings and datasets for the parallel analyses. Structure_threader was previously shown to be more efficient and faster than similar multi-thread methods for performing multiple STRUCTURE runs (StrAuto and ParallelStructure), and was therefore considered an optimal target for performance comparison [13–15]. Parallelnewhybrid was the only known R package designed to execute multiple NEWHYBRIDS runs in parallel [16].
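As a worked example of the speed-up equation S_p = T₁/T_p, the short sketch below recomputes the speed gains from execution times. The timings are the EasyParallel (STRUCTURE) values reported for the i7 6700HQ machine in Table 1; the function name `speed_up` and the dictionary layout are illustrative choices, not part of EasyParallel.

```python
def speed_up(t_sequential, t_parallel):
    """Return S_p = T_1 / T_p for one sequential and one parallel timing."""
    if t_parallel <= 0:
        raise ValueError("execution time must be positive")
    return t_sequential / t_parallel


# Execution times in seconds, keyed by number of threads
# (EasyParallel, STRUCTURE, i7 6700HQ column of Table 1).
times = {1: 14711, 2: 7772, 4: 4052, 6: 3617, 8: 3049}

# Speed-up relative to the single-thread run, rounded as in the table.
gains = {p: round(speed_up(times[1], t), 2) for p, t in times.items()}

for p, s in gains.items():
    print(f"{p} threads: {times[p]} s, speed-up {s:.2f}")
```

With these inputs the speed-up on 2 threads is 1.89 and on 8 threads is 4.82, matching the parenthesized values in the table and showing how far the gain falls below the thread count as threads are added.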

3. Results and discussion

For all STRUCTURE and NEWHYBRIDS analyses, parallel runs in EasyParallel were generally faster than a sequential run on a single thread (Fig 2, Table 1). However, the speed gain from parallelization did not increase linearly with the number of threads. This phenomenon has been previously reported in other parallel programs [13,15,16]. One potential explanation for this nonlinearity is that the operating system and processor must share computational resources between the intensive tasks (i.e., STRUCTURE and NEWHYBRIDS parallel runs) and underlying system processes, which affects the performance of parallelization [16]. In addition, “cache thrashing” may reduce the speed of parallelization when working with larger data sets and when both logical threads in one physical core share L2 and L3 caches [15]. Despite this nonlinearity, we observed that the performance of EasyParallel was not limited by the availability of random access memory (RAM), as RAM usage was always low during parallelization.

Fig 2. The speed increase was calculated by dividing the execution time on a single thread (sequential run) by the execution time obtained with different numbers of threads. i7 4700MQ ‒ Lenovo Y510, Windows 10, 2.4 GHz Intel Core i7-4700MQ with 8 GB RAM and 4 physical cores (8 logical threads); i7 6700HQ ‒ Lenovo Y700, Windows 10, 2.6 GHz Intel Core i7-6700HQ with 8 GB RAM and 4 physical cores (8 logical threads); MacPro i5 ‒ MacBook Pro, OS 10.14, 2.7 GHz Intel Core i5 with 16 GB RAM and 2 physical cores; MacPro i7 ‒ MacBook Pro, OS 10.14, 2.6 GHz Intel Core i7 with 16 GB RAM and 6 physical cores.

Table 1. Computational time (s) required to complete STRUCTURE and NEWHYBRIDS analyses in series compared to in parallel using EasyParallel, Structure_threader, and Parallelnewhybrid.

The speed gain (in parentheses) was calculated by dividing the execution time on a single thread (sequential run) by the execution time obtained with different numbers of threads. The analyses were repeated using different operating systems and CPU architectures: i7 4700MQ ‒ Lenovo Y510, Windows 10, 2.4 GHz Intel Core i7-4700MQ with 8 GB RAM and 4 physical cores (8 logical threads); i7 6700HQ ‒ Lenovo Y700, Windows 10, 2.6 GHz Intel Core i7-6700HQ with 8 GB RAM and 4 physical cores (8 logical threads); MacPro i5 ‒ MacBook Pro, OS 10.14, 2.7 GHz Intel Core i5 with 16 GB RAM and 2 physical cores; MacPro i7 ‒ MacBook Pro, OS 10.14, 2.6 GHz Intel Core i7 with 16 GB RAM and 6 physical cores.

Threads	i7 6700HQ	i7 4700MQ	MacPro i5	MacPro i7
EasyParallel (STRUCTURE)
1	14711	14943	8226	5307
2	7772 (1.89)	7929 (1.88)	4143 (1.99)	2785 (1.91)
4	4052 (3.63)	5212 (2.87)	‒	1561 (3.40)
6	3617 (4.07)	5106 (2.93)	‒	1300 (4.08)
8	3049 (4.82)	4733 (3.16)	‒	‒
Structure_threader
1	14688	14980	8193	5328
2	7762 (1.89)	7808 (1.92)	4145 (1.98)	2811 (1.90)
4	4040 (3.64)	5255 (2.85)	‒	1551 (3.44)
6	3597 (4.08)	5099 (2.94)	‒	1282 (4.16)
8	2999 (4.90)	4708 (3.18)	‒	‒
EasyParallel (NEWHYBRIDS)
1	1574	1594	793	683
2	810 (1.94)	820 (1.94)	418 (1.90)	375 (1.82)
4	489 (3.22)	606 (2.63)	‒	206 (3.32)
6	480 (3.28)	551 (2.89)	‒	205 (3.33)
8	330 (4.77)	407 (3.92)	‒	‒
Parallelnewhybrid
1	1500	1617	828	710
2	800 (1.87)	837 (1.91)	445 (1.86)	377 (1.88)
4	477 (3.08)	562 (2.81)	‒	208 (3.39)
6	478 (3.19)	553 (2.86)	‒	206 (3.36)
8	323 (4.69)	403 (3.92)	‒	‒

The runtime and speed gain obtained by EasyParallel, Structure_threader, and Parallelnewhybrid were very similar (Fig 2, Table 1), regardless of the number of threads, operating system, or CPU used for the analysis. The use of the same “multiprocessing” and “subprocess” modules in both EasyParallel and Structure_threader would explain the minimal difference in performance for repeated STRUCTURE runs. On the other hand, although EasyParallel and Parallelnewhybrid performed equally well in analyzing multiple simulated data sets, EasyParallel was more efficient in processing the input data, as each thread shared the same executable input file. Parallelnewhybrid, however, needs to duplicate the input data for each thread and produces temporary files during parallel computing. Beyond that, the key feature of EasyParallel is its graphical user interface, which facilitates data processing and makes it accessible to users who have limited knowledge of any programming language.

4. Conclusion

In summary, we have developed Python-based software that assists users working with iteration processes in STRUCTURE and NEWHYBRIDS analyses. EasyParallel is a user-friendly and free platform that combines a point-and-click interface and multi-core parallelization for multiple independent runs of a dataset, assisting the user in assessing the most biologically likely K and estimating hybrid class assignment accuracy. EasyParallel is also stand-alone software executable on both MacOS and Windows operating systems, with all modules and the source code packaged to run directly without installation.

Acknowledgments

The authors wish to thank Wenlu Wang for code debugging. The authors appreciate the help of Katherine Silliman and Matt Lewis in manuscript revision and in-house program tests.

Data Availability

The software, source code, user manual, and example data sets are available online from https://github.com/hzz0024/EasyParallel

Funding Statement

The authors received no specific funding for this work.

References

1. Goodwin S, McPherson JD, McCombie WR (2016) Coming of age: ten years of next-generation sequencing technologies. Nature Reviews Genetics 17, 333. doi:10.1038/nrg.2016.49
2. Wang W, Zhang J, Sun M-T, Ku W-S (2017) Efficient parallel spatial skyline evaluation using MapReduce.
3. Wang W, Zhang J, Sun M-T, Ku W-S (2019) A scalable spatial skyline evaluation system utilizing parallel independent region groups. The VLDB Journal 28, 73–98.
4. Pritchard JK, Stephens M, Donnelly P (2000) Inference of population structure using multilocus genotype data. Genetics 155, 945–959.
5. Hargrove JS, Rogers MW, Kacmar PT, Black P (2019) A statewide evaluation of Florida Bass genetic introgression in Tennessee. North American Journal of Fisheries Management.
6. Porras-Hurtado L, Ruiz Y, Santos C, et al. (2013) An overview of STRUCTURE: applications, parameter settings, and supporting software. Front Genet 4, 98. doi:10.3389/fgene.2013.00098
7. Silliman K (2019) Population structure, genetic connectivity, and adaptation in the Olympia oyster (Ostrea lurida) along the west coast of North America. Evolutionary Applications 12, 923–939. doi:10.1111/eva.12766
8. Thongda W, Lewis M, Zhao H, et al. (2019) Species-diagnostic SNP markers for the black basses (Micropterus spp.): a new tool for black bass conservation and management. Conservation Genetics Resources, 1–10.
9. Thongda W, Zhao H, Zhang D, et al. (2018) Development of SNP panels as a new tool to assess the genetic diversity, population structure, and parentage analysis of the eastern oyster (Crassostrea virginica). Marine Biotechnology 20, 385–395. doi:10.1007/s10126-018-9803-y
10. Zhao H, Fuller A, Thongda W, et al. (2019) SNP panel development for genetic management of wild and domesticated white bass (Morone chrysops). Animal Genetics 50, 92–96. doi:10.1111/age.12747
11. Evanno G, Regnaut S, Goudet J (2005) Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study. Mol Ecol 14, 2611–2620. doi:10.1111/j.1365-294X.2005.02553.x
12. Anderson E, Thompson E (2002) A model-based method for identifying species hybrids using multilocus genetic data. Genetics 160, 1217–1229.
13. Besnier F, Glover KA (2013) ParallelStructure: a R package to distribute parallel runs of the population genetics program STRUCTURE on multi-core computers. PLoS One 8, e70651. doi:10.1371/journal.pone.0070651
14. Chhatre VE, Emerson KJ (2017) StrAuto: automation and parallelization of STRUCTURE analysis. BMC Bioinformatics 18, 192. doi:10.1186/s12859-017-1593-0
15. Pina-Martins F, Silva DN, Fino J, Paulo OS (2017) Structure_threader: An improved method for automation and parallelization of programs structure, fastStructure and MavericK on multicore CPU systems. Molecular Ecology Resources 17, e268–e274. doi:10.1111/1755-0998.12702
16. Wringe BF, Stanley RR, Jeffery NW, Anderson EC, Bradbury IR (2017) parallelnewhybrid: an R package for the parallelization of hybrid detection using newhybrids. Molecular Ecology Resources 17, 91–95. doi:10.1111/1755-0998.12597

PLoS One. doi: 10.1371/journal.pone.0232110.r001

Decision Letter 0

Roberto Fritsche-Neto

27 Dec 2019

PONE-D-19-32193

EasyParallel: a GUI platform for parallelization of STRUCTURE and NEWHYBRIDS analyses

PLOS ONE

Dear Dr. Zhao,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

We would appreciate receiving your revised manuscript by Feb 10 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,

Roberto Fritsche-Neto, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

http://www.journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and http://www.journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This manuscript has the potential to be useful and talks about one GUI platform for parallelization. Its point-and-click, simple and intuitive platform called EasyParallel. I ran the example data that comes with STRUCTURE and NEWHYBRIDS following the “Demo” available at http://webhome.auburn.edu/~hzz0024/web/demo/ and works pretty well in my computer. However, it’s pretty similar with already exist for parallelization. I really recommend the authors include more functions or tools to let the software more attractive, for example: function to help the users to build the input files for STRUCTURE and NEWHYBRIDS, in other words, a similar approach of questions and answers where the user inform the parameters to get the input files for parallelization. Also, the authors could add a window for results interpretation with, for example graphics and tables, because if the main goal is people with little experience in programming with the actual version of EasyParallel those people still needing other software to process the outputs and make graphics. I tried to insert point-by-point my comments to help in the correction process:

Line 55-57: Its redundant.

Line 70: “ … Monte Carlo (MCMC) to resampling …“

Line 71: I had the impression that the authors are using the word “burnin” as the same idea as total “iterations”. Normally, only the first portion of interactions are called by burnin, where the interaction process exercised the priors. This testing will force failures under supervised conditions and then established the interactions. Could you review this sentence to follow the cited paper Porras-Hurtado et al (2013): “STRUCTURE uses a systematic Bayesian clustering approach applying Markov Chain Monte Carlo (MCMC) estimation. The MCMC process begins by randomly assigning individuals to a pre-determined number of groups, then variant frequencies are estimated in each group and individuals re-assigned based on those frequency estimates. This is repeated many times, typically comprising 100,000 iterations, in the burnin process that results in a progressive convergence toward reliable allele frequency estimates in each population and membership probabilities of individuals to a population.

Measurement of the assumed number of populations uses the MCMC estimation and is performed separately from the burnin.”.

Line 91-92: This is not one of your objectives. In my opinion EasyParallel is doing only the parallelization process and the other software (STRUCTURE and NEWHYBRIDS) are “assisting the user in assessing the most biologically likely number of clusters (K) and estimating hybrid class assignment accurately.”.

Line 128: It is not clear if the “Execution Time Analyses” was performed using the GUI version or code line.

Line 133-138: Why the authors chose these specific machines? Why did not you use computers with i3 or 4GB RAM, that regular people with no experience with programing have?

Line 143: “MCMC” also is an iteration process as “burnin”, please be consistent.

Line 162: “always” it is not true for “i7.6700HQ” and “MacPro i5” during the STRUCTURE comparation. Please use terms as “in general” or “majority”.

Line 171-173: I did not find these results. Could you add a table as a supplementary file with these RAM results? I believe the execution time is being influenced by the fact of MacOS’s computers have twice RAM than Windows’s.

Line 173-177: This comparation should be made with similar machines, same processor (i3, i5, i7, or i9), RAM, physical cores, etc. Also, the comparation of operating systems is not one of your objectives.

Line 195-197: Only in this last sentence is clear that the software doesn’t has the option to run in command code or prompt. Please be more specific and move this part for material and methods.

Figure 1: I confess that I spent a time trying to understand why the EasyParallel logo is a fish. I understand the group works with aquaculture but using a fish as logo is not helping at all to get attention for the software. I really recommend change the logo for some genetic or parallel symbol. In addition, it will be interesting and helpful if the software provides some results visualization as graphics and tables as suggested above.

Figure 2: The authors are comparing computers and not software, which it is the main idea. I recommend to exchange the position of software and computer where the lines should be the software (EasyParallel, STRUCTURE and NEWHYBRIDS).

Table 1: Use “-” instead NA.

The manual is clear and well done. Only one correction in “Step 3” where the mainparams was wrote twice at the link: http://webhome.auburn.edu/~hzz0024/web/doc/.

Reviewer #2: This paper presented a Python-based software named EasyParallel that assists users working with iteration processes in STRUCTURE and NEWHYBRIDS analyses. STRUCTURE and NEWHYBRIDS software are widely used in population genetic structure studies, admixture, and hybridization. The analyzes performed by these programs usually require a large computational time, especially when large genotyped populations with a large number of molecular markers are analyzed. The multi-threaded parallelization of EasyParallel allows processing of large genetic datasets in an efficient way, providing ready access to users who have little experience in script programming. The authors use clear and straightforward language and provide all relevant data and information. Therefore, I recommend accepting the article for publication.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2020 Apr 24;15(4):e0232110. doi: 10.1371/journal.pone.0232110.r002

Author response to Decision Letter 0

10 Mar 2020

PONE-D-19-32193

EasyParallel: a GUI platform for parallelization of STRUCTURE and NEWHYBRIDS analyses

PLOS ONE

Dear Dr. Zhao,

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

Please include the following items when submitting your revised manuscript:

• A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.

• A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.

• An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

We look forward to receiving your revised manuscript.

Kind regards,

Roberto Fritsche-Neto, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Partly

Reviewer #2: Yes

________________________________________

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: Yes

________________________________________

3. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #2: Yes

________________________________________

4. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #2: Yes

________________________________________

5. Review Comments to the Author

________________________________________

We appreciate the reviewer’s comments here. We agree that programs such as Structure_threader and parallelnewhybrid have been developed to help process tasks in parallel on a multi-threaded processor. However, such software requires at least some programming knowledge (e.g., R, Bash) to run. To the best of our knowledge, our program is the first GUI platform that supports parallel running of STRUCTURE and NEWHYBRIDS.

I strongly recommend that the authors include more functions or tools to make the software more attractive, for example a function to help users build the input files for STRUCTURE and NEWHYBRIDS, that is, a question-and-answer approach in which the user supplies the parameters to obtain the input files for parallelization. The authors could also add a window for results interpretation with, for example, graphics and tables: if the main audience is people with little programming experience, then with the current version of EasyParallel those users still need other software to process the outputs and make graphics. I have tried to give my comments point by point to help in the correction process:

While we appreciate the reviewer’s suggestion here, we feel that building a function for input manipulation requires careful design and considerable effort, as STRUCTURE itself needs two parameter files (mainparams and extraparams) along with the genotype input data. Both STRUCTURE (https://web.stanford.edu/group/pritchardlab/structure_software/release_versions/v2.3.4/structure_doc.pdf) and NEWHYBRIDS (https://github.com/eriqande/newhybrids/blob/master/new_hybs_doc1_1Beta3.pdf) document their input parameters thoroughly. In addition, existing programs such as widgetcon (Aydın et al., 2019) and PGDSpider (Lischer and Excoffier, 2012) are well developed for preparing the input data. We therefore feel that our current front end is the clearest presentation for readers interested in parallel computing, and in the future we will design such functions as suggested by the reviewer.

We agree that it would be helpful to develop a window for results interpretation and plotting. However, existing programs such as CLUMPAK (Kopelman et al., 2015), POPHELPER (Francis, 2017), StructureSelector (Li and Liu, 2018), and KFinder (Wang, 2019) are already widely adopted for output plotting and interpretation. We will consider the reviewer’s advice and add this function in our next release.

Lines 55-57: This is redundant.

We have rephrased this sentence to make it simple and clean.

Line 70: “ … Monte Carlo (MCMC) to resampling …“

Corrected

We have rephrased the sentence as “Because STRUCTURE requires to minimize the effect of the starting configuration, many iterations are needed during the burnin process.”

We agree. Our manuscript explicitly explains that our platform builds upon STRUCTURE and NEWHYBRIDS. What we tried to explain here is that our platform enables running multiple values of K (within the predefined range) with one click, so that the user can assess the optimal K without repeatedly running the program with different K values. We have addressed the reviewer’s comment by deleting this sentence.

Line 128: It is not clear whether the “Execution Time Analyses” were performed using the GUI version or the command line.

The analyses are based on the GUI version, and we have revised the manuscript to clarify the setting in lines 97-99.

Lines 133-138: Why did the authors choose these specific machines? Why did you not use computers with an i3 processor or 4 GB of RAM, which ordinary users with no programming experience are likely to have?

Thank you for your valuable advice. We tried to test on as many machines as possible and to evaluate performance in various settings. However, we decided to focus on forward compatibility instead of backward compatibility. We will try to cover more settings in our next release.

Line 143: “MCMC” is also an iterative process, like “burnin”; please be consistent.

Corrected

Line 162: “always” is not true for “i7.6700HQ” and “MacPro i5” in the STRUCTURE comparison. Please use terms such as “in general” or “majority”.

We have revised the manuscript accordingly.

We appreciate the reviewer’s valuable comment here. However, we drew this conclusion only from observing real-time RAM usage and did not record the data. STRUCTURE and NEWHYBRIDS are not memory-demanding algorithms, and RAM size is not a bottleneck for parallelization. The same observation has also been reported by Besnier and Glover (2013) and Wringe et al. (2017).

We agree with the reviewer that such comparisons should be made using the same machine settings. We have revised the manuscript accordingly.

Lines 195-197: Only in this last sentence does it become clear that the software does not have the option to run from the command line or prompt. Please be more specific and move this part to the Materials and Methods.

We appreciate the reviewer’s comment here. We have stated this in the Materials and Methods section (lines 97-99). We hope this clarifies our intention.

We appreciate the reviewer’s valuable comment, and we do hope the software will be used by a broad community. Following the reviewer’s comment, we redesigned the EasyParallel logo to make it easier to remember and to reflect the parallelization. Again, we appreciate the reviewer’s advice about results visualization. Please see our answer at the start of our response.

We agree with the reviewer and have redrawn Figure 1.

Table 1: Use “-” instead of NA.

Corrected

The manual is clear and well done. Only one correction is needed, in “Step 3”, where mainparams was written twice at the link: http://webhome.auburn.edu/~hzz0024/web/doc/.

Corrected

We appreciate the reviewer’s comments and recommendation!

________________________________________

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

References:

Aydın, M., Kryvoruchko, I. S., & Şakiroğlu, M. (2019). widgetcon: A website and program for quick conversion among common population genetic data formats. Molecular Ecology Resources, 19(5), 1374-1377.

Besnier, F., & Glover, K. A. (2013). ParallelStructure: A R package to distribute parallel runs of the population genetics program STRUCTURE on multi-core computers. PLoS One, 8(7).

Francis, R. M. (2017). pophelper: an R package and web app to analyse and visualize population structure. Molecular Ecology Resources, 17(1), 27-32.

Li, Y. L., & Liu, J. X. (2018). StructureSelector: A web‐based software to select and visualize the optimal number of clusters using multiple methods. Molecular Ecology Resources, 18(1), 176-177.

Lischer, H. E., & Excoffier, L. (2012). PGDSpider: an automated data conversion tool for connecting population genetics and genomics programs. Bioinformatics, 28(2), 298-299.

Kopelman, N. M., Mayzel, J., Jakobsson, M., Rosenberg, N. A., & Mayrose, I. (2015). Clumpak: a program for identifying clustering modes and packaging population structure inferences across K. Molecular Ecology Resources, 15(5), 1179-1191.

Wang, J. (2019). A parsimony estimator of the number of populations from a STRUCTURE‐like analysis. Molecular Ecology Resources, 19(4), 970-981.

Wringe, B. F., Stanley, R. R., Jeffery, N. W., Anderson, E. C., & Bradbury, I. R. (2017). parallelnewhybrid: an R package for the parallelization of hybrid detection using newhybrids. Molecular Ecology Resources, 17(1), 91-95.

Attachment

Submitted filename: Response to Reviewers.docx

PLoS One. doi: 10.1371/journal.pone.0232110.r003

Decision Letter 1

Roberto Fritsche-Neto

8 Apr 2020

EasyParallel: a GUI platform for parallelization of STRUCTURE and NEWHYBRIDS analyses

PONE-D-19-32193R1

Dear Dr. Zhao,

We are pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it complies with all outstanding technical requirements.

Within one week, you will receive an e-mail containing information on the amendments required prior to publication. When all required modifications have been addressed, you will receive a formal acceptance letter and your manuscript will proceed to our production department and be scheduled for publication.

Shortly after the formal acceptance letter is sent, an invoice for payment will follow. To ensure an efficient production and billing process, please log into Editorial Manager at https://www.editorialmanager.com/pone/, click the "Update My Information" link at the top of the page, and update your user information. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, you must inform our press team as soon as possible and no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

With kind regards,

Roberto Fritsche-Neto, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

**********

6. Review Comments to the Author

Reviewer #1: Thank you for accepting my suggestions.

The only remaining minor revision is to change the logo at https://github.com/hzz0024/EasyParallel.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Filipe Inácio Matias

PLoS One. doi: 10.1371/journal.pone.0232110.r004

Acceptance letter

Roberto Fritsche-Neto

13 Apr 2020

PONE-D-19-32193R1

EasyParallel: a GUI platform for parallelization of STRUCTURE and NEWHYBRIDS analyses

Dear Dr. Zhao:

I am pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

For any other questions or concerns, please email plosone@plos.org.

Thank you for submitting your work to PLOS ONE.

With kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Professor Roberto Fritsche-Neto

Academic Editor

PLOS ONE

Associated Data

Data Availability Statement

The software, source code, user manual, and example data sets are available online from https://github.com/hzz0024/EasyParallel

1. Goodwin S, McPherson JD, McCombie WR (2016) Coming of age: ten years of next-generation sequencing technologies. Nature Reviews Genetics 17, 333. doi:10.1038/nrg.2016.49

2. Wang W, Zhang J, Sun M-T, Ku W-S (2017) Efficient parallel spatial skyline evaluation using MapReduce.

3. Wang W, Zhang J, Sun M-T, Ku W-S (2019) A scalable spatial skyline evaluation system utilizing parallel independent region groups. The VLDB Journal—The International Journal on Very Large Data Bases 28, 73–98.

4. Pritchard JK, Stephens M, Donnelly P (2000) Inference of population structure using multilocus genotype data. Genetics 155, 945–959.

5. Hargrove JS, Rogers MW, Kacmar PT, Black P (2019) A statewide evaluation of Florida Bass genetic introgression in Tennessee. North American Journal of Fisheries Management.

6. Porras-Hurtado L, Ruiz Y, Santos C, et al. (2013) An overview of STRUCTURE: applications, parameter settings, and supporting software. Front Genet 4, 98. doi:10.3389/fgene.2013.00098

7. Silliman K (2019) Population structure, genetic connectivity, and adaptation in the Olympia oyster (Ostrea lurida) along the west coast of North America. Evolutionary applications 12, 923–939. doi:10.1111/eva.12766

8. Thongda W, Lewis M, Zhao H, et al. (2019) Species-diagnostic SNP markers for the black basses (Micropterus spp.): a new tool for black bass conservation and management. Conservation Genetics Resources, 1–10.

9. Thongda W, Zhao H, Zhang D, et al. (2018) Development of SNP panels as a new tool to assess the genetic diversity, population structure, and parentage analysis of the eastern oyster (Crassostrea virginica). Marine biotechnology 20, 385–395. doi:10.1007/s10126-018-9803-y

10. Zhao H, Fuller A, Thongda W, et al. (2019) SNP panel development for genetic management of wild and domesticated white bass (Morone chrysops). Animal genetics 50, 92–96. doi:10.1111/age.12747

11. Evanno G, Regnaut S, Goudet J (2005) Detecting the number of clusters of individuals using the software STRUCTURE: a simulation study. Mol Ecol 14, 2611–2620. doi:10.1111/j.1365-294X.2005.02553.x

12. Anderson E, Thompson E (2002) A model-based method for identifying species hybrids using multilocus genetic data. Genetics 160, 1217–1229.

13. Besnier F, Glover KA (2013) ParallelStructure: A R package to distribute parallel runs of the population genetics program STRUCTURE on multi-core computers. PLoS One 8, e70651. doi:10.1371/journal.pone.0070651

14. Chhatre VE, Emerson KJ (2017) StrAuto: automation and parallelization of STRUCTURE analysis. BMC bioinformatics 18, 192. doi:10.1186/s12859-017-1593-0

15. Pina‐Martins F, Silva DN, Fino J, Paulo OS (2017) Structure_threader: An improved method for automation and parallelization of programs structure, fastStructure and MavericK on multicore CPU systems. Molecular ecology resources 17, e268–e274. doi:10.1111/1755-0998.12702

16. Wringe BF, Stanley RR, Jeffery NW, Anderson EC, Bradbury IR (2017) parallelnewhybrid: an R package for the parallelization of hybrid detection using newhybrids. Molecular ecology resources 17, 91–95. doi:10.1111/1755-0998.12597
