Abstract
jModelTest is a Java program for the statistical selection of models of nucleotide substitution, with thousands of users around the world. For large data sets, the calculations carried out by this program can be too expensive for many users. Here we describe the port of the jModelTest code to Grid computing using DRMAA. This work should facilitate the use of jModelTest on a broad scale.
Keywords: Model selection, Statistical Phylogenetics, DRMAA, Grid
Introduction
The estimation of evolutionary relationships between DNA sequences has important biological and biomedical implications. Nowadays, phylogenetic trees are used, for example, to predict gene function, to track tumor mutations or to monitor the geographical spread of pathogens. An essential step of phylogenetic analysis is the selection of an appropriate model of nucleotide substitution [1]. In recent years, different statistical selection strategies have been proposed [2] and several programs have been developed to carry out this task [3]. Among them, the most popular has been ModelTest [4], with more than 25,000 registered users, recently superseded by its Java implementation, jModelTest [5]. However, the phylogenetic analysis of increasingly common large sequence alignments poses important computational challenges and, in this regard, Grid computing can offer viable solutions.
1. jModelTest
jModelTest [5] is a Java program that carries out the statistical selection of best-fit models of nucleotide substitution (http://darwin.uvigo.es). In this context, the estimation of model scores, which involves likelihood optimization, is the most time-consuming step. These calculations are delegated to the program Phyml in a serial fashion up to 88 times, which is the number of models implemented in jModelTest. The 88 calculations are completely independent and therefore ideal for Grid computing.
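The independence of these 88 calculations can be illustrated with a minimal Java sketch in which a local thread pool stands in for Grid job submission; `evaluateModel` is a hypothetical placeholder for one Phyml likelihood optimization, not part of the actual jModelTest code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelModels {
    // Hypothetical stand-in for one Phyml likelihood optimization.
    static double evaluateModel(int modelIndex) {
        return -1000.0 - modelIndex; // placeholder log-likelihood
    }

    public static void main(String[] args) throws Exception {
        final int N_MODELS = 88; // number of substitution models in jModelTest
        ExecutorService pool = Executors.newFixedThreadPool(8);
        List<Future<Double>> futures = new ArrayList<>();
        for (int m = 0; m < N_MODELS; m++) {
            final int model = m;
            // Each model score is computed with no dependence on the others,
            // so submission order and completion order are both irrelevant.
            futures.add(pool.submit(() -> evaluateModel(model)));
        }
        double best = Double.NEGATIVE_INFINITY;
        for (Future<Double> f : futures) best = Math.max(best, f.get());
        pool.shutdown();
        System.out.println("models=" + futures.size() + " best=" + best);
    }
}
```

Because no task reads another task's output, the same structure maps directly onto 88 independent Grid jobs.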
2. The DRMAA Grid version of jModelTest
jModelTest (the “Serial” version henceforth) was parallelized in two steps. First, it was adapted to the Distributed Resource Management Application API (DRMAA) [6] so that it could be executed on different types of local clusters. Second, the resulting version (the “Local DRMAA” version) was modified to run on Grid infrastructures (the “Grid DRMAA” version). The final distributed release can be executed on most sites because it does not depend on specific libraries or capabilities that may or may not be available at particular locations.
2.1. DRMAA implementation
DRMAA is a high-level API specification for the submission and control of jobs to one or more sites within a Grid infrastructure. It allows for unattended Grid execution: submission, monitoring and control of jobs. Depending on the chosen scheduler, the same application can be executed on private resources (SGE, Condor) or on the Grid. Communication with DRMAA takes place through a Session object, which represents the operations available for interacting with the Distributed Resource Manager (DRM). First, the DRMAA version of jModelTest prepares the information for Phyml, gets the session and allocates a DRMAA job template, which is used to submit the likelihood calculations to the DRM System (DRMS).
The input and output files are always available on a shared file system. jModelTest waits for all Phyml calculations to finish using the session.synchronize method, then closes the session and parses the Phyml output files. Once this information has been gathered, all the file copies are removed, as well as the log files if the process was successful. If the DRMAA version is run on a cluster with a high failure rate, it is better to use the session.wait method to wait for each job individually and check its status; if a job fails, it can be submitted again.
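The resubmission pattern can be sketched in plain Java. This is only an illustration under stated assumptions: the DRMAA-specific calls (session.runJob on the template, session.wait on the returned job id, and a JobInfo exit-status check) are abstracted into a generic Callable so the retry logic itself can be shown in isolation; `runWithRetry` and the flaky stub in `main` are hypothetical names, not part of jModelTest:

```java
import java.util.concurrent.Callable;

public class RetrySubmit {
    /**
     * Runs a job and resubmits it on failure, up to maxRetries times.
     * In the DRMAA version, one attempt would be session.runJob(template)
     * followed by session.wait(jobId, ...) and checking the job status;
     * here an attempt is a plain Callable returning success/failure.
     */
    static boolean runWithRetry(Callable<Boolean> job, int maxRetries) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                if (job.call()) return true; // job finished successfully
            } catch (Exception e) {
                // Submission or execution error: fall through and resubmit.
            }
        }
        return false; // gave up after maxRetries resubmissions
    }

    public static void main(String[] args) {
        // Simulated flaky job that fails twice before succeeding.
        final int[] calls = {0};
        boolean ok = runWithRetry(() -> ++calls[0] >= 3, 2);
        System.out.println("succeeded=" + ok + " attempts=" + calls[0]);
    }
}
```

With session.synchronize, by contrast, a single failed job blocks the whole barrier, which is why per-job waiting is preferable on fault-prone clusters.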
2.2. Grid implementation
By employing DRMAA, the code can be seamlessly coupled to different schedulers. In this case we chose GridWay [7]. A build.xml file was created to manage the compilation process executed by ant, a Java-based make tool. The location of the GridWay DRMAA library is supplied in an environment variable, DRMAA, so jModelTest calls to these functions are handled by that library and then by the chosen scheduler. The jModelTest application itself acts as a front-end for the final user, since GridWay works transparently.
At the beginning of the execution, jModelTest creates a DRMAA job template for each task submitted to the DRMS. This template describes the characteristics of the task to be executed and indicates which files have to be copied to the remote sites and which should be brought back as partial results (no storage is requested after the calculations are finished). The retrieval of information is determined by the Grid infrastructure middleware. Also, as part of the DRMAA execution, a shell script is employed to call the main executable with the desired input parameters, as described in the DRMAA template. For Grid computing, this script was slightly modified to check for proper permissions of the remote executable and to rename the input files according to the executable's requirements. For the local DRMAA version, it was necessary to create temporary copies of the input file for each task, to avoid overwriting.
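The per-task copying on the shared file system can be sketched as follows; the directory layout and file names (`input.phy`, `taskN/` subdirectories) are illustrative assumptions rather than the actual jModelTest conventions:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class TaskInputs {
    /**
     * Gives each task its own copy of the shared input alignment, so
     * concurrent Phyml runs never overwrite one another's working files.
     */
    static Path copyForTask(Path input, Path workDir, int taskId) throws IOException {
        Path taskDir = Files.createDirectories(workDir.resolve("task" + taskId));
        Path copy = taskDir.resolve(input.getFileName());
        return Files.copy(input, copy, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        Path workDir = Files.createTempDirectory("jmt");
        Path input = Files.writeString(workDir.resolve("input.phy"), "8 3009\n");
        for (int task = 0; task < 3; task++) {
            Path copy = copyForTask(input, workDir, task);
            System.out.println(copy.getParent().getFileName() + "/" + copy.getFileName());
        }
    }
}
```

On the Grid this step is unnecessary, since each remote site stages its own copy of the input files.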
3. Evaluation of performance
The performance of the different versions of jModelTest was measured by comparing the running times needed to calculate the likelihood scores for the 88 models in two small real data sets (“HIV”: 8 sequences of 3009 nucleotides from the HIV-1 polymerase gene; “Yeast”: 8 sequences of 127060 nucleotides representing 106 yeast genes).
The “Serial” and “Local DRMAA” versions of the code were evaluated on the CESGA SVGD cluster, with 36 nodes holding two quad-core Intel Xeon 5310 processors (1.6 GHz, 4 GB RAM) and 4 nodes holding two quad-core Intel Xeon 5355 processors (2.66 GHz, 8 GB RAM). The SVGD cluster decreases the priority of a user's jobs as the number of that user's running jobs increases, so the last submissions wait longer in the queue. The Grid DRMAA version was evaluated on the biomed VO of the EGEE Project infrastructure. This is a real production environment with a diversity of resources (cores from 1 GHz to 3.2 GHz, for example). Its status can be monitored in real time at http://www3.egee.cesga.es/.
When queue times were short, the runtime of the “Serial” version for the HIV data set was 12.22 minutes and that of the “Local DRMAA” version was 2.40 minutes, i.e., a speed-up of 4.63. On the other hand, when the system workload was high and the queue time therefore longer, the runtime was 5.29 minutes (speed-up of 2.25). Under the ideal situation, with no additional jobs in the cluster so that all 88 jobs ran in parallel, the speed-up was 18.09 (limited by the execution time of the most time-consuming model calculation).
With the Grid DRMAA version, using only resources at CIEMAT (ceeela.ciemat.es and lcg02.ciemat.es), the execution time was 5.83 minutes, representing a speed-up of 2.1. When a wider set of EGEE resources was employed (10 sites), the execution time was 5 minutes, with a speed-up of 2.44.
For the larger Yeast data set, the runtime was over 41 hours with the “Serial” version, 4.5 hours with the Local DRMAA version (light cluster usage) and 4.6 hours with the Grid DRMAA version (average wall-time over three executions), corresponding to speed-ups of 9.1 and 8.9, respectively. Both queue and execution times were taken into account in these figures, as can be seen in Figs. 1 and 2. It is also worth mentioning that GridWay was configured so that any job waited at most 10 minutes in a queued state, being migrated to another resource after that. In this case the speed-ups increased due to a higher run-to-queue time ratio.
Figure 1.
Local DRMAA execution of jModelTest: heavy cluster usage (yeast genes)
Figure 2.
Grid DRMAA execution of jModelTest (yeast genes)
In the SVGD cluster, where the available slots were fewer than the number of tasks to be executed, the queue time increased, since the last tasks had to wait for the previous ones to finish before being executed. In the case of the Grid DRMAA version, where the number of available slots was higher than the number required, the queue time remained constant. Because of this, the potential drawback in the quality of resources that a Grid infrastructure might have (as here, compared to the SVGD) can be overcome by an increased degree of parallelism, reducing execution time.
Although most jobs have a short queue time, in some cases the queue time can rise to 10 or even 50 times the average. This happens when a job is successfully queued at a resource whose CPU is still busy executing another job. When this is the case, or when any other problem occurs at any point of the remote execution, the job has to be submitted to another site and executed again. By using GridWay and setting a queue time limit of 10 minutes, all these steps are automatic, showing the convenience of GridWay for this particular problem and representing an asset when the user submits jobs in an opportunistic way.
At this point, it is important to bear in mind that the use cases, especially the first one, were small from the biological point of view. As can be seen in the summary displayed in Table 1, the larger the problem, the better the speed-up obtained with the DRMAA versions. Also, the comparison between the Local and Grid DRMAA versions depends on the workload of the cluster: as cluster usage increases, the advantage of the local version fades away.
Table 1.
Speedups obtained in this experiment (SVGD cluster with a light usage)
| Organism | DNA sequences | Nucleotides | SVGD cluster speed-up | EGEE Grid speed-up |
|---|---|---|---|---|
| HIV | 8 | 3009 | 4.63 | 2.1 |
| Yeast | 8 | 127060 | 9.1 | 8.9 |
Regarding overhead, in the case of the Local DRMAA version the overhead was due to the hardware limitations of the SVGD cluster: the limited number of nodes established an upper bound on the degree of parallelism, so the queue time grew with the number of tasks submitted. With respect to the Grid infrastructure, the addition of several hardware and software layers creates an additional overhead, produced by the copying of the input data and application executable to the remote sites (less than 1 MB) and the retrieval of the partial results. These two overheads are parallelizable and scale linearly. A further overhead is due to GridWay, since it dispatches 15 jobs every 30 seconds; with jModelTest creating 88 parallel tasks, this overhead has an upper bound of 180 seconds (six dispatch cycles of 30 seconds), which is considered acceptable, especially for longer, more CPU-demanding runs.
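The dispatch bound above follows directly from the batch size and interval, as a small sketch makes explicit (`maxDispatchDelay` is an illustrative helper, not part of GridWay or jModelTest):

```java
public class DispatchOverhead {
    /**
     * Upper bound, in seconds, on the scheduler dispatch delay when
     * `batch` jobs are released every `intervalSeconds`: the last of
     * `jobs` tasks waits at most ceil(jobs / batch) dispatch cycles.
     */
    static int maxDispatchDelay(int jobs, int batch, int intervalSeconds) {
        int cycles = (jobs + batch - 1) / batch; // integer ceil(jobs / batch)
        return cycles * intervalSeconds;
    }

    public static void main(String[] args) {
        // 88 jModelTest tasks, GridWay dispatching 15 jobs every 30 seconds.
        System.out.println(maxDispatchDelay(88, 15, 30)); // prints 180
    }
}
```

Since this bound grows only logarithmically in importance relative to run time, it becomes negligible for CPU-bound workloads like the Yeast data set.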
4. Conclusions
Grid computing has emerged as a powerful platform for tackling new and more ambitious problems. In addition, it has given the scientific community easier access to large computing resources beyond supercomputers. Model selection is an important step in statistical phylogenetics, where jModelTest has become the de facto program for this task. With the new DRMAA Grid version of jModelTest, we expect to increase the number of researchers using this tool and to allow large data sets to be processed in less time. A new release based on a pilot-job system with NAT capabilities is planned for the future, in order to test possible improvements in job execution.
References
- 1. Posada D, Buckley TR. Model selection and model averaging in phylogenetics: advantages of Akaike information criterion and Bayesian approaches over likelihood ratio tests. Syst Biol. 2004;53:793–808. doi: 10.1080/10635150490522304.
- 2. Posada D, Crandall KA. Selecting the best-fit model of nucleotide substitution. Syst Biol. 2001;50:580–601.
- 3. Nylander JA. The MrAIC.pl program, distributed by the author. Evolutionary Biology Centre, Uppsala University.
- 4. Posada D, Crandall KA. Modeltest: testing the model of DNA substitution. Bioinformatics. 1998;14:817–818. doi: 10.1093/bioinformatics/14.9.817.
- 5. Posada D. jModelTest: Phylogenetic Model Averaging. Mol Biol Evol. 2008;25:1253–1256. doi: 10.1093/molbev/msn083.
- 6. Tröger P, Rajic H, Haas A, Domagalski P. Standardization of an API for Distributed Resource Management Systems. Proc Seventh IEEE Int Symposium on Cluster Computing and the Grid; 2007. pp. 619–626.
- 7. Huedo E, Montero RS, Llorente IM. A framework for adaptive execution in Grids. Software: Practice and Experience. 2004;34:631–651.


