Abstract
Background
Lineage rate heterogeneity can be a major source of bias, especially in multi-gene phylogeny inference. We previously tackled this issue by developing LS3, a data subselection algorithm that removes fast-evolving sequences in a gene-specific manner to identify subsets of sequences that evolve at a relatively homogeneous rate. However, this algorithm had two major shortcomings: (i) it was automated and published as a set of bash scripts, and hence was Linux-specific and not user-friendly, and (ii) it could result in very stringent sequence subselection when extremely slow-evolving sequences were present.
Results
We address these challenges with a new, platform-independent program, LSX, written in R, which includes a reprogrammed version of the original LS3 algorithm and added features that improve lineage rate calculations. In addition, we developed and included an alternative version of the algorithm, LS4, which reduces lineage rate heterogeneity by detecting sequences that evolve too fast as well as sequences that evolve too slowly, resulting in less stringent data subselection when extremely slow-evolving sequences are present. The efficiency of LSX, and of LS4 on datasets with extremely slow-evolving sequences, is demonstrated with simulated data and by the resolution of a contentious node in the catfish phylogeny that was affected by unusually high lineage rate heterogeneity in the dataset.
Conclusions
LSX is a new bioinformatic tool with accessible code, with which the effect of lineage rate heterogeneity can be explored in gene sequence datasets of virtually any size. In addition, the two modalities of the sequence subsampling algorithm included, LS3 and LS4, allow the user to optimize the amount of non-phylogenetic signal removed while keeping a maximum of phylogenetic signal.
Electronic supplementary material
The online version of this article (10.1186/s12859-019-3020-1) contains supplementary material, which is available to authorized users.
Keywords: Long branch attraction, Lineage rate heterogeneity, Phylogenomics, Phylogenetic methods, Sequence subsampling
Background
We recently showed that biases emerging from evolutionary rate heterogeneity among lineages in multi-gene phylogenies can be reduced with a sequence data-subselection algorithm to the point of uncovering the true phylogenetic signal [1]. In that study, we presented an algorithm called Locus Specific Sequence Subsampling (LS3), which reduces lineage evolutionary rate heterogeneity gene-by-gene in multi-gene datasets. LS3 implements a likelihood ratio test (LRT) [2] between a model that assumes equal rates of evolution among all ingroup lineages (single rate model) and another that allows three user-defined ingroup lineages to have independent rates of evolution (multiple rates model). If the multiple rates model fits the data significantly better than the single rate model, the fastest-evolving sequence, as determined by its sum of branch lengths from root to tip (SBL), is removed, and the reduced dataset is tested again with the LRT. This is iterated until a set of sequences is found whose lineage evolutionary rates can be explained equally well by the single rate or the multiple rates model. Gene datasets that never reach this point, as well as the fast-evolving sequences removed from other gene alignments, are flagged as potentially problematic [1]. LS3 effectively reduced long branch attraction (LBA) artifacts in simulated and biological multi-gene datasets, and its utility to reduce phylogenetic biases has been recognized by several authors [3, 4].
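The iteration described above can be illustrated with a minimal Python sketch. It is not the LSX code: the function names, the `fit_models` callback (standing in for the two model fits that return log-likelihoods), the fixed SBL values, and the choice of two degrees of freedom for the LRT are all assumptions made for illustration. Flagging the gene when a removal would leave a lineage of interest without sequences is likewise an assumed stopping rule.

```python
import math


def lrt_p_value_df2(lnl_single, lnl_multiple):
    """P-value of a likelihood ratio test, ASSUMING 2 degrees of freedom.

    For 2 degrees of freedom the chi-square survival function has the
    closed form exp(-x / 2), so no statistics library is needed.
    """
    stat = max(0.0, 2.0 * (lnl_multiple - lnl_single))
    return math.exp(-stat / 2.0)


def ls3_like_subselect(sbl, loi_of, fit_models, alpha=0.05):
    """Iteratively drop the fastest sequence until the LRT is not significant.

    sbl        : dict, sequence name -> root-to-tip sum of branch lengths
                 (kept fixed here for simplicity; an assumption)
    loi_of     : dict, sequence name -> lineage-of-interest label
    fit_models : HYPOTHETICAL callback; takes the kept sequence names and
                 returns (lnL_single_rate, lnL_multiple_rates)
    Returns (kept names, removed names, flagged).
    """
    kept = dict(sbl)
    removed = []
    while True:
        lnl_single, lnl_multiple = fit_models(sorted(kept))
        if lrt_p_value_df2(lnl_single, lnl_multiple) >= alpha:
            return sorted(kept), removed, False  # rates look homogeneous
        fastest = max(kept, key=kept.get)
        same_loi = [s for s in kept if loi_of[s] == loi_of[fastest] and s != fastest]
        if not same_loi:
            # Assumed rule: a lineage would be left empty, so flag the gene.
            return sorted(kept), removed, True
        del kept[fastest]
        removed.append(fastest)


# Toy data: the fake fit reports strong heterogeneity only while "fast" is present.
def fake_fit(names):
    return (-100.0, -90.0) if "fast" in names else (-100.0, -99.5)


toy_sbl = {"a1": 0.10, "a2": 0.12, "b1": 0.11, "fast": 0.90, "c1": 0.13}
toy_loi = {"a1": "A", "a2": "A", "b1": "B", "fast": "B", "c1": "C"}
kept, removed, flagged = ls3_like_subselect(toy_sbl, toy_loi, fake_fit)
print(kept, removed, flagged)
```

In a real analysis the branch lengths and likelihoods would be re-estimated after each removal, which this sketch leaves to the callback.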
The published LS3 algorithm is executed by a set of Linux-specific bash scripts (“LS3-bash”). Here we present a new, rewritten program that is much faster, more user-friendly, contains important new features, and can be used across all platforms. We also developed and included a new data subselection algorithm based on LS3, called “LS3 supplement” or LS4, which leads to lineage evolutionary rate homogeneity by removing sequences that evolve too fast and also those that evolve too slowly.
Implementation
The new program, LSX, is written entirely in R [5], and uses PAML [6] and the R packages ape [7, 8] and adephylo [9]. If PAML, R, and the R packages ape and adephylo are installed and functional, LSX runs regardless of the platform, with all parameters given in a single plain text control file. LSX reads sequence alignments in PHYLIP format and produces, for each gene, a version of the alignment with homogenized lineage evolutionary rates. In LSX, the best model of sequence evolution can be specified for each gene, thus improving branch length estimations, and users can select more than three lineages of interest (LOIs) for the lineage evolutionary rate heterogeneity test (Additional file 1: Figure S1a,b).
Within LSX we also implemented LS4, a new data subselection algorithm optimized for datasets in which both sequences that evolve too fast and sequences that evolve too slowly disrupt lineage rate homogeneity. In such cases, the approach of LS3, which removes only fast-evolving sequences, can lead to the excessive flagging of data (Additional file 1: Table S1). This is because LS3 will flag and remove sequences with intermediate evolutionary rates, since they are still evolving “too fast” relative to the extremely slow-evolving ones (Additional file 1: Figure S2).
LS4 employs a different criterion to homogenize lineage evolutionary rates, which considers both markedly fast- and slow-evolving sequences for removal. Under LS4, when the SBLs for all ingroup sequences of a given gene are calculated, they are grouped by the user-defined LOI to which they belong. The slowest-evolving sequence of each LOI is identified, and then the fastest-evolving among them across all ingroup lineages is picked as a benchmark (i.e., “the fastest of the slowest”, see Additional file 1: Figure S1c). Because in both LS3 and LS4 each LOI has to be represented by at least one sequence, this “fastest (longest) of the slowest (shortest)” sequence represents the slowest evolutionary rate at which all lineages could converge. Then, LS4 removes the ingroup sequence that produces the tip furthest from the benchmark, be it faster- or slower-evolving (Additional file 1: Figure S1d).
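The "fastest of the slowest" criterion can be written out in a few lines of Python. This is an illustrative sketch only, not the LSX implementation: the function name, the dictionary inputs, and the toy SBL values are assumptions, and it covers a single removal step without the surrounding LRT loop or any check that every LOI stays represented.

```python
def ls4_pick_removal(sbl, loi_of):
    """Return (benchmark sequence, removal candidate) for one LS4-style step.

    sbl    : dict, ingroup sequence name -> root-to-tip sum of branch lengths
    loi_of : dict, ingroup sequence name -> lineage-of-interest label
    """
    # Slowest-evolving (shortest SBL) sequence within each LOI.
    slowest_per_loi = {}
    for seq, length in sbl.items():
        loi = loi_of[seq]
        if loi not in slowest_per_loi or length < sbl[slowest_per_loi[loi]]:
            slowest_per_loi[loi] = seq
    # Benchmark: the fastest (longest SBL) among those slowest sequences.
    benchmark_seq = max(slowest_per_loi.values(), key=sbl.get)
    benchmark = sbl[benchmark_seq]
    # Candidate: the tip furthest from the benchmark, faster or slower.
    candidate = max(sbl, key=lambda s: abs(sbl[s] - benchmark))
    return benchmark_seq, candidate


# Toy gene (hypothetical values) with one very slow and one fast sequence.
toy_sbl = {
    "a_slow": 0.02, "a_mid": 0.30,
    "b_mid": 0.28, "b_fast": 0.60,
    "c_mid": 0.33, "c_mid2": 0.35,
}
toy_loi = {
    "a_slow": "A", "a_mid": "A",
    "b_mid": "B", "b_fast": "B",
    "c_mid": "C", "c_mid2": "C",
}
benchmark_seq, candidate = ls4_pick_removal(toy_sbl, toy_loi)
ls3_candidate = max(toy_sbl, key=toy_sbl.get)  # LS3 would target the fastest tip
print(benchmark_seq, candidate, ls3_candidate)
```

With these toy values the benchmark is `c_mid`, and the tip furthest from it is the very slow `a_slow` rather than `b_fast`, the sequence a fastest-first rule would pick.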
Results
We compared the efficiency of LSX relative to our previous script LS3-bash with simulated data (Additional file 1: Supplementary Methods), and found LSX to perform the LS3 algorithm 7× faster than LS3-bash with a 100-gene dataset, and 8× faster with a 500-gene dataset (Additional file 1: Table S1). We then compared the relative effectiveness of LS4 and LS3 when analyzing datasets in which there were mainly average- and fast-evolving sequences, and datasets in which there were very slow-, average-, and very fast-evolving sequences (Additional file 1: Supplementary Methods). In the former case, both LS3 and LS4 gave similar results (Additional file 1: Table S1). In the latter case, which includes very slow- and very fast-evolving sequences, the data subsampling under LS3 was too stringent and substantially reduced the phylogenetic signal, and only the data remaining after LS4 were able to clearly resolve the phylogeny (Additional file 1: Table S1). In addition, we applied both algorithms, as implemented in LSX, to a biological case study: a 10-gene dataset of the catfish order Siluriformes [10]. There are two conflicting hypotheses for the most basal splits of this phylogeny: one proposed by morphological phylogenetics, and one proposed by molecular phylogenetics (e.g. [11, 12]). The point of conflict is the positioning of the fast-evolving lineage Loricarioidei, which is closer to the root in molecular phylogenies than in morphological phylogenies. The attraction of the fast-evolving Loricarioidei lineage towards the root may be an artifact due to strong lineage rate heterogeneity, which allowed us to explicitly test the different approaches of LS3 and LS4.
Discussion
The results presented in [10] show that LS3 was able to find taxa subsets with lineage rate homogeneity in six out of the ten genes, and flagged four complete genes as unsuitable for analysis. Analyzing the LS3-processed dataset showed that the basal split of Siluriformes is indeed affected by lineage rate heterogeneity, and that there was a strong signal supporting the morphological hypothesis of the root. However, these results were not entirely satisfactory because one ingroup species was incorrectly placed among the outgroups, and one of the well-established clades of the phylogeny was not recovered. In contrast, LS4 found lineage rate homogeneity in seven out of the ten genes (only three genes were flagged), the final phylogeny showed the morphological hypothesis of the root, and all the ingroup taxa plus the well-established clades were recovered. In this case study, both LS3 and LS4 successfully mitigated the effect of lineage rate heterogeneity, but the data subselection criterion of LS4 allowed the inclusion of more data for the final analysis, and resulted in a phylogeny with better resolution.
Conclusions
The new program presented here, LSX, represents a substantial improvement over our initial scripts in LS3-bash. LSX is faster and platform-independent, its code is accessible, and it includes a new version of the algorithm, LS4. We show here and in a recent publication that this new version is more effective than LS3 in increasing the phylogenetic to non-phylogenetic signal ratio when extremely slow-evolving sequences are present in addition to very fast-evolving ones, and that it helped to resolve a long-standing controversy in catfish phylogenetics. We also see potential in both algorithms for scanning genome-wide datasets and using the gene flagging data to identify regions in which a single lineage shows markedly accelerated evolution (such as human accelerated regions [13, 14]). Alternatively, the same data could be used to identify genomic regions that are highly conserved (and thus slow-evolving) among some lineages but not others (e.g., conserved non-coding elements [15]). As research in phylogenetics progresses in the wake of the genomic era, we must begin to resolve the most contentious nodes of the tree of life, where the usual methods may not be as effective. To undertake these challenges, we believe that accessible data subselection programs with clear criteria are a necessary tool, and should be made available whenever possible.
Availability and requirements
Project name: LSX v1.1.
Project homepage: https://github.com/carlosj-rr/LSx
Operating systems: Platform independent.
Programming language: R.
Other requirements: R 3.3.x or higher, R package ape 5.1 or higher (and dependencies), R package adephylo 1.1 or higher (and dependencies), PAML 4.
License: GNU GPL 3.0.
Any restrictions to use by non-academics: license needed.
Additional file
Acknowledgements
We thank Jose Nunes for his suggestions during the programming of LSX in R, and Joe Felsenstein for discussions about the criterion used in the LS4 algorithm.
Abbreviations
- LBA: Long branch attraction
- LOI: Lineages of interest
- LRT: Likelihood ratio test
- LS3: Locus specific sequence subsampling
- LS4: LS3 supplement
- SBL: Sum of branch lengths
Authors’ contributions
CJRR and JIMB developed the algorithms. CJRR wrote the initial code drafts and finalized them with input from JIMB. Both authors wrote the manuscript, and both read and approved the final manuscript.
Funding
This work was supported by the Swiss National Science Foundation (grant 31003A_141233 to JIMB) and the Institute for Genetics and Genomics in Geneva (iGE3). The funding bodies had no role in the design of this study, its data collection and analysis, the interpretation of its data, nor in the writing of the manuscript.
Availability of data and materials
LSx.R, the LSX manual wiki, and example datasets are available at: https://github.com/carlosj-rr/LSx.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Footnotes
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
- 1. Rivera-Rivera CJ, Montoya-Burgos JI. LS3: a method for improving phylogenomic inferences when evolutionary rates are heterogeneous among taxa. Mol Biol Evol. 2016;33:1625–34. doi: 10.1093/molbev/msw043.
- 2. Felsenstein Joseph. Evolutionary trees from DNA sequences: A maximum likelihood approach. Journal of Molecular Evolution. 1981;17(6):368–376. doi: 10.1007/BF01734359.
- 3. Cruaud Astrid, Rasplus Jean-Yves. Testing cospeciation through large-scale cophylogenetic studies. Current Opinion in Insect Science. 2016;18:53–59. doi: 10.1016/j.cois.2016.10.004.
- 4. Bleidorn Christoph. Phylogenomics. Cham: Springer International Publishing; 2017.
- 5. R Core Team. R: A language and environment for statistical computing. Vienna: R Found Stat Comput; 2016. https://www.r-project.org/.
- 6. Yang Z. PAML 4: Phylogenetic Analysis by Maximum Likelihood. Molecular Biology and Evolution. 2007;24(8):1586–1591. doi: 10.1093/molbev/msm088.
- 7. Paradis E., Claude J., Strimmer K. APE: Analyses of Phylogenetics and Evolution in R language. Bioinformatics. 2004;20(2):289–290. doi: 10.1093/bioinformatics/btg412.
- 8. Popescu Andrei-Alin, Huber Katharina T., Paradis Emmanuel. ape 3.0: New tools for distance-based phylogenetics and evolutionary analysis in R. Bioinformatics. 2012;28(11):1536–1537. doi: 10.1093/bioinformatics/bts184.
- 9. Jombart Thibaut, Balloux François, Dray Stéphane. adephylo: new tools for investigating the phylogenetic signal in biological traits. Bioinformatics. 2010;26(15):1907–1909. doi: 10.1093/bioinformatics/btq292.
- 10. Rivera-Rivera CJ, Montoya-Burgos JI. Back to the roots: reducing evolutionary rate heterogeneity among sequences gives support for the early morphological hypothesis of the root of Siluriformes (Teleostei: Ostariophysi). Mol Phylogenet Evol. 2018;127:272–279. doi: 10.1016/j.ympev.2018.06.004.
- 11. Sullivan JP, Lundberg JG, Hardman M. A phylogenetic analysis of the major groups of catfishes (Teleostei: Siluriformes) using rag1 and rag2 nuclear gene sequences. Mol Phylogenet Evol. 2006;41:636–662. doi: 10.1016/j.ympev.2006.05.044.
- 12. Diogo R. The Origin of Higher Taxa. 2007. 10.1093/acprof:oso/9780199691883.001.0001.
- 13. Bird Christine P, Stranger Barbara E, Liu Maureen, Thomas Daryl J, Ingle Catherine E, Beazley Claude, Miller Webb, Hurles Matthew E, Dermitzakis Emmanouil T. Fast-evolving noncoding sequences in the human genome. Genome Biology. 2007;8(6):R118. doi: 10.1186/gb-2007-8-6-r118.
- 14. Gittelman Rachel M., Hun Enna, Ay Ferhat, Madeoy Jennifer, Pennacchio Len, Noble William S., Hawkins R. David, Akey Joshua M. Comprehensive identification and analysis of human accelerated regulatory DNA. Genome Research. 2015;25(9):1245–1255. doi: 10.1101/gr.192591.115.
- 15. Polychronopoulos Dimitris, King James W. D., Nash Alexander J., Tan Ge, Lenhard Boris. Conserved non-coding elements: developmental gene regulation meets genome organization. Nucleic Acids Research. 2017;45(22):12611–12624. doi: 10.1093/nar/gkx1074.