Abstract
Genome-scale data have greatly facilitated the resolution of recalcitrant nodes that Sanger-based datasets have been unable to resolve. However, phylogenomic studies continue to use traditional methods such as bootstrapping to estimate branch support, and high bootstrap values are still interpreted as providing strong support for the correct topology. Furthermore, relatively little attention has been given to assessing discordances between gene and species trees, and the underlying processes that produce phylogenetic conflict. We generated novel genomic datasets to characterize and determine the causes of discordance in Old World treefrogs (Family: Rhacophoridae), a group that is fraught with conflicting and poorly supported topologies among major clades. Additionally, a suite of data filtering strategies and analytical methods was applied to assess their impact on phylogenetic inference. Incomplete lineage sorting was detected at all nodes that exhibited high levels of discordance. Those nodes were also associated with extremely short internal branches. We also demonstrate that bootstrap values do not reflect uncertainty or confidence for the correct topology and, hence, should not be used as a measure of branch support in phylogenomic datasets. Overall, we showed that phylogenetic discordances in Old World treefrogs resulted from incomplete lineage sorting and that species tree inference can be improved using a multi-faceted, total-evidence approach, which uses as much data as possible and considers results from different analytical methods and datasets.
Keywords: incomplete lineage sorting, anomaly zone, bootstrap, parsimony informative sites, concordance factor, branch support
1. Introduction
One of the main foci of phylogenomics is to elucidate ambiguous relationships that Sanger-derived markers were unable to resolve. For Sanger-based phylogenetic analysis, the gold standard for resolving relationships is to obtain strong branch support, usually measured in the form of high bootstrap proportions or Bayesian posterior probabilities [1,2]. However, phylogenomic studies have demonstrated that these measures do not necessarily reflect confidence for the correct species tree topology. High branch support can be positively misleading and adding orders of magnitude more data can counterintuitively increase spurious support for the wrong tree [3–5]. This is corroborated by in-depth examinations of genome-scale data that have revealed high levels of gene tree discordance [6–8] and well-supported but conflicting evolutionary histories among different data partitions, portions of the genome or classes of markers [9,10].
Species tree inference using genomic data can be biased by systematic errors resulting from insufficient informative data leading to unresolved gene trees [11], different data filtering strategies or datatypes [7,12–15], and the implementation of different analytical methods such as concatenation versus coalescent methods [16–18]. Although various analytical strategies have been implemented to reduce systematic error, a best-practice consensus has yet to be reached, indicating a lack of understanding of how to effectively analyse large genomic datasets, thereby hampering our ability to fully harness the power of genome-scale data.
Discordance between gene trees and the species tree can also result from biological processes such as introgression, gene duplication and loss, horizontal gene transfer and incomplete lineage sorting (ILS) [19–21]. ILS occurs when ancestral polymorphisms do not reach fixation between successive divergence events. This can occur during periods of rapid diversification, particularly when effective population size is large relative to its associated branch length [22]. One of the adverse effects of extreme ILS is that the most common gene trees may not be congruent with the species tree topology. In such instances, the tree-space where these so-called anomalous gene trees predominate is known as the anomaly zone [23]. Consequently, species tree inference in the anomaly zone can be challenging due to insufficient amounts of informative sites at short branches [24], stochastic mutational variance [25] and inappropriate methods that do not account for ILS [16]. Although the conceptual basis of the anomaly zone has been well established, its occurrence and effects on empirical systems have rarely been demonstrated [26,27]. As such, it is important for empirical phylogenomic studies to examine the numerous underlying factors that could potentially bias species tree inference [7,28]. A multifarious approach that thoroughly explores potential biases, considers ameliorative steps and assesses the consistency of results across different treatments of data and comparisons of analytical methods should therefore be the new benchmark for phylogenomic studies.
Old World treefrogs of the family Rhacophoridae consist of 428 species that are widely distributed across Asia and Southeast Asia, with a disjunct occurrence in Africa [29]. Although this charismatic family has been the focus of many phylogenetic studies, the relationships of several major clades have yet to be unambiguously resolved [30–33]. In particular, the placements of the genera Gracixalus, Philautus, Feihyla, Polypedates and Rhacophorus sensu lato (s.l.) have never been completely resolved [33–36]. Additionally, the genus Nyctixalus has been inferred with strong support in two alternative topological positions [37–39]. All phylogenetic studies on rhacophorids have so far focused on obtaining strong branch support. By contrast, no studies have assessed the robustness of inferred relationships (despite ostensibly strong support) or the causes of phylogenetic discordance, both of which can provide a deeper understanding of the evolutionary processes that affect species tree inference. As such, we employed a multifarious approach using a large and diverse set of phylogenomic markers consisting of exons, introns, UCEs and a set of 30 Sanger-based markers generated using FrogCap [40], to compare the consistency of inferred relationships among rhacophorid genera. We also assessed the effects of various data filtering strategies and analytical methods, in addition to an in-depth examination of gene tree and branch length variation, to determine the underlying causes of phylogenetic discordance.
2. Materials and methods
(a). Taxon sampling and DNA extraction
Datasets consist of five outgroup (Arthroleptis variabilis, Scaphiophryne marmorata, Cornufer guentheri, Boophis tephraeomystax and Mantidactylus melanopleura) and 45 ingroup samples. The ingroup is represented by 10 of the 18 known genera: Nyctixalus, Theloderma, Gracixalus, Kurixalus, Raorchestes, Philautus, Chiromantis, Feihyla, Polypedates and Rhacophorus. The outgroup comprises taxa from the family Mantellidae (sister to Rhacophoridae), followed by taxa from progressively distant families such as Ceratobatrachidae, Arthroleptidae and Microhylidae [41]. Tissue samples for molecular work were obtained from the museum holdings of the University of Kansas (KU), California Academy of Sciences (CAS), La Sierra University Herpetological Collection, Riverside, California (LSUHC), Field Museum of Natural History (FMNH), Australian Museum (AMH) and the Museum of Vertebrate Zoology, Berkeley (MVZ). The list of samples and their associated metadata are presented in electronic supplementary material, table S1.
(b). Bioinformatics
Details on library preparation, sequencing and bioinformatics are provided in the electronic supplementary material. Alignments were generated and processed using custom scripts (https://github.com/chutter/FrogCap-Sequence-Capture) and saved separately into usable datasets for phylogenetic analyses and data type comparisons: (i) introns (exons were trimmed from the original contig and the two remaining intronic regions were concatenated); (ii) exons (each alignment was adjusted to be in an open-reading frame and trimmed to the largest reading frame that accommodated greater than 90% of the sequences); (iii) UCEs; and (iv) ‘legacy’, commonly used nuclear markers from amphibian phylogenetic studies (full list of markers can be obtained from [40]).
(c). Data filtering and phylogenetic analysis
To assess potential biases arising from different datatypes and data filtering strategies, we analysed all datasets separately. Because markers with missing taxa and low phylogenetic information can potentially increase gene tree estimation error [11], each dataset was filtered at 75% taxon sampling completeness (markers with 25% or more missing data were discarded) and 50% parsimony informative sites (PIS; bottom 50% of markers with the least PIS were discarded). Data filtering was not performed on the legacy dataset due to insufficient numbers of markers. Finally, we performed a total-evidence analysis on a combined dataset consisting of unfiltered exon, intron and UCE markers (all-combined). Sampling completeness and PIS were calculated using the summary function in the program AMAS [42].
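The two filtering criteria described above can be illustrated with a minimal Python sketch. This is not the AMAS implementation or the custom scripts used in the study; the function names, the treatment of `-`, `?` and `N` as missing data, and the rounding rule for the retained half are assumptions made for illustration only. A site is counted as parsimony informative when at least two character states each occur in at least two sequences.

```python
import math
from typing import Dict, List

# Assumption: these symbols are treated as missing data in nucleotide alignments.
MISSING = set("-?N")


def count_pis(alignment: List[str]) -> int:
    """Count parsimony informative sites in a list of equal-length sequences."""
    n_pis = 0
    for column in zip(*alignment):
        counts: Dict[str, int] = {}
        for state in column:
            state = state.upper()
            if state in MISSING:
                continue
            counts[state] = counts.get(state, 0) + 1
        # Informative: at least two states, each seen in at least two sequences.
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            n_pis += 1
    return n_pis


def filter_by_completeness(taxa_per_marker: Dict[str, int], n_taxa: int,
                           max_missing: float = 0.25) -> List[str]:
    """Keep markers whose proportion of missing taxa is below max_missing."""
    return [marker for marker, present in taxa_per_marker.items()
            if (n_taxa - present) / n_taxa < max_missing]


def filter_by_pis(pis_per_marker: Dict[str, int],
                  keep_fraction: float = 0.5) -> List[str]:
    """Keep the top fraction of markers ranked by number of PIS."""
    ranked = sorted(pis_per_marker, key=pis_per_marker.get, reverse=True)
    n_keep = math.ceil(len(ranked) * keep_fraction)  # rounding rule is an assumption
    return ranked[:n_keep]
```

In this sketch the two filters are applied independently to the unfiltered set of markers, mirroring the separate mis75 and pis50 datasets analysed here.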
Phylogenetic estimation was performed using concatenation and summary coalescent-based methods. For the analysis of concatenated data, we used the program IQ-TREE v. 1.7 [43] to perform two sets of analyses on each dataset. The first was an unpartitioned analysis using the GTR + GAMMA substitution model, while the second was a partitioned analysis (partitioned by marker) with the GTR + GAMMA model applied to each partition. We applied a parameter-rich substitution model for both analyses (as opposed to model testing) due to computational limitations, and because this strategy has been shown to produce similar, if not better results for large phylogenomic datasets [44,45]. Branch support was assessed using ultrafast bootstrapping (1000 replicates) [46]. We also tested for saturation at the first, second and third codon positions using the index for substitution saturation implemented in the program DAMBE7 [47]. To overcome the issue of saturation, we conducted an IQ-TREE analysis on the amino acid alignment of the unfiltered exon dataset.
We performed a summary-based species tree analysis using ASTRAL-III [48]. IQ-TREE was used to estimate gene trees for each individual marker. To assess the effects of model selection on gene tree and species tree inference, we implemented two strategies. First, gene trees were estimated using the best-fit substitution model for each individual marker as determined by the program ModelFinder [49]. Second, we applied a parameter-rich model (GTR + GAMMA) on all gene trees. In both analyses, the best ML tree was retained as input for the ASTRAL analysis. We also implemented the site-based method SVDQuartets [50] that bypasses the use of gene trees altogether. The Quartet Fiduccia–Mattheyses (QFM) assembly algorithm was implemented without local searching. Topology tests were performed in IQ-TREE to assess the fit of inferred topologies against the all-combined alignment (additional details in the electronic supplementary material).
(d). Branch support and congruence
In addition to bootstrap (IQ-TREE) and posterior probabilities (ASTRAL), we used gene (gCF) and site concordance factors (sCF) to investigate topological conflict around each branch of the species tree. For every branch of the species tree, the gCF and sCF represent the percentage of decisive gene trees and alignment sites, respectively, containing that branch [51]. Assuming that the only source of discordance between gene and species trees is ILS, the multi-species coalescent model predicts that the probability of a gene tree quartet matching the species tree topology is higher than the probability of matching either of the two alternatives. Additionally, the two alternatives will have similar frequencies [52,53]. Using this expectation, we calculated the relative frequency of branch quartets surrounding focal clades. We then used a χ2-test to determine whether the frequencies of gene trees (gCF) and sites (sCF) supporting the two alternate topologies were significantly different [54]. Non-significant p-values (p > 0.05) indicate a failure to reject the hypothesis of equal frequencies, implying that discordance among gene trees and/or sites is likely due to ILS. Concordance factors were calculated in IQ-TREE [51] and the χ2-test was performed in R using scripts available at http://www.robertlanfear.com/blog/files/concordance_factors.html. Because ILS is often associated with rapid radiations, we tested focal branch lengths (d, in coalescent units) for polytomies (d = 0), where all three frequencies are expected to be close to one-third if polytomous [55].
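The logic of the equal-frequency test can be sketched in a few lines of Python. This is an illustrative reimplementation, not the R script used in the study: the function names are hypothetical, and the test is assumed to be a one-degree-of-freedom χ2 goodness-of-fit test of the two discordant counts against a 50:50 expectation without continuity correction. The first function gives the standard multi-species coalescent expectation for the three quartet topologies around an internal branch of length d (in coalescent units), which reduces to one-third each when d = 0.

```python
import math
from typing import Tuple


def quartet_probabilities(d: float) -> Tuple[float, float, float]:
    """Expected frequencies (concordant, alternative 1, alternative 2)
    under the multi-species coalescent for an internal branch of length d."""
    discordant = math.exp(-d) / 3.0
    return (1.0 - 2.0 * discordant, discordant, discordant)


def equal_frequency_test(n_alt1: int, n_alt2: int) -> Tuple[float, float]:
    """Chi-square test (1 d.f.) that two discordant topologies are equally
    frequent. Returns (chi-square statistic, p-value)."""
    total = n_alt1 + n_alt2
    if total == 0:
        raise ValueError("no discordant gene trees or sites to test")
    expected = total / 2.0
    chi2 = ((n_alt1 - expected) ** 2 + (n_alt2 - expected) ** 2) / expected
    # Survival function of the chi-square distribution with 1 d.f.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value
```

For example, `equal_frequency_test(60, 40)` gives a statistic of 4.0 and a p-value just below 0.05, so equal frequencies would be rejected, whereas a failure to reject would be consistent with ILS as the main source of discordance.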
Short internal branches in a species tree are also in danger of being in the anomaly zone and can generate gene tree topologies that do not match the species tree [23]. We followed the approach by [26,56] to estimate the presence of the anomaly zone in four-taxon subclades by comparing consecutive, ancestor and descendant, internal branch lengths. If these sets of internodes were in the anomaly zone, at least one anomalous gene tree would be expected to be present. Relative branch quartet frequency analyses were performed using the program DiscoVista [53]. Polytomy tests were performed and branch lengths (in coalescent units) were obtained with ASTRAL-III.
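The comparison of consecutive internal branches can be illustrated with the following Python sketch. It is not the script used in this study. It assumes the four-taxon anomaly zone boundary a(x) = log[2/3 + (3e^(2x) − 2) / (18(e^(3x) − e^(2x)))] as we understand it from [23], where x is the length of the ancestral internal branch and y the length of its descendant internal branch, both in coalescent units; the function names are hypothetical.

```python
import math


def anomaly_zone_limit(x: float) -> float:
    """Boundary a(x) of the four-taxon anomaly zone for an ancestral
    internal branch of length x (coalescent units, x > 0)."""
    if x <= 0:
        raise ValueError("branch length x must be positive")
    numerator = 3.0 * math.exp(2.0 * x) - 2.0
    denominator = 18.0 * (math.exp(3.0 * x) - math.exp(2.0 * x))
    return math.log(2.0 / 3.0 + numerator / denominator)


def in_anomaly_zone(x: float, y: float) -> bool:
    """True if the descendant branch y is shorter than a(x), i.e. the pair
    of consecutive internal branches falls inside the anomaly zone."""
    return y < anomaly_zone_limit(x)
```

Under this formula a(x) becomes negative once x exceeds roughly 0.27 coalescent units, so only pairs in which the ancestral branch is very short can fall inside the anomaly zone.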
3. Results
(a). Genomic data
After quality filtering, loci matching and alignment, a total of 13 221 loci were obtained, including 12 370 exons, 12 257 introns, 650 UCEs and 30 legacy loci after trimming (table 1). On average, legacy markers were longest, followed by UCEs, introns and exons. Intron datasets had the most characters and the highest number of variable sites and PIS. Although UCEs had similar average PIS compared to introns, the average proportion of PIS per marker was approximately half that of introns (table 1).
Table 1.
dataset | number of markers (mean length) | base pairs | variable sites: total | variable sites: mean | variable sites: mean prop. | PIS: total | PIS: mean | PIS: mean prop. |
---|---|---|---|---|---|---|---|---|
unf-exon | 12 370 (208) | 2 579 130 | 1 173 581 | 95 | 0.43 | 767 630 | 62 | 0.27 |
mis75-exon | 8328 (229) | 1 904 991 | 902 898 | 108 | 0.44 | 609 565 | 73 | 0.30 |
pis50-exon | 6185 (277) | 1 715 499 | 863 701 | 140 | 0.50 | 582 889 | 94 | 0.33 |
unf-intron | 12 257 (424) | 5 191 144 | 4 443 393 | 362 | 0.86 | 3 387 639 | 276 | 0.65 |
mis75-intron | 7558 (452) | 3 414 409 | 2 981 322 | 394 | 0.88 | 2 364 933 | 313 | 0.70 |
pis50-intron | 6129 (531) | 3 256 482 | 2 887 746 | 471 | 0.89 | 2 283 964 | 373 | 0.71 |
unf-UCE | 650 (744) | 483 713 | 247 439 | 381 | 0.50 | 150 023 | 231 | 0.30 |
mis75-UCE | 533 (777) | 414 172 | 217 542 | 408 | 0.50 | 136 055 | 255 | 0.32 |
pis50-UCE | 325 (878) | 285 325 | 163 895 | 504 | 0.58 | 106 704 | 328 | 0.38 |
legacy | 30 (1150) | 34 506 | 13 046 | 435 | 0.35 | 8405 | 280 | 0.22 |
(b). Phylogenetic inference
Saturation tests showed non-significant saturation at all three codon positions (electronic supplementary material, table S2 and figure S1). Partitioned and unpartitioned IQ-TREE analyses yielded congruent topologies. Similarly, ASTRAL species tree topologies derived from gene trees estimated using either model testing or GTR + GAMMA produced congruent topologies. Insignificant changes in branch lengths and branch support were detected in both comparisons (partitioned versus unpartitioned; model testing versus GTR + GAMMA).
Overall, five different topologies (referred to as T1–T5; figure 1) were inferred across our three classes of phylogenetic analyses and 11 datasets. IQ-TREE and ASTRAL produced congruent topologies except for the exon datasets (table 2). The exon datasets produced variable results and inferred either T1 or T3 depending on the type of analytical method and filtering scheme. SVDQuartets consistently inferred T5 across all datasets (figure 1 and table 2). The topology tests showed that T1 was the optimal topology, while topologies T2–T5 were strongly rejected (electronic supplementary material, table S3).
Table 2.
dataset | topology: IQ-TREE | topology: ASTRAL-III (QS) | topology: SVDQ | N1: BS/PP | N1: AZ | N2: BS/PP | N2: AZ | N3: BS/PP | N3: AZ |
---|---|---|---|---|---|---|---|---|---|
all-combined | T1 | T1 (0.71) | T5 | 100/1 | y | 100/1 | y | 100/1 | y |
unf-exon | T3 | T1 (0.68) | T5 | 100/1 | y | 100/1 | y | 100/1 | y |
mis75-exon | T3 | T1 (0.68) | T5 | 100/1 | y | 100/1 | y | 100/0.75 | y |
pis50-exon | T3 | T3 (0.71) | T5 | 100/1 | y | 100/1 | y | 100/0.23 | y |
unf-intron | T2 | T2 (0.76) | T5 | 100/1 | y | 100/1 | y | 100/1 | y |
mis75-intron | T2 | T2 (0.76) | T5 | 100/1 | y | 100/1 | y | 100/1 | y |
pis50-intron | T2 | T2 (0.77) | T5 | 100/1 | y | 100/1 | y | 100/1 | n |
unf-UCE | T1 | T1 (0.77) | T5 | 100/1 | y | 99/0.67 | y | 60/0.89 | n |
mis75-UCE | T1 | T1 (0.77) | T5 | 99/1 | y | 93/0.63 | y | 72/0.91 | n |
pis50-UCE | T1 | T1 (0.80) | T5 | 99/1 | y | 98/0.81 | y | 76/0.79 | n |
legacy | T4 | T4 (0.81) | NA | 99/1 | n | 59/0.76 | n | 53/0.43 | y |
(c). Phylogenetic discordance
Discordance across the five topologies revolved around three nodes (N1–N3; figure 1). A comparison between bootstrap and concordance factors showed that there was no relationship between bootstrap support and congruence (gCF and sCF; figure 2a). All focal nodes had high bootstrap values despite high levels of discordance, as indicated by low gCF and sCF values (figure 2a), and there was no correlation between branch length and its associated bootstrap value (figure 2b). However, there was a strong positive correlation between branch length and concordance factors (figure 2c,d). Gene concordance did not significantly improve when data were filtered by taxon completeness but showed marginal improvements when filtered by PIS (electronic supplementary material, figure S2).
The relative frequency of gene trees at N1 revealed that most gene trees supported the T1 topology across all but the legacy dataset, which had a significantly larger proportion of gene trees supporting the T4 topology (figure 3). At N2, equal proportions of gene trees supported the T1 and T2 topologies in the all-combined dataset. Most gene trees supported the T1 topology in the exon, UCE and legacy datasets, whereas only 2% more gene trees supported the T2 topology across all intron datasets. At N3, exon datasets inferred either the T1 or T3 topologies depending on the filtering scheme and the analytical method. For exon datasets filtered at 50% PIS and 75% completeness, the T3 topology was inferred when analysed using IQ-TREE, but the T1 topology was inferred when analysed with ASTRAL (table 2). The relative frequency analysis of those datasets revealed that equal proportions of gene trees supported either topology. For the unfiltered exon dataset, most gene trees supported the T1 topology, but the IQ-TREE analysis inferred the T3 topology. Node 3 was further revealed as a polytomy in all exon datasets (polytomies were not detected in the other datasets).
The χ2-test failed to reject (p > 0.05) the hypothesis of equal gene tree frequencies (gEF) for most datasets at N1 but rejected (p < 0.05) gEF for most datasets at N2 and N3. The hypothesis of equal site frequencies (sEF) could not be rejected across most datasets at N1 and N3 but was rejected across most datasets at N2 (electronic supplementary material, table S3). Anomaly zone calculations showed that the focal nodes N1–N3 were in the anomaly zone for most datasets except N2 for the legacy dataset and N3 for all UCE datasets and one intron dataset filtered for PIS (table 2).
4. Discussion
(a). Impact of model selection and data filtering on phylogenetic inference
Studies have shown that unpartitioned analyses of concatenated data can be statistically inconsistent in the presence of ILS and produce erroneous topologies [18,45,57]. However, our results from partitioned and unpartitioned analyses inferred similar topologies across all datasets, indicating that the differential effects of these methods are still not fully understood. ASTRAL species tree topologies were also congruent when gene trees were estimated using model testing or GTR + GAMMA, corroborating previous studies that demonstrated the insignificant impacts of model testing on species tree inference when large amounts of data are used [44,45]. These studies, including ours, show that applying a parameter-rich substitution model such as GTR + GAMMA in lieu of model testing is a viable strategy that can significantly reduce the computation time and load of large phylogenomic analyses.
Within each class of data (intron, exon and UCE), the implementation of different filtering strategies did not alter the final topology or significantly affect branch support. Although the final topology was unchanged, filtering by PIS was clearly more effective at improving concordance than filtering by taxon completeness (as assessed by ASTRAL quartet scores; table 2). Interestingly, at N2, the majority of gene trees across all intron datasets supported a suboptimal topology. This mismatch between the most probable gene tree topology (T2; figure 3) and the optimal species tree (T1) indicates that introns could be more susceptible to the adverse effects of ILS compared to exons or UCEs. Another noteworthy result is that all UCE datasets inferred the optimal topology despite containing substantially fewer markers than the intron and exon datasets (weaker branch support is presumably due to the smaller number of markers). Quartet scores for UCE datasets were also higher compared to introns and exons, indicating that UCE markers may be less affected by ILS (table 2). In agreement with [45], we showed that data filtering does not confer appreciable benefits in terms of species tree inference and that analysing more (unfiltered) data produces the best results. Surprisingly, the SVDQuartets analysis inferred a novel topology across all datasets (T5) that was not supported by the topology test, possibly due to high ILS and large numbers of sites per locus, which are known to affect the accuracy of this analysis [58].
(b). Assessing discordance in phylogenomic datasets
Our study also empirically demonstrates that traditional bootstrapping is not an appropriate measure to assess branch support. Bootstrap values were not correlated with topological concordance [51] and routinely produced strong support at highly discordant nodes. This is because resampling procedures, such as non-parametric bootstrapping, essentially measure site-sampling variance as opposed to observed variance in the data. Because site-sampling variance is an inverse function of sample size (amount of data), bootstrap values will concomitantly increase with the amount of analysed data [4,59]; this tendency does not necessarily reflect variation in the data themselves. Although it has long been known that the traditional bootstrap cannot be interpreted as the probability of a branch belonging to the true species tree and is biased by the total number of characters in the data [60], it remains one of the most widely used measures of branch support, even in the phylogenomic era. Hence, it is worth reiterating here that traditional bootstrapping should not be used as a measure of branch support in large phylogenomic datasets. Our study clearly shows that while low bootstrap support does indeed reflect uncertainty, high values can be positively misleading. Other measures such as local posterior probabilities and concordance factors provide a more accurate representation of topological variation and uncertainty in genomic datasets. However, higher concordance values can also be an artefact of small datasets [51] (our legacy datasets produced the highest overall concordance factors; electronic supplementary material, figure S2). Therefore, our results indicate that traditional bootstrapping is more appropriate for small datasets compared to concordance factors, while the converse should be applied for larger, genome-scale data.
Our results also showed that discordance was strongly correlated with branch length, where nodes with the highest discordance also had the shortest branches (figure 2). Similar to results from other phylogenomic studies, we showed that resolving very short internodes resulting from consecutive rapid diversification events remains a challenge, even with the availability of orders of magnitude more data than ever before [5,28,61].
(c). Causes of phylogenomic discordance in Rhacophoridae
All discordant nodes were associated with extremely short internal branches (figure 2), indicating that rapid diversification events were the likely cause of ILS. We further showed that discordant nodes were in the anomaly zone for most of the tested datasets, which explains the high incidence of anomalous gene trees (most probable gene trees that do not match the underlying species tree; figure 3). However, although present in most of our datasets, ILS had distinct effects on different nodes, datasets and analyses. At N1, only the legacy dataset inferred a different topology (T4), whereas all genomic datasets consistently inferred the T1 topology, indicating that the optimal topology could be recovered by increasing the amount of data. However, despite using ∼580 k–3.3 million PIS, exon and intron datasets inferred conflicting topologies with strong support at N2 and N3. At N2, the majority of gene trees across all intron datasets strongly supported the T2 topology, while SVDQuartets inferred the T5 topology, indicating that discordance at N2 could be a result of both ILS and gene tree estimation error. Although the exon datasets inferred the T3 topology at N3 (all other datasets and analyses inferred T1), the gene tree frequency analysis and polytomy test revealed that N3 was a polytomy across all exon datasets, which would explain the incongruence surrounding that node for those datasets. However, we were unable to determine whether the node constituted a hard or soft polytomy. Notwithstanding ILS, different classes of data are also likely undergoing different selection pressures [62,63], which could contribute to the conflicting but strongly supported alternate topologies.
Another possible cause of discordance is hidden paralogy, which stems from ancestral genome or gene duplications followed by losses of different gene copies, resulting in single gene copies that do not reflect the species genealogy [64,65]. We assessed hidden paralogy (described in electronic supplementary material) and did not uncover evidence of genome duplications or large numbers of gene duplications in Rhacophoridae. Instead, most detectable paralogues were novel gene duplications in extant taxa not shared with other taxa (electronic supplementary material, table S5). Because of the small number of paralogues overall, hidden paralogy is unlikely to drive widespread patterns of discordance. However, without complete genomes, we were unable to completely rule out the possibility that hidden paralogy could be present, such as in cases where a duplication occurs and different duplicates are lost, resulting in a single copy for all extant species.
Overall, our systematic analyses of different classes of data support the hypothesis that ILS (caused by rapid diversification events) is likely one of the dominant underlying factors responsible for most of the deep-level discordances in Old World treefrogs. Thus, our ability to establish direct links between phylogenetic uncertainty, ILS and the anomaly zone provides valuable insights into how underlying microevolutionary processes can complicate species tree estimation, despite using large amounts and different classes of genomic data.
Acknowledgements
Illumina sequencing was partially funded by NSF support to R.M.B. (DEB 1654388) and our fieldwork was supported by DEB 0743491 and 1557053. We thank La Sierra University, The Field Museum of Natural History (Chicago), Museum of Vertebrate Zoology (Berkeley, California), California Academy of Sciences and Jodi Rowley (Australian Museum) for providing tissues. This paper is contribution number 932 of the Auburn University Museum of Natural History. We thank the Philippine Department of the Environment and Natural Resources, the Solomon Islands Forestry Department, Madagascar's Ministry of the Environment, Forests and Tourism, and Thailand's Ministry of Natural Resources and Environment for issuing collecting and export permits to R.M.B. and colleagues at KU. We thank the Malagasy authorities for approving research permits (no. 298/13/MEF/SG/DGF/DCB.SAP/SCBSE, no. 303/14/MEEMF/SG/DGF/DAPT/SCBT, no. 329/15/MEEMF/SG/DGF/DAPT/SCBT); specimens were exported under permits (no. 017N-EV01/MG14, no. 055NEA02/MG15, no. 041N-EA01/MG16).
Data accessibility
Raw sequence reads are available at the GenBank SRA: BioProject PRJNA633673 (outgroup samples); BioProject PRJNA659075 (ingroup samples; raw sequences to be uploaded upon acceptance of manuscript). FrogCap bioinformatic scripts are available at: https://github.com/chutter/FrogCap-Sequence-Capture. Custom scripts used for analyses are available at: https://github.com/chankinonn/Phylogenomic-scripts. All relevant alignments, gene trees, consensus/species trees and partitioning files that are required to reproduce this study have been uploaded to the Dryad Digital Repository: https://doi.org/10.5061/dryad.8cz8w9gn7 [66].
Authors' contributions
K.O.C. designed the study, carried out laboratory molecular work, performed analysis and wrote the manuscript; C.R.H. designed the probeset and bioinformatics pipeline, provided bioinformatic support and critically revised the manuscript; P.L.W. helped with molecular laboratory work and revised the manuscript; L.L.G. provided tissue samples; R.M.B. critically revised the manuscript and provided funding for molecular work.
Competing interests
We declare we have no competing interests.
Funding
We received no funding for this study.
References
- 1. Hillis DM, Bull JJ. 1993. An empirical test of bootstrapping as a method for assessing confidence in phylogenetic analysis. Syst. Biol. 42, 182–192. (doi:10.1093/sysbio/42.2.182)
- 2. Wilcox TP, Zwickl DJ, Heath TA, Hillis DM. 2002. Phylogenetic relationships of the dwarf boas and a comparison of Bayesian and bootstrap measures of phylogenetic support. Mol. Phylogenet. Evol. 25, 361–371. (doi:10.1016/S1055-7903(02)00244-0)
- 3. Yang Z, Zhu T. 2018. Bayesian selection of misspecified models is overconfident and may cause spurious posterior probabilities for phylogenetic trees. Proc. Natl. Acad. Sci. USA 115, 1854–1859. (doi:10.1073/pnas.1712673115)
- 4. Kumar S, Filipski AJ, Battistuzzi FU, Kosakovsky Pond SL, Tamura K. 2012. Statistics and truth in phylogenomics. Mol. Biol. Evol. 29, 457–472. (doi:10.1093/molbev/msr202)
- 5. Roycroft EJ, Moussalli A, Rowe KC. 2019. Phylogenomics uncovers confidence and conflict in the rapid radiation of Australo-Papuan rodents. Syst. Biol. 69, 431–444. (doi:10.1093/sysbio/syz044)
- 6. Dell'Ampio E, et al. 2014. Decisive data sets in phylogenomics: lessons from studies on the phylogenetic relationships of primarily wingless insects. Mol. Biol. Evol. 31, 239–249. (doi:10.1093/molbev/mst196)
- 7. Jarvis ED, et al. 2014. Whole-genome analyses resolve early branches in the tree of life of modern birds. Science 346, 1320–1331. (doi:10.1126/science.1251385)
- 8. Jeffroy O, Brinkmann H, Delsuc F, Philippe H. 2006. Phylogenomics: the beginning of incongruence? Trends Genet. 22, 225–231. (doi:10.1016/j.tig.2006.02.003)
- 9. Platt RN, et al. 2018. Conflicting evolutionary histories of the mitochondrial and nuclear genomes in New World Myotis bats. Syst. Biol. 67, 236–249. (doi:10.1093/sysbio/syx070)
- 10. Reddy S, et al. 2017. Why do phylogenomic data sets yield conflicting trees? Data type influences the avian tree of life more than taxon sampling. Syst. Biol. 66, 857–879. (doi:10.1093/sysbio/syx041)
- 11. Rokas A, Williams BI, King N, Carroll SB. 2003. Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature 425, 798–804. (doi:10.1038/nature02053)
- 12. Chan KO, Alexander AM, Grismer LL, Su Y-C, Grismer JL, Quah ESH, Brown RM. 2017. Species delimitation with gene flow: a methodological comparison and population genomics approach to elucidate cryptic species boundaries in Malaysian torrent frogs. Mol. Ecol. 26, 5435–5450. (doi:10.1111/mec.14296)
- 13. Hosner PA, Faircloth BC, Glenn TC, Braun EL, Kimball RT. 2016. Avoiding missing data biases in phylogenomic inference: an empirical study in the landfowl (Aves: Galliformes). Mol. Biol. Evol. 33, 1110–1125. (doi:10.1093/molbev/msv347)
- 14. Huang H, Knowles LL. 2016. Unforeseen consequences of excluding missing data from next-generation sequences: simulation study of RAD sequences. Syst. Biol. 65, 357–365. (doi:10.1093/sysbio/syu046)
- 15. Molloy EK, Warnow T. 2017. To include or not to include: the impact of gene filtering on species tree estimation methods. Syst. Biol. 67, 285–303. (doi:10.1093/sysbio/syx077)
- 16. Mendes FK, Hahn MW. 2018. Why concatenation fails near the anomaly zone. Syst. Biol. 67, 158–169. (doi:10.1093/sysbio/syx063)
- 17. Mirarab S, Bayzid MS, Warnow T. 2016. Evaluating summary methods for multilocus species tree estimation in the presence of incomplete lineage sorting. Syst. Biol. 65, 366–380. (doi:10.1093/sysbio/syu063)
- 18. Roch S, Steel M. 2015. Likelihood-based tree reconstruction on a concatenation of aligned sequence data sets can be statistically inconsistent. Theor. Popul. Biol. 100, 56–62. (doi:10.1016/j.tpb.2014.12.005)
- 19. Léveillé-Bourret É, Starr JR, Ford BA, Moriarty Lemmon E, Lemmon AR. 2018. Resolving rapid radiations within angiosperm families using anchored phylogenomics. Syst. Biol. 67, 94–112. (doi:10.1093/sysbio/syx050)
- 20. Meiklejohn KA, Faircloth BC, Glenn TC, Kimball RT, Braun EL. 2016. Analysis of a rapid evolutionary radiation using ultraconserved elements: evidence for a bias in some multispecies coalescent methods. Syst. Biol. 65, 612–627. (doi:10.1093/sysbio/syw014)
- 21. Hahn MW. 2007. Bias in phylogenetic tree reconciliation methods: implications for vertebrate genome evolution. Genome Biol. 8, R141 (doi:10.1186/gb-2007-8-7-r141)
- 22. Maddison WP. 1997. Gene trees in species trees. Syst. Biol. 46, 523–536. (doi:10.1093/sysbio/46.3.523)
- 23. Degnan JH, Rosenberg NA. 2009. Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends Ecol. Evol. 24, 332–340. (doi:10.1016/j.tree.2009.01.009)
- 24. Xi Z, Liu L, Davis CC. 2015. Genes with minimal phylogenetic information are problematic for coalescent analyses when gene tree estimation is biased. Mol. Phylogenet. Evol. 92, 63–71. (doi:10.1016/j.ympev.2015.06.009)
- 25. Huang H, Knowles LL. 2009. What is the danger of the anomaly zone for empirical phylogenetics? Syst. Biol. 58, 527–536. (doi:10.1093/sysbio/syp047)
- 26. Linkem CW, Minin VN, Leaché AD. 2016. Detecting the anomaly zone in species trees and evidence for a misleading signal in higher-level skink phylogeny (Squamata: Scincidae). Syst. Biol. 65, 465–477. (doi:10.1093/sysbio/syw001)
- 27. Cloutier A, Sackton TB, Grayson P, Clamp M, Baker AJ, Edwards SV. 2019. Whole-genome analyses resolve the phylogeny of flightless birds (Palaeognathae) in the presence of an empirical anomaly zone. Syst. Biol. 68, 937–955. (doi:10.1093/sysbio/syz019)
- 28. Burbrink FT, et al. 2020. Interrogating genomic-scale data for Squamata (lizards, snakes, and amphisbaenians) shows no support for key traditional morphological relationships. Syst. Biol. 69, 502–520. (doi:10.1093/sysbio/syz062)
- 29. Frost DR. 2020. Amphibian species of the world: an online reference. Version 6.0 (accessed 21 April 2020). See http://research.amnh.org/herpetology/amphibia/index.html. New York, USA: Am. Museum Nat. Hist.
- 30. Hertwig ST, Schweizer M, Das I, Haas A. 2013. Diversification in a biodiversity hotspot – the evolution of Southeast Asian rhacophorid tree frogs on Borneo (Amphibia: Anura: Rhacophoridae). Mol. Phylogenet. Evol. 68, 567–581. (doi:10.1016/j.ympev.2013.04.001)
- 31. Nguyen TT, Matsui M, Eto K. 2015. Mitochondrial phylogeny of an Asian tree frog genus Theloderma (Anura: Rhacophoridae). Mol. Phylogenet. Evol. 85, 59–67. (doi:10.1016/j.ympev.2015.02.003)
- 32. Chan KO, Grismer LL, Brown RM. 2018. Comprehensive multi-locus phylogeny of Old World tree frogs (Anura: Rhacophoridae) reveals taxonomic uncertainties and potential cases of over- and underestimation of species diversity. Mol. Phylogenet. Evol. 127, 1010–1019. (doi:10.1016/j.ympev.2018.07.005)
- 33. Chen JM, et al. 2020. An integrative phylogenomic approach illuminates the evolutionary history of Old World tree frogs (Anura: Rhacophoridae). Mol. Phylogenet. Evol. 145, 106724 (doi:10.1016/j.ympev.2019.106724)
- 34. Meegaskumbura M, Senevirathne G, Biju SD, Garg S, Meegaskumbura S, Pethiyagoda R, Hanken J, Schneider CJ. 2015. Patterns of reproductive-mode evolution in Old World tree frogs (Anura, Rhacophoridae). Zool. Scr. 44, 509–522. (doi:10.1111/zsc.12121)
- 35. Li JT, Li Y, Klaus S, Rao D-Q, Hillis DM, Zhang Y-P. 2013. Diversification of rhacophorid frogs provides evidence for accelerated faunal exchange between India and Eurasia during the Oligocene. Proc. Natl. Acad. Sci. USA 110, 3441–3446. (doi:10.1073/pnas.1300881110)
- 36. Pyron AR, Wiens JJ. 2011. A large-scale phylogeny of Amphibia including over 2800 species, and a revised classification of extant frogs, salamanders, and caecilians. Mol. Phylogenet. Evol. 61, 543–583. (doi:10.1016/j.ympev.2011.06.012)
- 37. Poyarkov NA, Orlov NL, Moiseeva AV, Galoyan EA, Nguyen TT, Gogoleva SS. 2015. Sorting out moss frogs: mtDNA data on taxonomic diversity and phylogenetic relationships of the Indochinese species of the genus Theloderma (Anura, Rhacophoridae). Russ. J. Herpetol. 22, 241–280.
- 38. Sivongxay N, Davankham M, Phimmachak S, Phoumixay K, Stuart BL. 2016. A new small-sized Theloderma (Anura: Rhacophoridae) from Laos. Zootaxa 4147, 433–442. (doi:10.11646/zootaxa.4147.4.5)
- 39. Rowley JJL, Le DTT, Hoang HD, Dau VQ, Cao TT. 2011. Two new species of Theloderma (Anura: Rhacophoridae) from Vietnam. Zootaxa 3098, 1–20. (doi:10.11646/zootaxa.3098.1.1)
- 40. Hutter CR, Cobb KA, Portik DM, Travers SL, Wood PL, Brown RM. 2019. FrogCap: a modular sequence capture probe set for phylogenomics and population genetics for all frogs, assessed across multiple phylogenetic scales. bioRxiv 825307 (doi:10.1101/825307)
- 41. Feng Y-J, Blackburn DC, Liang D, Hillis DM, Wake DB, Cannatella DC, Zhang P. 2017. Phylogenomics reveals rapid, simultaneous diversification of three major clades of Gondwanan frogs at the Cretaceous–Paleogene boundary. Proc. Natl. Acad. Sci. USA 114, E5864–E5870. (doi:10.1073/pnas.1704632114)
- 42. Borowiec ML. 2016. AMAS: a fast tool for alignment manipulation and computing of summary statistics. PeerJ 4, e1660 (doi:10.7717/peerj.1660)
- 43. Nguyen LT, Schmidt HA, Von Haeseler A, Minh BQ. 2015. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32, 268–274. (doi:10.1093/molbev/msu300)
- 44. Abadi S, Azouri D, Pupko T, Mayrose I. 2019. Model selection may not be a mandatory step for phylogeny reconstruction. Nat. Commun. 10, 934 (doi:10.1038/s41467-019-08822-w)
- 45. Chan KO, Hutter CR, Wood PL, Grismer LL, Brown RM. 2020. Larger, unfiltered datasets are more effective at resolving phylogenetic conflict: introns, exons, and UCEs resolve ambiguities in golden-backed frogs (Anura: Ranidae; genus Hylarana). Mol. Phylogenet. Evol. 151, 106899 (doi:10.1016/j.ympev.2020.106899)
- 46. Hoang DT, Chernomor O, von Haeseler A, Minh BQ, Le SV. 2017. UFBoot2: improving the ultrafast bootstrap approximation. Mol. Biol. Evol. 35, 518–522. (doi:10.1093/molbev/msx281)
- 47. Xia X. 2018. DAMBE7: new and improved tools for data analysis in molecular biology and evolution. Mol. Biol. Evol. 35, 1550–1552. (doi:10.1093/molbev/msy073)
- 48. Zhang C, Rabiee M, Sayyari E, Mirarab S. 2018. ASTRAL-III: polynomial time species tree reconstruction from partially resolved gene trees. BMC Bioinf. 19, 15–30. (doi:10.1186/s12859-018-2129-y)
- 49. Kalyaanamoorthy S, Minh BQ, Wong TKF, von Haeseler A, Jermiin LS. 2017. ModelFinder: fast model selection for accurate phylogenetic estimates. Nat. Methods 14, 587–589. (doi:10.1038/nmeth.4285)
- 50. Chifman J, Kubatko L. 2014. Quartet inference from SNP data under the coalescent model. Bioinformatics 30, 3317–3324. (doi:10.1093/bioinformatics/btu530)
- 51. Minh BQ, Hahn MW, Lanfear R. 2020. New methods to calculate concordance factors for phylogenomic datasets. Mol. Biol. Evol. 37, 2727–2733. (doi:10.1093/molbev/msaa106)
- 52. Allman ES, Degnan JH, Rhodes JA. 2011. Identifying the rooted species tree from the distribution of unrooted gene trees under the coalescent. J. Math. Biol. 62, 833–862. (doi:10.1007/s00285-010-0355-7)
- 53. Sayyari E, Whitfield JB, Mirarab S. 2018. DiscoVista: interpretable visualizations of gene tree discordance. Mol. Phylogenet. Evol. 122, 110–115. (doi:10.1016/j.ympev.2018.01.019)
- 54. Huson DH, Klopper T, Lockhart PJ, Steel MA. 2005. Reconstruction of reticulate networks from gene trees. In Proceedings of the Ninth International Conference on Research in Computational Molecular Biology, pp. 233–249. Berlin, Germany: Springer.
- 55. Sayyari E, Mirarab S. 2018. Testing for polytomies in phylogenetic species trees using quartet frequencies. Genes 9, 132 (doi:10.3390/genes9030132)
- 56. Rosenberg NA. 2013. Discordance of species trees with their most likely gene trees: a unifying principle. Mol. Biol. Evol. 30, 2709–2713. (doi:10.1093/molbev/mst160)
- 57. Warnow T. 2015. Concatenation analyses in the presence of incomplete lineage sorting. PLoS Curr. Tree Life 7, 1–10.
- 58. Chou J, Gupta A, Yaduvanshi S, Davidson R, Nute M, Mirarab S, Warnow T. 2015. A comparative study of SVDquartets and other coalescent-based species tree estimation methods. BMC Genomics 16, 1–11. (doi:10.1186/1471-2164-16-S10-S2)
- 59. Felsenstein J. 1985. Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39, 783–791. (doi:10.1111/j.1558-5646.1985.tb00420.x)
- 60. Soltis PS, Soltis DE. 2003. Applying the bootstrap in phylogeny reconstruction. Stat. Sci. 18, 256–267. (doi:10.1214/ss/1063994980)
- 61. Streicher JW, Schulte JA, Wiens JJ. 2016. How should genes and taxa be sampled for phylogenomic analyses with missing data? An empirical study in Iguanian lizards. Syst. Biol. 65, 128–145. (doi:10.1093/sysbio/syv058)
- 62. Bravo GA, et al. 2019. Embracing heterogeneity: coalescing the tree of life and the future of phylogenomics. PeerJ 2019, 1–60. (doi:10.7717/peerj.6399)
- 63. Shen XX, Hittinger CT, Rokas A. 2017. Contentious relationships in phylogenomic studies can be driven by a handful of genes. Nat. Ecol. Evol. 1, 1–10. (doi:10.1038/s41559-017-0126)
- 64. Struck TH. 2013. The impact of paralogy on phylogenomic studies: a case study on annelid relationships. PLoS ONE 8, e0062892 (doi:10.1371/journal.pone.0062892)
- 65. Siu-Ting K, Torres-Sanchez M, Mauro DS, Wilcockson D, Wilkinson M, Pisani D, O'Connell MJ, Creevey CJ. 2019. Inadvertent paralog inclusion drives artifactual topologies and timetree estimates in phylogenomics. Mol. Biol. Evol. 36, 1344–1356. (doi:10.1093/molbev/msz067)
- 66. Chan KO, Hutter CR, Wood PL Jr, Grismer LL, Brown RM. 2020. Data from: Target-capture phylogenomics provide insights on gene and species tree discordances in Old World treefrogs (Anura: Rhacophoridae). Dryad Digital Repository. (doi:10.5061/dryad.8cz8w9gn7)
Data Availability Statement
Raw sequence reads are available at the GenBank SRA: BioProject PRJNA633673 (outgroup samples); BioProject PRJNA659075 (ingroup samples; raw sequences to be uploaded upon acceptance of manuscript). FrogCap bioinformatic scripts are available at: https://github.com/chutter/FrogCap-Sequence-Capture. Custom scripts used for analyses are available at: https://github.com/chankinonn/Phylogenomic-scripts. All relevant alignments, gene trees, consensus/species trees and partitioning files that are required to reproduce this study have been uploaded to the Dryad Digital Repository: https://doi.org/10.5061/dryad.8cz8w9gn7 [66].