Fast and accurate bootstrap confidence limits on genome-scale phylogenies using little bootstraps

Sudip Sharma; Sudhir Kumar

doi:10.1038/s43588-021-00129-5

Author manuscript; available in PMC: 2022 Mar 22.

Published in final edited form as: Nat Comput Sci. 2021 Sep 22;1(9):573–577. doi: 10.1038/s43588-021-00129-5

PMCID: PMC8560003 NIHMSID: NIHMS1749132 PMID: 34734192

Abstract

Felsenstein’s bootstrap approach is widely used to assess confidence in species relationships inferred from multiple sequence alignments. It resamples sites randomly with replacement to build alignment replicates of the same size as the original alignment and infers a phylogeny from each replicate dataset. The proportion of phylogenies recovering the same grouping of species is the bootstrap confidence limit for that grouping. However, the standard bootstrap imposes a high computational burden in applications involving long sequence alignments. Here, we introduce the bag of little bootstraps approach to phylogenetics, bootstrapping only a few little samples, each containing a small subset of sites. We report that the median bagging of bootstrap confidence limits from little samples produces confidence in inferred species relationships similar to the standard bootstrap but in a fraction of the computational time and memory. Therefore, the little bootstraps approach can potentially enhance the rigor, efficiency, and parallelization of big data phylogenomic analyses.

Editor summary:

The authors show that accurate bootstrap confidence limits on inferred evolutionary relationships of species can be estimated by bootstrapping a collection of little samples of very long sequence alignments. Little bootstraps take a fraction of computer time and memory compared to the standard bootstrap, enabling big data analytics on personal computers.

Main text

Felsenstein’s bootstrap resampling approach¹ (standard BS) is being applied to increasingly large datasets in molecular phylogenetics due to the widespread accessibility of genome sequence databases and the assembly of multispecies and multigene alignments containing hundreds of thousands of bases^2–4. These large datasets have the power to reconstruct hard-to-resolve evolutionary relationships with high confidence^4–14. However, they impose onerous computational demands because the computational complexity of phylogenomic analyses using the maximum likelihood (ML) method increases exponentially with the number of sequences and linearly with sequence length¹⁵ (Fig. 1b). Consequently, standard BS can require a large amount of computer memory and take days to complete for big datasets^4,15. Many heuristics moderate the escalation due to the increasing number of sequences^15,16, but none focus on relieving the computational burden imposed by the increase in sequence length that has followed the widespread adoption of next-generation sequencing methods.

Figure 1. (a) Steps in the little bootstrap (BS) approach. Shaded boxes represent sequence alignments, with width representing the sequence length; see main text for a detailed description and Extended Data Fig. 1 for a comparison with Felsenstein’s standard BS approach. (b) Time and memory savings per replicate of little bootstrap (open circles) compared to the standard bootstrap (closed circles) for a large simulated dataset containing 446 sequences of 50,000 to 536,534 bases. (c) The relationship of branch lengths and $\hat{B C L}$ produced by little BS with mean-bagging (orange) and median-bagging (blue) for l = L^0.7. The x-axis is restricted to the branch length of 0.04 because $\hat{B C L} = 100\%$ for longer branches. (d) The distribution of bcl_i values for 53 species groups that received $\hat{B C L} < 100\%$ in the little BS analysis with mean-bagging of the large dataset. (e) The average $\hat{B C L}$ for all the species groups connected to the phylogeny with a given cutoff branch length (x-axis). The x-axis is restricted to 0.02 because the performance does not change any further. (f) The relationship of standard BS (BCL) and little BS ( $\hat{B C L}$ ) with mean-bagging and median-bagging for datasets smaller than 10,000 sites (l = L^0.9). The gray line shows the 1:1 relationship with the standard BS. The linear regression slope is 0.97 (R² = 0.93) for median-bagging and 0.89 (R² = 0.89) for mean-bagging. (g) The distribution of little sample bcls for species groups in smaller datasets for which standard BS BCL ≥ 95% (black bars = 9,359 sites, gray bars = 7,002 sites, and white bars = 4,070 sites). (h) The true positive rates (TPR) for little BS with mean- and median-bagging compared to other phylogenomic subsampling approaches (PS and PSR with Mean and with Median) in which upsampling was not applied (see Methods).

The little bootstraps (BS) approach for phylogenomics.

Here, we introduce the bag of little bootstraps¹⁷ to place confidence limits on molecular phylogenies. In the little bootstraps approach, bootstrapping is performed independently on s little samples, each containing l sites sampled randomly (with or without replacement) from the full dataset consisting of L sites (l ≪ L). A bootstrap confidence limit for a group of sequences (bcl_i) is estimated for each little sample i by generating r bootstrap phylogenies. Each bootstrap phylogeny is inferred from a bootstrap replicate dataset that contains L sites sampled with replacement from the little sample (Fig. 1a). Because l ≪ L, the same site is selected many times (upsampling) to build the bootstrap replicate dataset in the little BS approach (Fig. 1a, see Extended Data Fig. 1). Then, the bootstrap confidence limit ( $\hat{B C L}$ ) for a given group of species is derived from the s little sample bcl values, a procedure referred to as bagging. The average of the s little sample bcl values, called mean-bagging ( $\hat{B C L} = \frac{1}{s} \sum_{i=1}^{s} \mathit{bcl}_{i}$ ), was found to work well¹⁷.
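The sampling, upsampling, and mean-bagging steps just described can be sketched as follows. This is a minimal illustration in Python using only the standard library, not the authors' R pipeline. The function names (`little_sample`, `upsampled_site_counts`, `mean_bagging`) are hypothetical, the little sample is drawn without replacement (one of the two options mentioned above), and the replicate is stored as per-site counts rather than as an alignment of L columns.

```python
import random
from collections import Counter
from statistics import mean


def little_sample(L, g, rng):
    """Pick l = round(L**g) distinct site indices from an alignment of L sites."""
    l = round(L ** g)
    return rng.sample(range(L), l)


def upsampled_site_counts(sample_sites, L, rng):
    """Draw L sites with replacement from the little sample (upsampling).

    Returns a Counter mapping site index -> number of times it was drawn,
    so the replicate has L sites in total but at most l distinct columns.
    """
    return Counter(rng.choices(sample_sites, k=L))


def mean_bagging(bcls):
    """Average the per-sample bcl_i values into a single estimate."""
    return mean(bcls)


rng = random.Random(1)          # fixed seed, for reproducibility of the sketch
L, g = 10_000, 0.7              # illustrative values, not from the paper
sites = little_sample(L, g, rng)
counts = upsampled_site_counts(sites, L, rng)
print(len(sites), sum(counts.values()), len(counts) <= len(sites))
```

Each replicate would then be passed to a tree-inference program; that step is omitted here because it depends on the software used.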

In the little BS approach, every site of the little sample is included L/l times, on average, in a bootstrap replicate dataset, so each replicate has the same number of sites as the full dataset. The upsampling has desirable asymptotic theoretical properties¹⁷ and obviates ad hoc corrections needed in other divide-and-conquer approaches¹⁸. As the computational burden of ML phylogeny estimation is proportional to the number of distinct site configurations, the time and memory requirements for analyzing a little BS replicate dataset are smaller, by a factor of order L/l, than those needed for a standard BS replicate (Fig. 1b). Kleiner et al.¹⁷ have suggested the use of little samples of size l = L^g (0.5 < g < 1.0), which can reduce time and memory by orders of magnitude. In phylogenomics, these savings can be substantial, and the requirements remain low as the length of the sequence alignment increases from thousands to millions of sites (Fig. 1b, Extended Data Fig. 2).
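As a quick arithmetic check of the relation l = L^g, the short sketch below computes the little sample size and the average reuse factor L/l for the 134,131-site simulated alignment analyzed in the Results. The helper name `little_sample_size` is hypothetical, and rounding to the nearest site is an assumption about how l is made an integer.

```python
def little_sample_size(L, g):
    """Number of sites l = L**g in a little sample, rounded to the nearest site."""
    return round(L ** g)


L = 134_131                      # sites in the simulated alignment
l = little_sample_size(L, 0.7)
reuse = L / l                    # average number of times each sampled site is drawn

print(l)                         # 3884, the little sample size quoted in the Results
print(round(reuse, 1))
```

The reuse factor also gives a rough sense of the expected per-replicate savings, since the cost of ML estimation scales with the number of distinct site configurations.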

Results

We first present ML phylogenetic analysis of a computer-simulated alignment containing 446 species and 134,131 sites (see Methods). We conducted 100 standard BS replicates, an ad hoc convention adopted in many studies to make calculations feasible¹⁹. It required 6.1 GB of memory and 13.1 CPU hours per replicate (54 CPU days of total computation). These analyses established all the true evolutionary relationships among sequences with very high confidence (BCL ≥ 95%). For this dataset, we generated ten little samples (s = 10) containing l = L^0.7 sites (3,884 sites) and analyzed ten bootstrap datasets for each little sample (r = 10). ML phylogeny inference of each little dataset required ~0.3 GB RAM and ~0.6 hours, a 95% reduction in memory and time compared to the standard BS. Now, several little BS datasets could be run concurrently on a multicore desktop with 8 GB of RAM, unlike the standard BS analyses that took up almost all the memory for estimating the ML phylogeny for one replicate dataset.

However, little BS with mean-bagging did not produce $\hat{B C L} \geq 95\%$ for 32 species groups (7.2% false negatives). These 32 species groups were connected with relatively short branches, and their confidence limits were underestimated by as much as 24% (Fig. 1c). We found that the distribution of little sample bcls was skewed (Fig. 1d), making the mean unsuitable for measuring the central tendency. We explored the use of the median because it is more resilient to outliers²⁰, and median-bagging is expected to have the same statistical properties as those established for mean-bagging¹⁷. To our knowledge, however, median-bagging has not previously been applied in the bag of little bootstraps.

Median-bagging eliminated 31 false negatives, and the remaining species group received $\hat{B C L} = 90\%$ (Fig. 1c). The average $\hat{B C L}$ at every branch length threshold was greater than 95% for median-bagging, but not for mean-bagging (Fig. 1e). We confirmed the improvement offered by median-bagging for a greater range of BCL values by analyzing three gene-specific sequence alignments (4,000 < L < 10,000; 446 species; Fig. 1f). Median-bagging performed much better because the distribution of bcls was skewed and contained many outliers for each dataset (Fig. 1g). Also, false-negative rates of phylogenomic subsampling approaches were higher when upsampling or median-bagging was not used (Fig. 1h). We found that little BS needed smaller samples of sites for empirical datasets with larger numbers of unique site configurations per sequence (C/S; Table 1; see Extended Data Fig. 3). Therefore, little BS with median-bagging achieves higher accuracy by overcoming the deficiency of mean-bagging and traditional divide-and-conquer approaches.
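A small numerical illustration shows why the median behaves differently from the mean when the bcl distribution is skewed. The bcl values below are hypothetical, chosen only to mimic a case in which most little samples give full support and a few give low support; they are not taken from the analyses reported here.

```python
from statistics import mean, median

# Hypothetical bcl_i values (percent) from s = 10 little samples for one
# species group: eight samples give full support, two are low outliers.
bcls = [100, 100, 100, 100, 100, 100, 100, 100, 70, 40]

mean_bcl = mean(bcls)      # pulled below the 95% cutoff by the two outliers
median_bcl = median(bcls)  # unaffected by the two outliers

print(mean_bcl, median_bcl)
```

With these values, mean-bagging would report the group as unsupported at the 95% cutoff, whereas median-bagging would not.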

Table 1. Performance of little bootstraps analysis for empirical datasets.

	Full Dataset			Little BS samples						Little BS Results				Little BS Resources
Species	No. of Sites (L)	No. of Seqs (S)	Unique Sites/Seq. (C/S)	Power factor (g)	s × r	Sites (l)	Uniq. Sites (c)	Uniq. Sites/Seq. (c/S)	Total Uniq. Sites (U)	Avg. $\hat{B C L}$	ΔBCL	TPR (≥70%)	TPR (≥95%)	Time (Hours)	Memory (GB)	Total Time (Hours)

Butterflies	5,267,461	61	61,684	0.700	4×10	50,714	1.3%	793	5%	100%	0.0%	100%	100%	1.37	0.38	54.8
Plants A	4,246,454	16	11,897	0.700	4×5	43,614	3.3%	389	9%	100%	0.0%	100%	100%	0.08	0.01	1.6
Insects A	3,011,544	174	11,758	0.700	6×8	34,289	1.6%	188	8%	97%	−1.8%	98%	98%	3.80	0.74	182.0
Insects B	2,938,039	48	29,092	0.747	10×12	67,401	3.8%	1,103	25%	91%	4.5%	100%	94%	3.80	0.33	546.0
Insects C	1,719,036	193	7,346	0.748	5×7	46,331	3.2%	236	15%	97%	1.1%	96%	99%	5.80	1.14	548.0
Mammals	1,391,742	37	20,962	0.700	7×9	19,976	2.3%	485	13%	98%	0.0%	97%	100%	0.32	0.11	18.9
Spiders A	137,170	27	3,071	0.800	4×8	12,877	13.4%	411	39%	94%	0.0%	96%	90%	0.08	0.04	2.5
Plants B	135,243	30	795	0.800	7×9	12,732	14.2%	113	56%	99%	−1.0%	100%	96%	0.03	0.01	1.9
Spiders B	89,212	34	1,296	0.800	9×15	9,128	14.4%	186	66%	97%	0.0%	93%	90%	0.06	0.03	14.9
Birds	61,794	39	750	0.863	6×10	13,633	26.7%	201	80%	90%	−3.3%	97%	93%	0.11	0.04	17.4

Empirical sequence alignments from a variety of species were analyzed (see Methods). L and S are the total numbers of sites and sequences, respectively, in the full dataset. C is the number of unique site configurations in the full sequence alignment. The power factor (g), the number of little samples (s), and the number of replicates per little sample (r) were selected using the automatic procedure. l = L^g is the number of sites in a little sample. c is the number of unique site configurations in a little sample. U is the number of unique site configurations found in all the little samples used for a given dataset. Avg. $\hat{B C L}$ is the average of $\hat{B C L}$ values produced by little BS for all species groupings in a phylogeny. ΔBCL is the difference between average bootstrap supports produced by the standard and little BS approaches. The true positive rate (TPR) is the percentage of species groups statistically supported by standard BS (BCL) at the given cutoff value that were also supported by the little BS analysis ( $\hat{B C L}$ ) at that cutoff value. The time and memory estimates are for one little BS dataset. The total time is for the completion of all little bootstrap replicates in a single computing thread.
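The TPR definition in the table note can be made concrete with a short sketch. The function name `true_positive_rate`, the dictionary representation of support values, and the example numbers are all hypothetical; the sketch only encodes the definition given above, namely the share of groups supported by standard BS at a cutoff that are also supported by little BS at that cutoff.

```python
def true_positive_rate(standard_bcl, little_bcl, cutoff):
    """Percentage of groups with standard BCL >= cutoff whose little-BS
    estimate also reaches the cutoff. Both dicts map a group label to a
    support value in percent."""
    supported = [grp for grp, value in standard_bcl.items() if value >= cutoff]
    if not supported:
        return float("nan")  # undefined when standard BS supports no group
    recovered = sum(1 for grp in supported if little_bcl.get(grp, 0) >= cutoff)
    return 100.0 * recovered / len(supported)


# Hypothetical support values for four species groups.
standard = {"A|B": 100, "C|D": 97, "E|F": 96, "G|H": 80}
little = {"A|B": 100, "C|D": 95, "E|F": 90, "G|H": 85}

tpr_95 = true_positive_rate(standard, little, 95)
tpr_70 = true_positive_rate(standard, little, 70)
print(tpr_95, tpr_70)
```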

For practical applications of little BS, we developed a simple, automated protocol to tune key parameters (g, s, and r; Methods). Its application to the 446×134,131 dataset confirmed all correct species groups ( $\hat{B C L} = 100\%$ ; g = 0.8, s = 4, and r = 6). We applied the automated protocol to analyze empirical sequence alignments (Table 1). Also, we generated standard errors (SEs) of $\hat{B C L}$ estimates during the little BS analysis in which little samples and replicate phylogenies were resampled with replacement (see Methods). High precision (low SEs) for $\hat{B C L}$ was achieved even when using small s and r because $\hat{B C L}$ values were generally high for most of the species groupings in long sequence alignments (Table 1; Extended Data Fig. 4).

Next, we evaluated the performance of little BS for empirical datasets. The accuracy of little BS with median-bagging was excellent in these analyses (Table 1). The true positive rate (TPR) at $\hat{B C L} \geq 95\%$ was greater than 95% for six datasets and at least 90% for the other four (Table 1). The phylogeny-wide average $\hat{B C L}$ was close to the standard BS BCL: the average difference was only 0.1%, even though each little sample contained only a fraction of the sites and required a fraction of the memory (Table 1). The computation time was in minutes to hours per little dataset (Table 1). For example, the little BS analysis of the mammalian dataset required 0.1 GB per replicate, on average, rather than 3.1 GB RAM (~29-fold memory savings) and 0.32 CPU hours rather than 9.8 CPU hours per bootstrap replicate (31-fold time efficiency). These savings enabled multiple concurrent little BS replicates on a standard multicore personal desktop equipped with modest memory (8 GB). A similar pattern was seen for the other nine empirical datasets (Table 1).

We also evaluated little BS (LBS) performance by combining it with Ultrafast bootstrap¹⁶ (UFB). UFB makes standard bootstrapping faster for a large number of sequences. For the mammalian dataset, LBS+UFB required only 50 minutes (0.2 GB RAM) on a computer with 5 cores when using ten little samples (r = 1,000, default in IQTREE^16,21). This was much faster and leaner than using only one of the optimizations: UFB alone required 4.5 hours and 7.1 GB of RAM, whereas LBS alone needed 19.8 hours and 0.1 GB of RAM. Therefore, plugging in the UFB optimization for generating sample-wise bcls further increases memory and time savings. In the future, we expect LBS to be used along with other efficient heuristics developed to speed up bootstrap calculations^15,16. One may also use Transfer Bootstrap²² when estimating confidence limits.

However, users need to ensure that sufficiently large little samples are utilized in the little BS approach. We recommend using the automatic pipeline to select the key parameters for little BS analysis (g, s, and r). In addition, it will be prudent to inspect the reported SEs and to reconfirm high $\hat{B C L}$s associated with large SEs (low precision) by conducting additional little BS analyses with a larger number of sites in little samples, more little samples, and more bootstrap replicates.

In conclusion, the little BS approach can help break the bottleneck created by the rise of large genomic datasets assembled from burgeoning sequence databases. It can enable parallelization even with modest computational resources and promote greater reproducibility and scientific rigor in building the tree of life that requires assessing the robustness of inferences to selecting biologically distinct subsets of data, choice of substitution models and strategies, and application of a myriad of ways of combining multigene datasets.

METHODS

Simulated and empirical sequence data assembly.

We analyzed multigene alignments assembled from a collection of simulated datasets analyzed previously^23,24. They were generated using an evolutionary tree of 446 species and a wide range of biologically realistic parameter values derived from hundreds of empirical gene sequence alignments, including sequence length (445 – 4,439 bases), G+C content (39 – 82%), transition/transversion rate ratio (1.9 – 6.0), and gene-wise evolutionary rates (1.35 – 2.60×10⁻⁶ per site per billion years)²³. Evolutionary rates were also heterogeneous across lineages, simulated for each gene independently under autocorrelated and uncorrelated rate models^23,24. Simulated alignments of 100 genes that evolved with the autocorrelated rate model were concatenated to form the 446×134,131 (species × bases) dataset. A bigger 446×536,524 sequence alignment was generated by concatenating 100 randomly selected gene alignments from each of the four different lineage rate variation models simulated²⁴. Three smaller datasets were analyzed, corresponding to individual simulated genes: 446×4,070, 446×7,002, and 446×9,359 bases.

Ten empirical datasets were also analyzed. These DNA alignments consisted of sequences from eutherian mammals¹⁴, butterflies⁷, plants (A⁶ and B¹⁰), insects (A¹¹, B¹², and C⁵), spiders (A⁹ and B⁸), and birds¹³ (Table 1). The number of species ranged from 16 to 193, and the number of sites ranged from 61,794 to 5,267,461. We used the phylogenetic trees (ML trees) presented in the original studies as the reference trees for empirical datasets. The ground truth for little BS confidence limits was the set of standard BS confidence limits reported in the published articles.

Standard and little BS analyses.

We used the IQTREE software²¹ with a general time-reversible nucleotide substitution model with gamma-distributed rate variation (GTR+Γ) and default ML search parameters. One hundred replicates of standard bootstrap analyses were conducted to generate BCLs, all of which were very high for large datasets analyzed. For three single-gene datasets, 1,000 bootstrap replicates were used to generate stable BCLs. The confidence limits obtained using the standard bootstrap analyses were the ground truth in our analyses, as the bag of little bootstraps is being investigated as a computationally efficient alternative. The true tree used in computer simulations was the reference in the analysis of simulated datasets. The bootstrap confidence limits presented in the published phylogenies were used as references for the empirical datasets analyzed. The parameters of little BS analyses for these datasets were selected using the protocol presented below. We also applied the Ultrafast bootstrap (UFB)¹⁶ on the mammal dataset using the (GTR+Γ) model in IQTREE with the default option of 1,000 replicates. For little bootstrap analysis, the UFB with the same options was carried out for each little dataset directly to estimate the required time and memory. These are approximate estimates because IQTREE does not have a provision for upsampling when generating bootstrap replicate datasets. The reported estimates are expected to be very close to the actual time estimates because IQTREE compresses identical site configurations during ML calculations, and upsampling only alters site configurations’ frequencies.

Automatic selection of the little BS parameters.

Our procedure automatically determines the size of the sample (g), the number of samples (s), and the number of bootstrap replicates (r). The procedure starts with g = 0.7 if the sequence alignment contains ≥ 100,000 unique site configurations (such that l < 50,000); otherwise, we set g = 0.8. One may set any starting or fixed value of g. In step 1, we conduct little BS with s = 3 and r = 3 to generate initial $\hat{B C L}$ for all the nodes in the given phylogeny (if provided) or from a majority rule bootstrap consensus tree. Using these values, we generate the average $\hat{B C L}$ (Av) and the fraction of inferred tree partitions with $\hat{B C L} \geq 95\%$ (Nv). Through an iterative process, we stabilize and maximize both Av and Nv, as follows. In step 2, we add one little BS replicate to each subsample (i.e., r increases by 1) and then compute Av. We repeat steps 2 and 3 by increasing r until the difference in successive Av values is less than 0.1% (or a user-specified threshold, δr). In step 4, we increase s by one, generate r additional replicate datasets and phylogenies, and compute Av and Nv. If the difference between Av for the current (s) and the previous (s-1) set of subsamples is greater than 1% (or a user-specified δs), then we repeat step 4. In step 5, we check whether Nv is less than 100% or the SE of estimated $\hat{B C L} \geq 95\%$ is too high (>5%). If so, we increase the little subsample size by l and restart the analysis from step 2. In step 6, we go to step 4 if the user-specified precision (SE) has not been achieved.
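The first part of this procedure, namely the starting value of g and the loop that increases r until Av stabilizes, can be sketched as follows. This is only an illustrative outline under stated assumptions and does not cover the later steps that adjust s, the sample size, or the SE check. The names `initial_power_factor` and `stabilize_replicates` are hypothetical, `average_bcl` stands in for a full little BS run supplied by the caller, and the `max_r` safety cap is an addition that is not part of the described protocol.

```python
def initial_power_factor(n_unique_configs):
    """Starting g: 0.7 for alignments with >= 100,000 unique site
    configurations, 0.8 otherwise."""
    return 0.7 if n_unique_configs >= 100_000 else 0.8


def stabilize_replicates(average_bcl, s=3, r=3, delta_r=0.1, max_r=100):
    """Increase r by one until successive Av values differ by < delta_r.

    `average_bcl(s, r)` is a caller-supplied (hypothetical) function that
    runs little BS with s samples and r replicates per sample and returns
    Av in percent. `max_r` is a safety cap assumed for this sketch.
    """
    previous = average_bcl(s, r)
    while r < max_r:
        r += 1
        current = average_bcl(s, r)
        if abs(current - previous) < delta_r:
            return r, current
        previous = current
    return r, previous


# Toy stand-in for a little BS run: Av approaches 100% as r grows.
def toy_average_bcl(s, r):
    return 100 - 8 / r


chosen_r, final_av = stabilize_replicates(toy_average_bcl)
print(chosen_r, final_av)
```

In practice each call to `average_bcl` would reuse the phylogenies already inferred and add only the new replicates, so the loop need not repeat earlier work.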

Estimating standard errors of $\hat{B C L}$s.

Given r bootstrap replicate phylogenies for each of s samples, we employ a bootstrap procedure to generate the SE of $\hat{B C L}$. We use the already computed phylogenies of the r × s little BS replicates and derive $\hat{B C L}$ for all the nodes from collections of phylogenies obtained by resampling the s samples with replacement and, every time a subsample is selected, resampling its r replicates with replacement. This process is repeated 100 times, and the standard deviation of each tree partition’s $\hat{B C L}$ is used as the estimate of its SE. This process is extremely fast because precomputed phylogenies are used.
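The two-level resampling described above can be sketched for a single tree partition as follows. This is an illustrative outline, not the authors' R function. It assumes that the precomputed phylogenies have already been reduced to a table `support[i][j]` that is 1 if the partition appears in replicate tree j of little sample i and 0 otherwise, and that median-bagging is used to combine samples. The names `bcl_hat` and `bcl_standard_error` are hypothetical.

```python
import random
from statistics import median, stdev


def bcl_hat(support):
    """Median-bagged estimate (percent) from support[i][j] in {0, 1}."""
    return median(100.0 * sum(row) / len(row) for row in support)


def bcl_standard_error(support, n_boot=100, seed=0):
    """Resample little samples, then replicates within each chosen sample,
    both with replacement, and return the standard deviation of the
    resulting estimates."""
    rng = random.Random(seed)
    s = len(support)
    estimates = []
    for _ in range(n_boot):
        resampled = []
        for _ in range(s):
            row = rng.choice(support)                       # pick a little sample
            resampled.append(rng.choices(row, k=len(row)))  # pick its replicates
        estimates.append(bcl_hat(resampled))
    return stdev(estimates)


# A partition found in every replicate tree has no resampling variance.
always_present = [[1] * 10 for _ in range(4)]
print(bcl_standard_error(always_present))
```

Because only the precomputed presence/absence table is resampled, no new tree inference is needed, which is why the procedure is fast.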

Phylogenomic subsampling approaches without upsampling.

We also generated $\hat{B C L}$ values by a little BS procedure in which upsampling was replaced by the standard BS resampling such that the replicate datasets contained only l sites rather than L sites. We refer to this as the Phylogenomic Subsampling with Resampling (PSR) approach. For PSR, one may use either mean- or median-bagging. We also generated $\hat{B C L}$s without any resampling or upsampling (i.e., r = 0) such that the ML phylogenies were inferred from s subsample datasets containing l sites each. We call it the Phylogenomic Subsampling (PS) approach. We compared the true positive rates ( $\hat{B C L} \geq 95\%$ ) of the little BS, PSR, and PS approaches for the computer-simulated 446×134,131 dataset (g = 0.7). For all analyses, 100 replicate phylogenies were generated by using s = 10 and r = 10 for little BS and PSR, and s = 100 for the PS approach.

Analysis pipeline for little BS.

We developed an R²⁵ pipeline to conduct little bootstraps analysis by using IQTREE. In this case, we used the Biostrings²⁶ package to generate little datasets of the specified lengths (l) and then bootstrap replicate datasets in which L sites were resampled with replacement from l sites. The resulting datasets were used to obtain ML phylogenies that were summarized by using the function plotBS from the phangorn²⁷ library that produced the bcl for each of the phylogenetic groups in the standard bootstrap phylogeny. Mean- and median-bagging estimates were obtained from sample-wise bcls from s little samples using a customized function in R. We used ten samples and ten bootstrap replicates for little bootstraps analysis for concatenated gene datasets, and 50 little samples and 20 bootstrap replicates for single-gene datasets. We applied the automated protocol using a customized R function. We also developed a customized R function for estimating SEs of $\hat{B C L}$s.

Data availability:

All simulated DNA sequence alignments containing 446 taxa were obtained from published research articles^23,24. Ten empirical datasets from a variety of species have been analyzed. These DNA sequence alignments consisted of sequences from eutherian mammals¹⁴, butterflies⁷, plants (A⁶ and B¹⁰), insects (A¹¹, B¹², and C⁵), spiders (A⁹ and B⁸), and birds¹³. All empirical and simulated datasets analyzed in this article are available in an online repository³⁰.

Code availability:

R codes are available from https://github.com/ssharma2712/Little-Bootstraps. A capsule containing source codes and datasets for our analyses is available on the CodeOcean service²⁸. Users can replicate the little bootstraps sampling and bagging steps in this capsule.

Extended Data

Extended Data Fig. 3. The relationship of the number of unique site configurations per sequence (C/S, log-transformed) and little sample size selected (power parameter, g) (R² = 0.76).

Extended Data Fig. 4. The relationship between little $\hat{B C L}$s and their precision (standard errors, SEs) for the selected little BS parameters. The SEs are inversely related to little bootstrap confidence limits (R² = 0.59).

Supplementary Material

1749132_sd_ed_fig2.

NIHMS1749132-supplement-1749132_sd_ed_fig2_.zip^{(1.3KB, zip)}

1749132_sd_1

NIHMS1749132-supplement-1749132_sd_1.zip^{(197.1MB, zip)}

1749132_sd_ed_fig3

NIHMS1749132-supplement-1749132_sd_ed_fig3.zip^{(7.5KB, zip)}

1749132_sd_ed_fig4

NIHMS1749132-supplement-1749132_sd_ed_fig4.zip^{(3MB, zip)}

Acknowledgments

We thank Sara Vahdatshoar and Julia Davis for their help with computational analysis. We thank Drs. Jack Craig, Qiqing Tao, Marcos Caraballo-Ortiz, Antonia Chroni, Cristian Palacios, Sergei L. K. Pond, and S. Blair Hedges for providing critical comments on the manuscript. This research was supported by a grant from the U.S. National Institutes of Health to S.K. (GM139540-01).

Footnotes

Competing Interests Statement

The authors declare that they have no competing interests.

Editor recognition statement:

Handling editor: Ananya Rastogi, in collaboration with the Nature Computational Science team.

Reviewer recognition statement:

Nature Computational Science thanks Alexandros Stamatakis and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

References:

1. Felsenstein J. Confidence Limits on Phylogenies: An approach Using the Bootstrap. Evolution 39, 783–791 (1985).
2. Kumar S & Filipski A. Multiple sequence alignment: In pursuit of homologous DNA positions. Genome Res. 17, 127–135 (2007).
3. Kumar S, Filipski AJ, Battistuzzi FU, Kosakovsky Pond SL & Tamura K. Statistics and truth in phylogenomics. Mol. Biol. Evol. 29, 457–472 (2012).
4. Kapli P, Yang Z & Telford MJ. Phylogenetic tree building in the genomic age. Nat. Rev. Genet. 21, 428–444 (2020).
5. Johnson KP et al. Phylogenomics and the evolution of hemipteroid insects. Proc. Natl. Acad. Sci. U. S. A. 115, 12775–12780 (2018).
6. Ran JH, Shen TT, Wu H, Gong X & Wang XQ. Phylogeny and evolutionary history of Pinaceae updated by transcriptomic analysis. Mol. Phylogenet. Evol. 129, 106–116 (2018).
7. Allio R et al. Whole genome shotgun phylogenomics resolves the pattern and timing of swallowtail butterfly evolution. Syst. Biol. 69, 38–60 (2020).
8. Hedin M, Derkarabetian S, Alfaro A, Ramírez MJ & Bond JE. Phylogenomic analysis and revised classification of atypoid mygalomorph spiders (Araneae, Mygalomorphae), with notes on arachnid ultraconserved element loci. PeerJ 7, e6864 (2019).
9. Kuntner M et al. Golden Orbweavers Ignore Biological Rules: Phylogenomic and Comparative Analyses Unravel a Complex Evolution of Sexual Size Dimorphism. Syst. Biol. 68, 555–572 (2019).
10. Pessoa-Filho M, Martins AM & Ferreira ME. Molecular dating of phylogenetic divergence between Urochloa species based on complete chloroplast genomes. BMC Genomics 18, 1–14 (2017).
11. Peters RS et al. Evolutionary History of the Hymenoptera. Curr. Biol. 27, 1013–1018 (2017).
12. Peters RS et al. Transcriptome sequence-based phylogeny of chalcidoid wasps (Hymenoptera: Chalcidoidea) reveals a history of rapid radiations, convergence, and evolutionary success. Mol. Phylogenet. Evol. 120, 286–296 (2018).
13. Yonezawa T et al. Phylogenomics and Morphology of Extinct Paleognaths Reveal the Origin and Evolution of the Ratites. (2017) doi: 10.1016/j.cub.2016.10.029.
14. Song S, Liu L, Edwards SV & Wu S. Resolving conflict in eutherian mammal phylogeny using phylogenomics and the multispecies coalescent model. Proc. Natl. Acad. Sci. U. S. A. 109, 14942–14947 (2012).
15. Stamatakis A, Hoover P & Rougemont J. A rapid bootstrap algorithm for the RAxML web servers. Syst. Biol. 57, 758–771 (2008).
16. Minh BQ, Nguyen MAT & Von Haeseler A. Ultrafast approximation for phylogenetic bootstrap. Mol. Biol. Evol. 30, 1188–1195 (2013).
17. Kleiner A, Talwalkar A, Sarkar P & Jordan MI. A scalable bootstrap for massive data. J. R. Stat. Soc. Ser. B Stat. Methodol. 76, 795–816 (2014).
18. Seo T-K. Calculating Bootstrap Probabilities of Phylogeny Using Multilocus Sequence Data. Mol. Biol. Evol. 25, 960–971 (2008).
19. Pattengale ND, Alipour M, Bininda-Emonds ORP, Moret BME & Stamatakis A. How many bootstrap replicates are necessary? J. Comput. Biol. 17, 337–354 (2010).
20. Leys C, Ley C, Klein O, Bernard P & Licata L. Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median. J. Exp. Soc. Psychol. 49, 764–766 (2013).
21. Nguyen LT, Schmidt HA, Von Haeseler A & Minh BQ. IQ-TREE: A fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 32, 268–274 (2015).
22. Lemoine F et al. Renewing Felsenstein’s phylogenetic bootstrap in the era of big data. Nature 556, 452–456 (2018).
23. Rosenberg MS & Kumar S. Heterogeneity of nucleotide frequencies among evolutionary lineages and phylogenetic inference. Mol. Biol. Evol. 20, 610–621 (2003).
24. Tamura K et al. Estimating divergence times in large molecular phylogenies. Proc. Natl. Acad. Sci. U. S. A. 109, 19333–19338 (2012).
25. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. (2020).
26. Pagès H, Aboyoun P, Gentleman R & DebRoy S. Biostrings: Efficient manipulation of biological strings. R package version 2.46.0 (2017).
27. Schliep KP. phangorn: Phylogenetic analysis in R. Bioinformatics 27, 592–593 (2011).
28. Sharma S & Kumar S. Fast and accurate bootstrap confidence limits on genome-scale phylogenies using little bootstraps (CodeOcean, 2021). doi: 10.24433/CO.6432188.v1.
29. Efron B. Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation. J. Am. Stat. Assoc. 78, 316 (1983).
30. Sharma S & Kumar S. Fast and accurate bootstrap confidence limits on genome-scale phylogenies using little bootstraps (figshare data repository: 10.6084/m9.figshare.14130494).

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.
