Quantifying the uncertainty of assembly-free genome-wide distance estimates and phylogenetic relationships using subsampling

Eleonora Rachtman; Shahab Sarmashghi; Vineet Bafna; Siavash Mirarab

doi:10.1016/j.cels.2022.06.007

Author manuscript; available in PMC: 2023 Oct 19.

Published in final edited form as: Cell Syst. 2022 Oct 19;13(10):817–829.e3. doi: 10.1016/j.cels.2022.06.007

Eleonora Rachtman ¹, Shahab Sarmashghi ², Vineet Bafna ³, Siavash Mirarab ^2,^4,^*

PMCID: PMC9589918 NIHMSID: NIHMS1824003 PMID: 36265468

Abstract

Computing the distance between two genomes without alignments or even access to assemblies enables many downstream analyses. However, alignment-free methods, including in the fast-growing field of genome skimming, are hampered by a significant methodological gap. While accurate methods (many k-mer-based) for assembly-free distance calculation exist, measuring the uncertainty of estimated distances has not been sufficiently studied. In this paper, we show that bootstrapping, the standard non-parametric method of measuring estimator uncertainty, is not accurate for k-mer-based methods that rely on k-mer frequency profiles. Instead, we propose using subsampling (with no replacement) in combination with a correction step to reduce the variance of the inferred distribution. We show that the distribution of distances using our procedure matches the true uncertainty of the estimator. The resulting phylogenetic support values effectively differentiate between correct and incorrect branches and identify controversial branches that change across alignment-free and alignment-based phylogenies reported in the literature.

Keywords: Genomic distance, Uncertainty quantification, Genome skimming, Phylogenetic uncertainty, Statistical branch support, Data subsampling, Assembly-free and alignment-free distance calculation, k-mer-based distance calculation

In Brief

Usability and interpretability of evolutionary trees constructed using alignment-free and assembly-free methods is limited due to a lack of robust approaches for uncertainty estimation. The standard bootstrapping method (sampling with replacement) cannot be used because it violates the assumptions of estimators. As an alternative, Rachtman et al. propose to use subsampling (without replacement) in combination with a correction to account for the increased variance of subsampled data, evaluate the proposed procedure in multiple experiments that demonstrate its accuracy, and apply it to a range of low coverage genome skimming data.

Introduction

Phylogenetic and population genetic analyses using assembly-free and alignment-free methods have enjoyed renewed interest in recent years. Alignment-based methods using orthologous loci have traditionally been considered more accurate than alignment-free methods for phylogenetics (Höhl and Ragan, 2007; Bogusz and Whelan, 2016), and this perception has not changed (justifiably, in our judgment). However, the main driver of recent interest in assembly-free and alignment-free methods is the increased use of data types that do not lend themselves to assembly, in particular, in the study of biodiversity at high precision (Bohmann et al., 2020). Biologists now routinely use low-coverage shotgun sequencing data, called genome skims (Weitemier et al., 2014; Coissac et al., 2016), obtained across a large number of samples to study biodiversity and ecology at high levels of taxonomic resolution accurately and relatively cheaply. A simple search of the term “genome skimming” reveals that more than 950 papers have been written on the subject since 2020. While the traditional use of skims relied on the assembly of organelle genomes (a tiny fraction of the reads), there has been an increasing recognition that better accuracy can be obtained in an assembly-free and alignment-free fashion by analyzing nuclear data (Sarmashghi et al., 2019). Assembly has to be avoided because of the low coverage (e.g., 1x), and alignment is impossible when genome skims are compared to a reference library of other genome skims (and not closely related genomes).

Alignment-free phylogenetics has a long history of method development (Wu et al., 2009; Jun et al., 2010; Haubold, 2014; Bogusz and Whelan, 2016; Leimeister and Morgenstern, 2014; Leimeister et al., 2017) and benchmarking (Höhl and Ragan, 2007; Zielezinski et al., 2019). Many of these methods require the assembly of sequences instead of working with bags of reads. However, assembly-free methods have been developed for inferring phylogenies (Yi and Jin, 2013; Fan et al., 2015; Allman et al., 2017) and for estimating genomic distances between two bags of reads (e.g., Ondov et al., 2016; Sarmashghi et al., 2019; Lau et al., 2019; Tang et al., 2019), which can then be used with standard distance-based phylogenetic estimation methods to infer a tree or to place a sample on an existing phylogeny (Balaban et al., 2020, 2022b). In particular, some of these methods, such as Skmer (Sarmashghi et al., 2019) and Afann (Tang et al., 2019), are specifically designed for datasets with shallow sequencing depth.

A major obstacle to the widespread adoption of assembly-free methods for downstream analyses such as phylogenetics is the lack of reliable ways to measure the statistical support of inferred trees. The method of Wittler (2020) automatically assigns a weight to each split it infers; however, these weights are not interpretable as the probability of correctness and can only be used for a specific distance estimation method. Lack of support makes it hard to interpret phylogenetic trees, as the (inevitable) differences between inferred trees cannot be put in context without knowing if those differences have high support. Moreover, many downstream applications attempt to sum over tree uncertainty (Holder and Lewis, 2003), or at least, contract low support branches of the tree into polytomies (Zhang et al., 2018; Simmons and Gatesy, 2021). Finally, in some applications (such as interrogating polytomies, within-species diversification, and delimitation of subspecies), whether one can resolve a relationship or not is the question of interest (Maddison, 1989; Townsend et al., 2012), and that question cannot be answered without statistical support.

While a robust set of tools for support estimation exists for alignment-based distance calculation and phylogenetics, much less is known about statistical support for assembly-free methods. Outside Bayesian methods, bootstrapping (i.e., resampling, or sampling with replacement) of alignment sites is the main method used for uncertainty quantification, including in phylogenetics, where it has been long used (Felsenstein, 1985) and debated (e.g., Felsenstein and Kishino, 1993; Hillis and Bull, 1993; Susko, 2009; Salichos and Rokas, 2013). While many biologists interpret support as the probability of correctness, bootstrap is more precisely interpreted as a way to measure the variance around an estimator. Regardless of the precise meaning, in alignment-free settings, it is unclear what needs to be resampled. We can resample reads or smaller units such as k-mers or spaced words used by most distance estimators. However, assumptions of independently and identically distributed (i.i.d.) data units, which are somewhat reasonable for sites in an alignment, may not apply to such units of data. Moreover, assembly-free methods often make assumptions about a random sampling of reads across the genome, and resampling can invalidate those assumptions.

The non-parametric statistics literature has established that subsampling can provide a powerful alternative to resampling, with fewer strong assumptions and better generalizations (Politis et al., 1999). Because a subsample of a shotgun sequenced set of reads follows the same distribution as the original sample, albeit with lower coverage, it provides an attractive alternative to bootstrapping. The lower coverage samples obtained with subsampling will clearly lead to more variable distances than the original data. This inflation of variance, however, can be corrected using methods known in the non-parametric statistics literature (Politis et al., 1999).

In this paper, we introduce and carefully validate methods for estimating uncertainty around distances computed directly from genomes and genome skims using assembly/alignment-free methods. We propose using subsampling combined with a correction for increased variance instead of resampling. This procedure computes a distribution of distances for each pair of genomes or genome skims, as opposed to a single distance. From there, computing branch support for phylogenetic trees using distance-based methods follows the standard approach. To be more precise about the input and output, let us define them more formally.

Given are two genome skims (i.e., two bags of reads), generated at low coverage (e.g., 2X), from two genomes. The two genomes have evolved from a shared common ancestor through an evolutionary process parameterized by the true genomic distance d. The simplest such model is applying substitutions according to the Jukes and Cantor (1969) (JC) model along two branches of total length $-\frac{3}{4} \ln (1 - \frac{4}{3} d)$. The two genomes are then subsampled at random positions using short reads with sequencing errors. Thus, starting from a fixed common ancestor, there are two sources of randomness: the evolutionary process (e.g., substitutions) and the genome sequencing procedure (e.g., random sampling of reads and sequencing errors). The input can also be two assemblies, where only the first source of randomness exists.
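As a small numerical aid, the relation between a genomic distance and the corresponding total branch length under the standard Jukes-Cantor model can be evaluated directly. The sketch below interprets d as the expected fraction of differing sites; the function names are ours (hypothetical) and are not part of Skmer.

```python
import math

def jc_branch_length(d):
    """Total JC69 branch length (substitutions per site) for a genomic distance d."""
    if not 0 <= d < 0.75:
        raise ValueError("JC69 is only defined for 0 <= d < 3/4")
    return -0.75 * math.log(1.0 - 4.0 * d / 3.0)

def jc_distance(branch_length):
    """Inverse transform: expected fraction of differing sites for a branch length."""
    return 0.75 * (1.0 - math.exp(-4.0 * branch_length / 3.0))

if __name__ == "__main__":
    # Distances in the range discussed in the text; values are printed, not asserted.
    for d in (0.01, 0.05, 0.15, 0.25):
        print(d, round(jc_branch_length(d), 4))
```

For small d the two quantities nearly coincide, and they diverge as d grows because multiple substitutions at the same site become more likely.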

We have access to an estimator $\hat{d}$ of d given two genomes or genome skims. If the estimator were to be applied to (unavailable) data generated by a procedure identical to what generated our data (e.g., in a simulation), the resulting estimates would be draws from some unknown distribution $D$. Our goal is to estimate the uncertainty of the estimate $\hat{d}$ by generating m estimates ${\hat{d}}_{1} \dots {\hat{d}}_{m}$ drawn (approximately) from $D$. Thus, while we have one pair of genome skims, we seek to approximate the distribution of estimates had we access to an infinite number of genome skim pairs generated identically from the same common ancestor.

For the estimator of d, in this paper, we focus on a leading assembly-free distance calculation method, Skmer, noting that our procedure is general and is applicable to any estimator. Skmer has two components: estimating sequencing parameters for each input skim and estimating shared k-mers between two given skims (see STAR Methods). The sequencing parameters are estimated by matching the k-mer frequencies against the Poisson distribution distorted by sequencing errors (see (3) in STAR Methods). Then the portion of shared k-mers, the Jaccard index, is computed using the min-hash technique of Mash (Ondov et al., 2016) (k = 31 in both components). These results are then combined into a final estimate using the analytical equation (4) given in STAR Methods.

In simulations and on a set of real datasets, we show that our subsampling procedure paired with Skmer produces reliable support values from both genome and genome skims. We evaluate the method under conditions with model misspecification and show that while support is not always fully calibrated, it is predictive in distinguishing correct and incorrect branches.

Results

Subsampling Procedure

Subsampling: justification and theory

Why not bootstrapping?

While the bootstrapping method of Efron (1979) provides a statistically consistent approximation of $D$, some of its assumptions, such as independence of data points, are violated in our setting. Moreover, bootstrapping breaks the assumptions of the estimator. The most salient problem is non-random coverage. Many assembly-free methods, including Skmer, assume that reads are distributed randomly throughout the genome and thus model the number of times each position is sampled using a Poisson distribution. Once observed reads or k-mers are resampled, this Poisson assumption is broken (Fig. 1A). Fan et al. (2015) attempted to extend the bootstrapping procedure to account for some of the dependencies in assembly-free settings by using block bootstrapping. However, their method uses resampling and is not appropriate for estimators that rely on random coverage of the genome. In empirical analyses, we find that bootstrapping leads to wildly inaccurate and biased estimates (Fig. 1B).

Figure 1: A) Theoretical model. Consider a set S of N = 10⁵ objects (e.g., k-mers of a genome). We show the empirical distribution of the number of times an object is observed (i.e., k-mer frequencies) if S is sampled uniformly at random N times with replacement to get R; this process is similar to sequencing a genome at 1X coverage. The distribution closely matches Poisson with λ = 1. Next, we again resample R (with replacement) N times, which is equivalent to bootstrapping reads or k-mers, getting a distribution that strongly deviates from Poisson (e.g., too many 0 counts, too few 1 counts). PMF denotes the probability mass function. B) Comparisons on a 200MB Drosophila skimming dataset between subsampling $n^{9/10}$ reads and bootstrapping n reads, where n ≈ 10⁶ is the number of reads. In both cases, 100 replicates were generated, and distances were computed using Skmer. Graph shows all pairwise distance values between *Drosophila ananassae* and other Drosophila species. *Subsampling corrected* distances are using our method, as discussed in the text.
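The theoretical model of Fig. 1A is simple enough to reproduce with a few lines of standard-library Python. The following is a minimal sketch, not the code used to draw the figure: the function names, the seed, and the choice to report counts 0 through 4 are our own illustrative assumptions.

```python
import math
import random
from collections import Counter

def count_pmf(samples, universe_size, max_count=4):
    """Fraction of all objects that are observed exactly c times, for c = 0..max_count."""
    per_object = Counter(samples)              # object -> number of times observed
    histogram = Counter(per_object.values())   # count -> number of objects with that count
    histogram[0] = universe_size - len(per_object)
    return [histogram[c] / universe_size for c in range(max_count + 1)]

def poisson_pmf(c, lam=1.0):
    return math.exp(-lam) * lam ** c / math.factorial(c)

def simulate(n_objects=100_000, seed=0):
    rng = random.Random(seed)
    # R: sample the N objects N times with replacement (akin to 1X coverage).
    first = rng.choices(range(n_objects), k=n_objects)
    # Bootstrap: resample R itself N times with replacement.
    boot = rng.choices(first, k=n_objects)
    return count_pmf(first, n_objects), count_pmf(boot, n_objects)

if __name__ == "__main__":
    sampled, bootstrapped = simulate()
    for c in range(5):
        print(c, round(poisson_pmf(c), 3), round(sampled[c], 3), round(bootstrapped[c], 3))
```

The first column of empirical frequencies should track Poisson with λ = 1, while the bootstrapped column shows an excess of zero counts and a deficit of one counts, which is the deviation that breaks the coverage model.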

Why Subsampling?

An alternative to bootstrapping is subsampling without replacement: Given a set of data points, subsample them at random, apply the estimator to the subsample, and repeat to get a distribution of estimates. As detailed by Politis et al. (1999), subsampling is a sound way of measuring estimator uncertainty and crucially depends on fewer assumptions than bootstrapping while often providing comparable power. A subsample of the reads (or k-mers) is equivalent to another genome skim with lower coverage and does not violate the assumptions of Skmer, designed explicitly for low coverages. Thus, subsampling will generate unbiased estimates (Fig. 1B). However, despite being unbiased, subsampling leads to incorrect variance. Thus, the distribution of estimates from subsamples needs to be corrected so that it asymptotically matches the correct distribution, $D$. To see the need for this correction, note that the variance of the estimator increases as the amount of data decreases. Thus, the subsampled estimates will have a higher variance than $D$, and the overestimation of variance will depend on the size of the subsamples. If not appropriately corrected, the arbitrary size of subsamples will determine the variance, which is clearly undesirable.

We base our method on what we call the subsampling theorem for short (Theorem 2.2.1 in Politis et al., 1999). Let ${\hat{θ}}_{n}$ be a statistically consistent estimate of a parameter θ based on n data points (no assumption of independence is necessary). Assume that there exists some sequence of numbers $τ_{n}$ such that $τ_{n} ({\hat{θ}}_{n} - θ)$ weakly converges to some distribution as n → ∞ (assumption 2.2.1 of Politis et al., 1999). The $τ_{n}$ factor can be informally considered the rate of convergence of the estimator. Similarly, let ${\hat{θ}}_{b}$ be the estimate from a subsample of b data points sampled from the original n. To paraphrase informally, according to the subsampling theorem, under very general regularity conditions, the empirical cumulative distribution function of many $τ_{b} ({\hat{θ}}_{b} - {\hat{θ}}_{n})$ estimates converges to the cumulative distribution function (CDF) of $τ_{n} ({\hat{θ}}_{n} - θ)$ as n → ∞, assuming that b → ∞ and b / n → 0. In other words, to approximate the distribution of (unobtainable) estimates centered by the true value, we can examine the observable distribution of subsampled estimates centered around the main estimate, scaled by $τ_{b} / τ_{n}$.

Subsampling Procedure

We use reads (for a genome skim) or k-mers (for an assembly) as the atomic data units that are subsampled, and let n denote the number of such units. To use the subsampling theorem, we need the appropriate choice of $τ_{n}$. By the central limit theorem (CLT), given i.i.d. random samples drawn from a population with mean μ and variance σ², the quantity $\sqrt{n} (({\bar{X}}_{n} - μ) / σ)$ converges in distribution to a standard normal distribution as n → ∞, where ${\bar{X}}_{n}$ is the sample mean. Therefore, for any estimator $\hat{θ}$ that can be written as the mean of some random variables, $\sqrt{n} (\hat{θ} - θ)$ converges to a Gaussian distribution, making $τ_{n} = \sqrt{n}$ the correct normalizing factor.
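The role of the $\sqrt{b / n}$ factor can be checked on the simplest estimator covered by the CLT argument, the sample mean. The toy sketch below is ours and makes several assumptions for illustration: the data are i.i.d. exponential draws, the function names are hypothetical, and α = 0.5 is used (rather than a value closer to 1) so that b / n is already small at this toy sample size.

```python
import math
import random
import statistics

def subsample_means(data, alpha, m, rng):
    """Draw m subsamples of size b = n^alpha without replacement; return b and the means."""
    b = int(len(data) ** alpha)
    return b, [statistics.fmean(rng.sample(data, b)) for _ in range(m)]

def corrected(estimates, main_estimate, b, n):
    """Center subsampled estimates, shrink by sqrt(b/n), re-center on the main estimate."""
    center = statistics.fmean(estimates)
    scale = math.sqrt(b / n)
    return [scale * (e - center) + main_estimate for e in estimates]

def demo(n=10_000, alpha=0.5, m=400, seed=0):
    rng = random.Random(seed)
    data = [rng.expovariate(1.0) for _ in range(n)]
    main = statistics.fmean(data)
    b, raw = subsample_means(data, alpha, m, rng)
    fixed = corrected(raw, main, b, n)
    target = statistics.variance(data) / n  # CLT variance of the full-sample mean
    return statistics.variance(raw), statistics.variance(fixed), target

if __name__ == "__main__":
    raw_var, fixed_var, target = demo()
    print("uncorrected:", raw_var, "corrected:", fixed_var, "target:", target)
```

Without the correction, the spread of the subsampled means reflects the subsample size b rather than n; after scaling, it should be close to the variance expected for the full sample.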

While the coverage computation step of Skmer uses a complex estimator that cannot easily be described as a mean of random variables, the Jaccard calculation can be approximately described as such. For each k-mer, consider a binary random variable X_i indicating whether it is shared between the two genomes, and note Pr[X_i = 1] = J. Every k-mer can be considered a random sample from this distribution. Then, the Jaccard computed from L k-mers is $\hat{J} = \sum_{i} X_{i} / L$. Thus, ignoring the dependence between k-mers, our Jaccard estimates follow the CLT and admit the $τ_{n} = \sqrt{n}$ correction. We will use $τ_{n} = \sqrt{n}$, admittedly ignoring the first part of the Skmer procedure in deriving this correction factor (more on this later).
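Under the same simplification (independent indicators X_i, which ignores the overlap between adjacent k-mers), and writing $\hat{J} = \sum_{i} X_{i} / L$ for the Jaccard estimate from L k-mers, the moments behind this choice can be written out explicitly:

$$
\mathbb{E}[\hat{J}] = J, \qquad \mathrm{Var}(\hat{J}) = \frac{J (1 - J)}{L}, \qquad \sqrt{L}\, (\hat{J} - J) \xrightarrow{d} \mathcal{N}\big(0,\; J (1 - J)\big).
$$

The standard deviation of the estimate therefore shrinks as $1 / \sqrt{L}$, which is the square-root rate assumed by the correction.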

Algorithm 1. Subsampling procedure.

Input: a set of skims $\{S_{1} \dots S_{N}\}$ with n₁ … n_N reads, a fixed parameter α < 1, and a tool (e.g., Skmer) $f (\{S_{1} \dots S_{N}\})$ to compute all pairwise distances.

${\hat{d}}_{i j} \leftarrow f (\{S_{1} \dots S_{N}\})$    ▷ Run a method like Skmer to compute all pairwise distances
for r ∈ 1…m do
    for i ∈ 1…N do
        $b_{i} \leftarrow {(n_{i})}^{α}$
        $S_{i}^{r} \leftarrow$ a random subsample of size b_i of S_i
    $d_{i j}^{r} \leftarrow f (\{S_{1}^{r} \dots S_{N}^{r}\})$    ▷ Compute distances from subsamples
$y_{i j}^{r} = \sqrt{\frac{b_{i} + b_{j}}{n_{i} + n_{j}}} (d_{i j}^{r} - \bar{d_{i j}}) + {\hat{d}}_{i j}$    ▷ First correction: using main estimate
$x_{i j}^{r} = \sqrt{\frac{b_{i} + b_{j}}{n_{i} + n_{j}}} (d_{i j}^{r} - \bar{d_{i j}}) + \bar{d_{i j}}$    ▷ Second correction: using mean estimate
for r ∈ 1…m do
    $t_{c}^{r}, t_{m}^{r} \leftarrow$ Distance-based trees inferred using $x_{i j}^{r}$ and $y_{i j}^{r}$, respectively    ▷ Using tools such as FastME
t_c ← Extended majority consensus of $t_{c}^{1} \dots t_{c}^{m}$
t_m ← A distance-based tree inferred using $\hat{d}$ distances
Assign to each branch e of t_m and t_c support $\frac{1}{m} \sum_{r = 1}^{m} [e \in t_{m}^{r}]$ and $\frac{1}{m} \sum_{r = 1}^{m} [e \in t_{c}^{r}]$, respectively
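The last step of Algorithm 1, counting how often each branch of the final tree appears among the replicate trees, can be sketched as follows. This is an illustration under our own assumptions: trees are represented as sets of bipartitions of the taxon set (in practice they would be parsed from the Newick output of a tool such as FastME), and all names and the five-taxon example are hypothetical.

```python
def canonical_split(side, taxa):
    """Represent a bipartition by the side that excludes the alphabetically first taxon."""
    side = frozenset(side)
    reference = sorted(taxa)[0]
    return frozenset(taxa) - side if reference in side else side

def branch_support(final_tree, replicate_trees):
    """Fraction of replicate trees that contain each branch (split) of the final tree."""
    m = len(replicate_trees)
    return {split: sum(split in tree for tree in replicate_trees) / m
            for split in final_tree}

TAXA = ["A", "B", "C", "D", "E"]

def tree(*sides):
    """Build a tree from one side of each internal branch (toy helper)."""
    return {canonical_split(side, TAXA) for side in sides}

if __name__ == "__main__":
    final = tree(["A", "B"], ["D", "E"])
    replicates = [
        tree(["A", "B"], ["D", "E"]),
        tree(["A", "B"], ["C", "D"]),
        tree(["A", "C"], ["D", "E"]),
        tree(["A", "B"], ["D", "E"]),
    ]
    for split, support in branch_support(final, replicates).items():
        print(sorted(split), support)
```

Storing each split in a canonical orientation makes membership tests independent of how a replicate tree happens to be rooted or written.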

We propose the following procedure (Fig. 2A). Given is a set of N genome skims S_i, each with n reads (see Algorithm 1 for a relaxation where each sample has a different number of reads). We choose a constant α < 1 (default α = 9 / 10) to set b = n^α, noting that b → ∞ and b / n → 0 as n → ∞. We perform m (user-provided) rounds of subsampling. In each round r, we subsample b reads uniformly at random for each skim i and compute distances between these subsampled skims, giving us an estimate $d_{i j}^{r}$ for each pair i, j of skims. These distances must then be corrected and used to estimate one tree per replicate. To obtain the final tree, we can use either the main estimates without subsampling, ${\hat{d}}_{i j}$, or the related quantity $\bar{d_{i j}}$, defined as the average distance (mean estimate) across all m subsample replicates. From the subsampling theorem, we can infer that $\sqrt{n^{α} / n} (d_{i j}^{r} - {\hat{d}}_{i j})$, $\sqrt{n^{α} / n} (d_{i j}^{r} - \bar{d_{i j}})$, $({\hat{d}}_{i j} - d_{i j})$, and $(\bar{d_{i j}} - d_{i j})$ all asymptotically converge to the same distribution. Accordingly, we consider two expressions for correcting distances. First, when the final tree is inferred from main estimates, we center all the subsampled distances around zero, apply the correction, and then center them back around the main estimate:

$y_{i j}^{r} = \sqrt{\frac{n^{α}}{n}} (d_{i j}^{r} - \bar{d_{i j}}) + {\hat{d}}_{i j}$    (1)

Figure 2: A) Subsampling workflow. Every sample with n reads is subsampled m times to generate replicate data with b = n^α reads. Obtained pairwise distances $d_{i j}^{r}$ are corrected with τ_b / τ_n centered by either mean (${\bar{d}}_{i j}$) or main estimates (${\hat{d}}_{i j}$). B) Workflow of Felsenstein zone simulations.

The alternative is to use the extended majority rule (i.e., greedy) consensus of the m replicate trees as the final tree. Since this final tree does not refer to the main distances, we have no reason to use ${\hat{d}}_{i j}$ in the correction and instead use:

$x_{i j}^{r} = \sqrt{\frac{n^{α}}{n}} (d_{i j}^{r} - \bar{d_{i j}}) + \bar{d_{i j}}$    (2)
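For a single pair of skims, both corrections amount to a few arithmetic steps. The sketch below follows the general form given in Algorithm 1, where the two skims may have different read counts; the function names and the rounding of n^α down to an integer are our assumptions, not a description of any released implementation.

```python
import math
import statistics

def subsample_size(n, alpha=0.9):
    """b = n^alpha, rounded down to a whole number of reads (rounding is an assumption)."""
    return int(n ** alpha)

def correct_replicates(replicate_distances, main_estimate, n_i, n_j, alpha=0.9):
    """Return (y, x): replicates corrected around the main and the mean estimate."""
    b_i, b_j = subsample_size(n_i, alpha), subsample_size(n_j, alpha)
    scale = math.sqrt((b_i + b_j) / (n_i + n_j))
    mean_estimate = statistics.fmean(replicate_distances)
    centered = [scale * (d - mean_estimate) for d in replicate_distances]
    y = [c + main_estimate for c in centered]  # Eq. (1): centered on the main estimate
    x = [c + mean_estimate for c in centered]  # Eq. (2): centered on the mean estimate
    return y, x

if __name__ == "__main__":
    replicates = [0.100, 0.104, 0.096, 0.108, 0.092]  # made-up subsampled distances
    y, x = correct_replicates(replicates, main_estimate=0.101, n_i=10**6, n_j=10**6)
    print(y)
    print(x)
```

The two outputs have identical spread and differ only in where they are centered, which is why the choice between them is tied to how the final tree is obtained.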

Finally, when given two assemblies, we apply the subsampling procedure at the k-mer level. Because Skmer computes Jaccard using the min-hash technique (see STAR Methods), which boils down to subsampling k-mers to a sketch size K, we simply repeat the sketching process m times, with sketch sizes smaller than the original (i.e., K^α), to get a distribution of distances.
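To make the idea of repeated, smaller sketches concrete, the following is a generic bottom-k min-hash sketch in plain Python. It is not the Mash or Skmer implementation: the hash function, the use of a different salt per replicate to obtain different random sketches, and all names are assumptions made for illustration only.

```python
import hashlib

def hash64(item, seed):
    """64-bit hash of a string; the salt selects an independent hash function per replicate."""
    salt = seed.to_bytes(8, "big")
    digest = hashlib.blake2b(item.encode(), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")

def bottom_k_sketch(items, size, seed=0):
    """Keep the `size` smallest distinct hash values of a collection of items."""
    return sorted({hash64(x, seed) for x in items})[:size]

def jaccard_from_sketches(sketch_a, sketch_b, size):
    """Bottom-k Jaccard estimate: shared fraction among the smallest hashes of the union."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:size]
    shared = set(sketch_a) & set(sketch_b)
    return sum(h in shared for h in merged) / len(merged)

def replicate_jaccards(items_a, items_b, size, alpha=0.9, m=5):
    """Repeat sketching m times with the reduced sketch size K^alpha."""
    reduced = int(size ** alpha)
    estimates = []
    for r in range(1, m + 1):
        sketch_a = bottom_k_sketch(items_a, reduced, seed=r)
        sketch_b = bottom_k_sketch(items_b, reduced, seed=r)
        estimates.append(jaccard_from_sketches(sketch_a, sketch_b, reduced))
    return estimates

if __name__ == "__main__":
    # Toy "k-mer" sets with a true Jaccard index of 0.5.
    a = [f"kmer{i}" for i in range(15_000)]
    b = [f"kmer{i}" for i in range(5_000, 20_000)]
    print(replicate_jaccards(a, b, size=1000))
```

The replicate Jaccard values would then be converted to distances and corrected in the same way as the read-level subsamples, with sketch sizes playing the role of read counts.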

Simulation results

We start by benchmarking our method on several simulated datasets (see STAR Methods for details).

Distance variance when simulating genome pairs

We first test whether subsampling accurately captures variance in distance estimates. Recall that the definition of correct support is to match the distance distribution $D$ when sequence evolution is repeated. In simulations, we can estimate $D$ by repeating the simulation procedure. Thus, we compare the distribution from subsampling a single run with the distribution obtained across simulation replicates.

The subsampling procedure, with the square root correction, manages to obtain distance distributions close to the ideal distribution $D$ obtained from simulations (Fig. 3A). Note that the change in the center of the distribution is expected, as the subsampled distribution is obtained for one simulation replicate and is centered around its main estimate, ${\hat{d}}_{i j}$. Notably, the variance of the subsampling procedure closely matches the simulated distribution $D$ as long as the true distance is within the 0.02–0.15 range (Fig. 3B). For small distances (d = 0.01), subsampling noticeably overestimates the variance, and conversely, for large distances, it slightly underestimates variance. Importantly, the correction is essential for getting reasonable results; when distance values were left uncorrected (i.e., eliminating $\sqrt{b / n}$), the variance produced by subsampling was dramatically larger than that of $D$ (Fig. 3B).

Figure 3: A) Distance distributions scaled by true distance obtained across 100 simulations or 100 subsampling rounds in one simulated replicate, with and without correction using Eq. (1). B) Variance in distance estimates in simulations (i.e., the ideal distribution $D$) versus subsampled data. C) Impact of distance correction versus α. The x-axis label denotes α on top and the corresponding coverage on the bottom. The original sample had 2x coverage. * shows the default α. Values on top panels correspond to the true distance between pairs of samples in the simulated tree.

The distribution of distances should ideally not depend on α (thus, b), and we observe that correction dramatically reduces the dependence of distributions on α (Fig. 3C). For typical distances (e.g., 0.05), the variance of the distances shrinks very slightly as α increases. The (slightly underestimated) variance of large distances (0.25) does not seem to depend on α. Only in the case of a small distance (d = 0.01), where variance is overestimated with default α, do we observe a noticeable impact as the number of subsampled reads varies. Results indicate that for small distances, while the $\sqrt{b / n}$ correction is effective in dramatically decreasing the gap between ideal and observed variance, it is not exact. When we start with 16x coverage, similar patterns are observed; however, the variance does substantially reduce when subsampled coverage goes above 4x (Fig. S1), a point we further discuss later. Overall, the corrections seem effective, if imperfect, in eliminating the effect of subsampling on the variance of the estimator.

Simulation under Felsenstein-zone quartet trees with long branch attraction

We next evaluate the calibration of branch support values drawn on final trees obtained using either the consensus (Eq. 2) or the main (Eq. 1) estimates on a challenging dataset that simulates conditions prone to long branch attraction using the Felsenstein-zone quartet trees (see STAR Methods). We say that supports are fully calibrated if they perfectly correspond to the probability of correctness. Support values produced by both correction methods tend to be underestimated (Fig. 4A). For example, branches with 70-80% support are correct 86% of the time using the main estimates and 100% of the time using consensus. While support underestimation is more severe for consensus, we observe an unusually high percentage of correct trees with support < 50% with the main estimate. Overall, these support values are conservative.

Figure 4: Results are across 1000 replicates. A) Percentage of correctly inferred topologies (y-axis) for each of seven bins of branch support (x-axis). The percentage of correct branches in each bin should ideally match the median of that bin (dotted line). B) Empirical cumulative distribution function (ECDF) of support values divided between correct and incorrect branches. C) Receiver operating characteristic (ROC) curve built by considering branches with support below thresholds 0-100%, with 1% increments (see Fig. 2B for the definition of the confusion matrix). Selected thresholds are shown on the graph. D) Effect of branch length on estimated support.
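A calibration summary of the kind shown in Fig. 4A can be computed from two parallel lists: the support of each branch and whether that branch is correct. The sketch below is ours; the bin edges and the toy data are placeholders, since the exact seven bins used in the figure are not listed in the text.

```python
def calibration_table(supports, is_correct, edges):
    """For each support bin, report (low, high, number of branches, fraction correct)."""
    rows = []
    for low, high in zip(edges[:-1], edges[1:]):
        is_last_bin = high == edges[-1]
        members = [
            correct
            for support, correct in zip(supports, is_correct)
            # Bins are half-open [low, high); the last bin also includes its upper edge.
            if low <= support < high or (is_last_bin and support == high)
        ]
        fraction = sum(members) / len(members) if members else None
        rows.append((low, high, len(members), fraction))
    return rows

if __name__ == "__main__":
    supports = [0.10, 0.40, 0.55, 0.75, 0.78, 1.00, 1.00]       # toy values
    is_correct = [False, True, False, True, True, True, True]   # toy labels
    for row in calibration_table(supports, is_correct, [0.0, 0.5, 0.7, 0.8, 1.0]):
        print(row)
```

Support is calibrated when the fraction of correct branches in each bin is close to the support values in that bin, conservative when it is higher, and overconfident when it is lower.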

Beyond calibration, we interrogate the predictive power of support values (see STAR Methods) and observe that support values are predictive of accuracy (Fig. 4B, C). There is a large gap between the support distribution of correct and incorrect branches, and the gap is more prominent for consensus than the main-estimate approach (Fig. 4B). Moreover, with the consensus method, increased support perfectly correlates with increased accuracy (Fig. 4A) so that all incorrect topologies have support below 75%, and more than 50% of correct branches have support above 75%. Constructed ROC curves (obtained from confusion matrices built using thresholds of support; see Fig. 2B) show that very low false positive rates and high recall can be obtained by examining branches with > 70% support (Fig. 4C). Moreover, consensus is substantially more powerful in discerning correct and incorrect branches than the main estimate. Finally, note that, as expected, the value of support depends on the ratio between long and short branches (Fig. 4D). Support is invariably low when long branches are two to three orders of magnitude bigger than small branches.
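The threshold sweep behind such an ROC curve can be sketched in a few lines. Because the confusion matrix is defined in Fig. 2B, which is not reproduced in the text, the convention below is an assumption on our part: a branch is called positive when its support is at or above the threshold, and the ground truth is whether the branch is correct. Names and data are illustrative.

```python
def roc_points(supports, is_correct, thresholds):
    """Return (threshold, false positive rate, recall) for each support threshold.

    Assumes at least one correct and one incorrect branch are present.
    """
    positives = sum(is_correct)
    negatives = len(is_correct) - positives
    points = []
    for t in thresholds:
        true_pos = sum(1 for s, c in zip(supports, is_correct) if s >= t and c)
        false_pos = sum(1 for s, c in zip(supports, is_correct) if s >= t and not c)
        points.append((t, false_pos / negatives, true_pos / positives))
    return points

if __name__ == "__main__":
    supports = [0.10, 0.40, 0.55, 0.75, 0.78, 1.00, 1.00]       # toy values
    is_correct = [False, True, False, True, True, True, True]   # toy labels
    thresholds = [t / 100 for t in range(101)]                  # 0-100% in 1% steps
    for t, fpr, recall in roc_points(supports, is_correct, thresholds)[::25]:
        print(t, fpr, recall)
```

Under this convention, raising the threshold trades recall for a lower false positive rate, and a curve that reaches high recall at a near-zero false positive rate indicates that support separates correct from incorrect branches well.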

Simulations on 8-taxon trees with model misspecification.

We next assess the branch support on a simulated 8-taxon dataset meant to challenge our approach by simulating genomes that deviate from the assumptions of the models used for inference. In these simulations (see STAR Methods), model misspecification can make the Skmer estimator biased; thus, the errors in these trees are not just due to variance (of the estimator) but also bias, and there is no theory to suggest that the subsampling procedure can overcome bias. Nevertheless, we evaluate the impact of bias empirically.

With model misspecification, support values > 50% tend to be overestimated while support values ≤ 50% are underestimated (Fig. 5A). Despite the tendency of the method to overestimate support for much of its range, the highest support values are reasonably reliable. In particular, among branches with 100% support, 95% are correct. Thus, unlike the previous simulations with an unbiased estimator, when the estimator is biased, a small but considerable portion of branches with 100% support are incorrect.

Figure 5: Results are across 120 replicates under Jukes and Cantor 1969 (JC69) and Felsenstein’s 1981 (F81) models, and figures are identically set up to Figure 4.

There appears to be a weak dependency between rate variation and the accuracy of the tree and its support value (Fig. S2A). The portion of incorrect branches with 100% support is noticeably higher for trees with the highest rate variation parameters for the main estimates (though not for consensus). Also, all trees with three or more incorrect branches out of five were among replicates with higher rate variation parameters (Fig. S2B). However, beyond these extreme cases, no strong correlation between rate variation and accuracy was observed.

Despite being imperfect, the support values do have the predictive power to distinguish branch correctness (Fig. 5B, C). We observe a large separation between the distributions of support values of correct and incorrect branches (Fig. 5B). As expected, incorrect branches have a wide range and seem uniformly distributed, except for an overabundance of roughly a quarter of incorrect branches that have support close to 100%. Thus, in contrast to previous simulations, the introduction of bias makes false positive rate (FPR) values below 0.25 unavailable; however, the recall does not seem to have been negatively impacted (compare Figs. 4C and 5C). All three ways of evaluation show similar trends regardless of the type of correction used, with the main-estimate correction performing slightly better.

When separating our replicates into those based on fully balanced and fully unbalanced trees (see STAR Methods), we observe only minor differences in overall support patterns (Fig. S2C). Balanced trees tend to have somewhat higher FPR at 100% support but also higher recall, indicating that support overestimation is slightly more severe for balanced trees. The same pattern emerges when we examine branch lengths. Just as in previous simulations, the lower estimated support values tend to be mostly among short branches; however, balanced trees have higher support than caterpillar trees for branches of the same length (Fig. S2D).

Results on Real datasets

Cetaceans: low resolution within species.

We use two cetacean datasets (see STAR Methods) to contrast support for relationships within species (presumably uncertain) versus across species (presumably, highly confident). We see vastly different support patterns for trees inferred within or across species (Figs. 6A, B, S3B). The tree inferred for samples of the Mesoplodon grayi species has low support assigned to the majority of its branches, while the tree inferred across species has full support for every branch. This result is in line with expectations since the signal of tree-like evolution between individuals of the same species should be weak (or non-existent) and the signal across species should be strong given the use of assembled genomes. Thus, our support values show that most of the inferred relationships between individuals of Mesoplodon grayi are biologically unreliable, and these individuals should be considered part of a freely mixing population. This result is in line with the conclusions of Westbury et al. (2021), who used a host of analyses that revealed little to no population structure. The assembly-free tree inferred across cetacean species (obtained in less than an hour, including the time to download the genomes) is highly congruent with the alignment-based trees inferred using a much more complex pipeline by McGowen et al. (2020); thus, the full support across all branches is reasonable.

Figure 6: Branches with no values assigned signify 100% support. The left value corresponds to support for the tree corrected with the main estimate, and the right value denotes consensus. The thickness of the branch corresponds to the magnitude of associated support. Red denotes controversial branches as compared to reference phylogenies. Blue denotes alternative topology between main and consensus trees. A) Whale phylogeny inferred from assemblies. B) Whale tree computed for multiple individuals of the same species, Gray’s beaked whale *Mesoplodon grayi*. C) Bee phylogeny built from assemblies. D) Bee phylogeny obtained from sequencing reads simulated at 1x coverage. E) Drosophila phylogeny built from assemblies. F) Drosophila phylogeny obtained from short-read SRAs.

Low support for conflicting results.

We next use three insect datasets (lice, bee, and drosophila; see STAR Methods) to investigate whether the lack of support corresponds to the difficulty of resolving branches. On all three datasets, we consistently observe that branches with low support in Skmer-based trees are those that conflict with alignment-based trees published in the literature.

On the lice dataset, the phylogeny inferred from genome skims using the main correction has five branches with support below 100% (Fig. S3C), and these are the only branches that differ from the ASTRAL (Mirarab et al., 2015) tree reported by Boyd et al. (2017) (after removing five outgroup species with no skims). Using consensus-based trees also produces support below 100% for all branches disagreeing with the ASTRAL tree but has two extra branches with low support (Fig. S3D).

On the bee dataset, where we have both assemblies and simulated genome skims, we obtain identical tree topologies from both data types (Fig. 6C D). As one would expect, support values are higher for assembly-based trees (all branches 100%, except for three, which are above 92% when computed using the main estimate) than for genome skims (four branches below 90% with the main estimate). Interestingly, all branches that deviated from the alignment-based tree reported by Sun et al. (2021) have support below 100% in both of our trees. One of the two conflicting branches, the placement of B. breviceps, is the only branch without full bootstrap support in the Sun et al. study, the only branch where ASTRAL and concatenation disagreed, and also one of only two branches without full support in the ASTRAL tree by Sun et al. (2021). These results are robust if we keep the number of replicates between 50 and 1000 (Fig. S4AB).

On the Drosophila dataset, we also have trees based on both assemblies and genome skims (Fig. 6E F). The tree generated from genome skims agreed perfectly with the alignment-based tree reported by Miller et al. (2018), and all of its branches had 99% or 100% support with both ways of computing support. The tree based on assemblies and using the main correction, however, had two branches with low support (64% and 45%), and one of the two branches (uniting D. sechellia and D. simulans) was different from the Miller et al. (2018) tree. Interestingly, the Skmer-based resolution matches what van der Linde et al. (2010) had proposed earlier based on 13 marker genes but better taxon sampling. The second low-support branch concerns the position of D. biarmipes, which agrees between the Miller et al. tree and ours but differs from the van der Linde et al. tree. Interestingly, the consensus correction produces the van der Linde et al. resolution using our data, albeit with only 94% support. In summary, both branches without full support in consensus-based and main-based trees show conflicting results within the literature.

Impact of coverage.

Next, to test how the coverage level of a genome skim impacts its power to resolve phylogenetic relationships, we reused the bee dataset but with varying coverage levels (Fig. S4CD). As expected, with higher coverage of the genome skim (i.e., before subsampling), reported support values increase. However, even with coverage as high as 8x, the branch that showed signs of conflict in the literature (i.e., placement of B. breviceps) did not reach full support. The results indicate that the lack of support for this branch was not simply due to lack of coverage.

Support versus length.

Examining the branch length versus support across all datasets with genome skims, we observe an interesting pattern (Fig. S3A). While shorter branches, especially those smaller than 0.001, are frequently associated with lower support, the association is not perfect. For example, among branches with length in the 0.0005 – 0.001 range, support ranges between 50% and 100%. Thus, while short branches are clearly more uncertain than longer ones, not all short branches are equally uncertain.

Discussion

We have shown that subsampling, unlike resampling, represents a reliable approach to derive statistical support for distances and phylogenies estimated using assembly/alignment-free methods. On real data, we observed that lower support values were associated with controversial branches. On the simulated datasets, we saw that while branch support is not always calibrated with the probability of correctness (tends to be underestimated when there is no model violation and overestimated when there is), it is powerful (i.e., distinguishes correct and incorrect branches). Note that a method of support estimation can be very powerful and yet not be calibrated (e.g., consider a case where the estimated support is always half the probability of correctness). The lack of calibration is not unique to our subsampling procedure. Bootstrapping has long been believed to underestimate support (Hillis and Bull, 1993), leading many biologists to use 75% support as a threshold of high support. It appears that our subsampling method has similar tendencies to underestimate support, at least in the absence of model violations.

We observed that the variance of subsampled distances is slightly over and underestimated for very small and long distances, respectively. This apparent bias is likely related to the rate of convergence formula we used; i.e., $τ_{n} = \sqrt{n}$ . Recall that our choice of τ had a justification only for the Jaccard estimation step and not the coverage estimation step. We can empirically assert that our choice was an appropriate choice for most distance ranges but not all (Fig. 3B). It is possible that the rate of convergence of Skmer is higher than $\sqrt{n}$ for small distances and slightly lower for large distances. While a more accurate choice of τ_n requires further theoretical work, we note that the current choice produces reasonable results.

Just like any other method of support estimation, our method has theoretical guarantees only for statistically unbiased estimators. It has been long appreciated in studying evolution that systematic bias leads to support overestimation, as manifested by the confident incongruence between competing phylogenies (Sanderson et al., 2000; Phillips et al., 2004; Philippe et al., 2017; Jeffroy et al., 2006) or increased support of genome-wide data compared to single genes (Taylor, 2004). Our method is not an exception. When we simulated data with model misspecification, we also observed some strongly supported incorrect branches. In particular, about 5% of branches with 100% support seemed to be incorrect, and 25% of incorrect branches had 100% support. Thus, the interpretation of biological results should keep the possibility of model violation in mind. Note that the bias comes from the estimator (here, Skmer+JC model), not the subsampling procedure; thus, the only principled solution is to design improved estimators under more complex models. Such estimators have been suggested in the past (Pham and Zuegg, 2004; Fan et al., 2015) and are being actively developed (Criscuolo, 2019; Balaban et al., 2022a). However, it should be noted that more complex models do not always increase accuracy. As a proof of concept, we reanalyzed three datasets (8-taxon simulations, Drosophila, and Bee real datasets) under the F81 model of evolution, which accounts for variable base frequencies. This more complex model did not substantially change our results in any of the datasets (Figs. 5, S5A–D).

A more subtle form of model violation is the presence of repeats in the genome, which the Skmer approach mostly ignores. To test the impact of repeated regions on reported support estimates, we removed repeats from Drosophila and Bee assemblies using RepeatMasker (Smit et al., 2013) and recomputed phylogenies and support values. The set of uncertain relationships (i.e., those with less than 100% support) does not change after repeat filtering (Fig. S5EF). For Bee genomes (Fig. S5E), where repeat content did not exceed 2.88% (Table S1), the computed tree topology remained unchanged and was identical for main and consensus approaches. However, the support value for one branch did change from high (94%) to low (66%). For Drosophila (Fig. S5F), identified repeat content was higher and more varied, ranging between 2.8% and 10.3% (Table S2). On these data, both types of correction result in the same tree topology, which was not the case before repeat masking. The support for the only branch that changed after masking went from low (45%) with one resolution to moderate (70%) with another resolution. Thus, our findings suggest that the repeat content may affect phylogeny estimation and support. Luckily, these changes seem to be mostly around branches with low support, as identified by our measure. Thus, our support values give users ways to identify unreliable branches.

Like bootstrapping, subsampling requires running the procedure many times and can increase the running time. On a machine with an AMD EPYC 7742 2.25 GHz CPU, using 24 threads and 120 GB of RAM, our running times were manageable (m = 100 in our case). For example, for the lice dataset with 61 samples, each 400MB, the entire process took 7 hours. Drosophila and bee datasets with fewer species and smaller sketches each took around 49 minutes. Thus, subsampling, while not fast, requires reasonable time. We also did not detect any benefit from further increasing m and hence the running time (Fig. S4AB). We note that our tool allows embarrassingly parallel execution through the -i and -b options, which we did not use here.

The subsampling procedure is integrated within the Skmer software and produces both main and consensus estimates. The simulations showed the consensus to be more accurate only when the model was not violated. Since the main estimates have been found to be more accurate elsewhere in phylogenetics (Mirarab et al., 2016) and here on data with model misspecification, they should still be considered a better choice. On our real datasets, in all cases, except the beaked whale genome dataset with no resolution, the two methods produced similar trees. In practice, we advise the users to try both methods and consider any branch with low support in either analysis as suspicious.

Our study has several limitations that future work should address. Most of our tests were on genome skims with low coverage, which is the most challenging case. However, testing the simpler case of genomes with high coverage is also important. We note that Skmer changes how it computes coverage when the coverage seems to be above 4x. This change in the algorithm makes the use of subsampling tricky, as the subsampled data may use a somewhat different estimator than the main estimate. This observation may explain the pattern observed for 16x coverage analyses (Fig. S1). Note that the use of the consensus method can ameliorate this problem as it ensures all estimates are using the same method. Thus, we suggest using the consensus method and selecting α carefully for datasets with high coverage to ensure that the (estimated) coverage is either consistently above 4x or always below it.

We conclude with a note about the generalizability of the method. The subsampling method tested here can be used without much change for other distance-based assembly-free methods. Beyond assembly-free methods, any distance-based method, including alignment-based methods, can also take advantage of this approach. In fact, deep learning methods of distance estimation have already started to adopt our proposed approach (Jiang et al., 2022). For example, alignment-based methods could compute distances based on subsets of sites and correct distances using the equations we noted. We leave it to future work to compare the accuracy of bootstrapping and subsampling in such conditions where both methods are applicable.

STAR Methods

Resource availability

Lead contact

Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Siavash Mirarab (smirarab@ucsd.edu).

Materials availability

This study did not generate new materials.

Data and code availability

This paper analyzes existing, publicly available data. All original studies are referenced in the main text. Accession numbers for the datasets are available in this paper’s supplemental information.
The software is available publicly at https://github.com/shahab-sarmashghi/Skmer. Raw data and summary of results are deposited in https://github.com/noraracht/subsample_support_scripts. The DOI is https://doi.org/10.5281/zenodo.6473473.
Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.

Method details

Skmer

Because we focus on Skmer as the estimator, we review the essential aspects of this estimator. Skmer has two stages and a final calculation.

Stage 1: Skmer uses k-mer frequency profiles (computed using Jellyfish (Marçais and Kingsford, 2011)) to estimate the amount of sequencing error and the coverage (neither of which is known). Let M_i be the number of k-mers observed i times in the genome skim and assign $h = \underset{i \geq 2}{\operatorname{argmax}} M_{i}$ . Also, let $ξ = \frac{M_{h + 1}}{M_{h}} (h + 1)$ . Skmer estimates:

$$λ = \frac{M_{1}}{M_{h}} \frac{ξ^{h}}{h!} e^{- ξ} + ξ (1 - e^{- ξ}), \qquad ϵ = 1 - {(ξ / λ)}^{1 / k} \tag{3}$$

as the k-mer coverage and sequencing error rate, respectively. Then, for each of the two genome skims being compared, i ∈ {1, 2}, let $η_{i} = 1 - e^{- λ_{i} {(1 - ϵ_{i})}^{k}}$ and $ζ_{i} = η_{i} + λ_{i} (1 - {(1 - ϵ_{i})}^{k})$ (for high coverage, Skmer defines ζ_i and η_i differently; see the original paper). Also, let L_i be the estimated genome length (total sequencing amount divided by coverage). By default, Skmer sets k = 31, which we keep. Balaban and Mirarab (2020) have shown that a reduced k can improve accuracy for assemblies, but we used a fixed value since measuring inaccuracies, not increasing the accuracy, is our focus.
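The Stage 1 estimates can be sketched in a few lines. This is an illustration of Equation 3 and of the definitions of η and ζ given above, not the Skmer implementation; the function names and the dictionary-based histogram are assumptions made for the example, and the high-coverage variant mentioned above is not covered.

```python
import math

def estimate_coverage_and_error(hist, k=31):
    """Estimate k-mer coverage (lambda) and error rate (epsilon) as in Eq. 3.

    hist maps a multiplicity i to M_i, the number of distinct k-mers
    observed i times (hypothetical input format).
    """
    # h is the most common multiplicity among i >= 2
    h = max((i for i in hist if i >= 2), key=lambda i: hist[i])
    xi = hist.get(h + 1, 0) / hist[h] * (h + 1)
    lam = (hist[1] / hist[h]) * (xi ** h / math.factorial(h)) * math.exp(-xi) \
        + xi * (1.0 - math.exp(-xi))
    eps = 1.0 - (xi / lam) ** (1.0 / k)
    return lam, eps

def eta_zeta(lam, eps, k=31):
    """Correction terms eta and zeta for one genome skim (low-coverage case)."""
    error_free = (1.0 - eps) ** k          # probability a k-mer has no error
    eta = 1.0 - math.exp(-lam * error_free)
    zeta = eta + lam * (1.0 - error_free)
    return eta, zeta
```

Given a histogram produced by a k-mer counter, `estimate_coverage_and_error(hist)` returns the pair (λ, ϵ), which can then be passed to `eta_zeta` for each of the two skims.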

Stage 2: Skmer uses Mash to compute the Jaccard index J between two skims.

Final calculation: Skmer computes the genomic distance between two genome skims with Jaccard similarity J and genomic parameters estimated above using

$$\hat{d} = 1 - {\left(\frac{2 (ζ_{1} L_{1} + ζ_{2} L_{2}) J}{η_{1} η_{2} (L_{1} + L_{2}) (1 + J)}\right)}^{1 / k} . \tag{4}$$
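Equation 4 translates directly into code. The sketch below is only meant to make the roles of the terms concrete; the function and argument names are hypothetical, and any handling of out-of-range values performed by Skmer itself is not reproduced here.

```python
def skmer_distance(jaccard, eta1, zeta1, len1, eta2, zeta2, len2, k=31):
    """Genomic (Hamming) distance of Eq. 4 from a Jaccard index and
    per-skim parameters eta, zeta, and estimated genome length."""
    numerator = 2.0 * (zeta1 * len1 + zeta2 * len2) * jaccard
    denominator = eta1 * eta2 * (len1 + len2) * (1.0 + jaccard)
    return 1.0 - (numerator / denominator) ** (1.0 / k)

# Example call with made-up parameter values for two skims
d_hat = skmer_distance(jaccard=0.02,
                       eta1=0.75, zeta1=1.30, len1=1.0e8,
                       eta2=0.70, zeta2=1.20, len2=1.1e8)
```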

Skmer can also handle input assemblies by dividing them into k-mers and omitting the correction for low coverage and sequencing error by setting ζ_i = η_i = 1.

Once the Hamming distance is estimated, the phylogenetic analyses use the corrected distance obtained using the standard Jukes and Cantor (1969) correction: $t = -\frac{3}{4} \ln (1 - \frac{4}{3} \hat{d})$ . We use t for phylogenetic inference.
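For reference, the standard Jukes-Cantor correction, $t = -\frac{3}{4} \ln (1 - \frac{4}{3} \hat{d})$, and its inverse can be written as below. The function names are ours, and returning infinity for saturated distances (0.75 or above) is a choice made for this sketch rather than a description of Skmer's behavior.

```python
import math

def jc_distance(hamming):
    """Jukes-Cantor distance (expected substitutions per site) for a
    per-site Hamming distance."""
    if hamming < 0:
        raise ValueError("Hamming distance must be non-negative")
    if hamming >= 0.75:
        return math.inf  # saturation: the logarithm is undefined
    return -0.75 * math.log(1.0 - 4.0 * hamming / 3.0)

def jc_expected_hamming(t):
    """Inverse map: expected Hamming distance after t substitutions per site."""
    return 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))
```

For small distances the correction is minor, and it grows as the Hamming distance approaches 0.75.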

Subsampling

Unless otherwise specified, we used Algorithm 1 with α = 9/10 and m = 100, estimated distances using Skmer (default settings, but increasing sketch size to 10⁶ for simulated data), corrected distances using the JC model, and used FastME (Lefort et al., 2015) with default settings to compute trees. We used RAxML (Stamatakis, 2014) to draw support and compute extended majority rule consensus trees (Appendix A.1). Note that this approach counts bipartitions in a binary fashion and can be sensitive to rogue taxa; alternatives exist and can be adopted in the future (Lemoine et al., 2018).
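The variance-reducing correction applied to the subsampled distances can be illustrated as follows. This is a sketch and not the Skmer code: it assumes a $\sqrt{n}$ rate of convergence, so that deviations of the m subsampled distances from their mean are shrunk by $\sqrt{b / n}$ and then re-centered on either the full-data estimate ("main") or the mean of the subsampled estimates ("consensus"). All names are hypothetical.

```python
import math
from statistics import mean

def correct_subsampled(d_sub, b, n, d_full=None):
    """Rescale distances estimated from subsamples of size b so that their
    spread mimics the uncertainty expected at the full sample size n.

    d_sub  : list of distances, one per subsample replicate
    d_full : full-data estimate; if given, replicates are centered on it
             ("main"), otherwise on the subsample mean ("consensus")
    """
    center = mean(d_sub)
    anchor = center if d_full is None else d_full
    shrink = math.sqrt(b / n)
    return [anchor + shrink * (d - center) for d in d_sub]

# Example with made-up numbers: four replicates, b = n / 4
main = correct_subsampled([0.10, 0.12, 0.08, 0.11], b=25, n=100, d_full=0.10)
consensus = correct_subsampled([0.10, 0.12, 0.08, 0.11], b=25, n=100)
```

Each corrected replicate would then be JC-corrected and used to build one replicate tree, from which branch support is counted.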

Datasets

We evaluated the performance on three simulated and several real biological datasets.

Genome pairs.

We used genome pairs to test the validity of distance distributions. We simulated 100Mbp genome pairs at phylogenetic distance t ∈ {0.01, 0.02, 0.05, 0.10, 0.15, 0.25}, repeating the procedure 100 times (so 700 genome pairs), and simulated genome skims at 2x and 16x coverage (Appendix A.1). Note that these distances are in the unit of the expected number of substitutions per site. Next, for each distance t, we arbitrarily selected the 7th sample and used the subsampling procedure to generate a distance distribution, which we then compared against $D$ obtained across the 100 simulation runs. Note that the choice of α should ideally not change the distance distribution (if the correction factor $\sqrt{b / n}$ is correct). To empirically test this assumption, we performed extra experiments for true distances at {0.01, 0.05, 0.25}. We repeated the subsampling procedure with α ∈ {4 / 5, 6 / 7, 9 / 10, 13 / 14, 20 / 21}, leading to coverage {0.11, 0.25, 0.47, 0.71, 1.00} when starting from 2x and {0.58, 1.50, 3.05, 4.89, 7.26} when starting from 16x. Skmer failed to estimate coverage and genome length in rare cases, and these replicates were excluded from the analysis.
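The relation between α and the coverage of a subsample can be made explicit with a short calculation. The sketch below assumes that a subsample keeps b = n^α of the n reads and that reads are 100 bp long; neither assumption is stated in this section, but together they reproduce the coverage values listed above for a 100Mbp genome.

```python
ALPHAS = [4 / 5, 6 / 7, 9 / 10, 13 / 14, 20 / 21]

def subsample_coverage(coverage, n_reads, alpha):
    """Expected coverage after keeping b = n_reads**alpha reads (assumed)."""
    return coverage * n_reads ** (alpha - 1.0)

def coverage_table(coverage, genome_len=100_000_000, read_len=100):
    """Subsample coverage for each alpha, rounded to two decimals.

    read_len = 100 is an assumption made for this illustration.
    """
    n_reads = coverage * genome_len / read_len
    return [round(subsample_coverage(coverage, n_reads, a), 2) for a in ALPHAS]

print(coverage_table(2))   # [0.11, 0.25, 0.47, 0.71, 1.0]
print(coverage_table(16))  # [0.58, 1.5, 3.05, 4.89, 7.26]
```

Because the subsample coverage scales as n^(α−1), the same α removes a larger fraction of the data from deeper skims.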

We used INDELible (Fletcher and Yang, 2009) to generate long sequences (representing genomes) under the Felsenstein (1981) (F81) model with uneven base frequencies 0.26, 0.21, 0.24, 0.29 for T, C, A and G and no indels. Note that the F81 model with base frequencies so close to 1 / 4 is extremely similar to the JC model used for phylogenetic distance inference by Skmer. We used ART (Huang et al., 2012) with its default short-read error profile to produce low-coverage error-prone reads that simulate genome skims (Appendix A.1).

Felsenstein-zone quartet trees.

We simulated a challenging dataset meant to create conditions prone to long branch attraction. First, we generated 1000 quartet trees with branch lengths close to the Felsenstein zone; i.e., three short branches and two separate long branches (Fig. 2B). For each replicate, we drew the short lengths from a log-uniform distribution between 0.0001 and 0.001 and the long lengths from a uniform distribution between 0.05 and 0.12 (Fig. S6AB). Once the trees were available, we used INDELible with the F81 model and ART, with settings identical to the genome pairs experiments.
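The branch-length sampling for one quartet replicate can be sketched as below. The ranges follow the text; the placement of the two long branches on non-sister leaves (the usual Felsenstein-zone arrangement), the taxon labels, and the function names are assumptions made for the example.

```python
import math
import random

def log_uniform(rng, low, high):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def draw_quartet_lengths(rng):
    """Three short branches (log-uniform) and two long branches (uniform)."""
    short = [log_uniform(rng, 1e-4, 1e-3) for _ in range(3)]
    long_ = [rng.uniform(0.05, 0.12) for _ in range(2)]
    return short, long_

def felsenstein_quartet_newick(rng):
    """Unrooted quartet AB|CD with long branches on the non-sister leaves
    A and C (assumed arrangement); the internal branch is short."""
    (s_b, s_d, s_internal), (l_a, l_c) = draw_quartet_lengths(rng)
    return (f"((A:{l_a:.6f},B:{s_b:.6f}):{s_internal:.6f},"
            f"C:{l_c:.6f},D:{s_d:.6f});")

rng = random.Random(2022)  # arbitrary seed for reproducibility
print(felsenstein_quartet_newick(rng))
```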

8-taxon simulations with model misspecification.

We simulated an 8-taxon dataset with a procedure similar to the previous datasets, with two changes to the model: 1) we used the GTR model of Tavaré (1986), which can greatly violate the assumptions of the JC model used for estimation, and 2) we added (unmodelled) rate variation across sites using the standard Gamma model of rate variation (Jin and Nei, 1990). In addition, this dataset included eight taxa, necessitating the choice of a topology: we used both fully balanced and fully unbalanced (i.e., caterpillar) topologies and simulated 120 replicates for each. Branch lengths were randomly selected from a log-uniform distribution between 0.00001 and 0.12 (Fig. S6C). We used Dirichlet (19, 14, 14, 19) to draw the base frequencies for A, C, G, and T. Entries of the GTR matrices were drawn from Dirichlet (50, 7, 12, 12, 14, 50) for C ↔ T, A ↔ T, G ↔ T, A ↔ C, G ↔ C, G ↔ A. To set rates across sites, the 1-centered Gamma model was used with α drawn from a log-normal distribution with log mean 2.55 and log standard deviation 0.18. Note that α is the inverse of the variance of the Gamma distribution of rate multipliers. All these hyperparameters were estimated using maximum likelihood estimation from a collection of published gene trees from the ruminant genomes project (Chen et al., 2019).
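Drawing one set of model parameters for a replicate can be sketched with the standard library alone. The hyperparameters are those given above; interpreting "log mean" and "log standard deviation" as the mean and standard deviation of log α is our assumption, and the function names are hypothetical.

```python
import random

def dirichlet(rng, alphas):
    """Dirichlet draw obtained by normalizing independent Gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def draw_gtr_gamma_parameters(rng):
    """Base frequencies, GTR exchangeabilities, and Gamma shape alpha."""
    freqs = dict(zip("ACGT", dirichlet(rng, [19, 14, 14, 19])))
    pairs = ["CT", "AT", "GT", "AC", "GC", "GA"]
    rates = dict(zip(pairs, dirichlet(rng, [50, 7, 12, 12, 14, 50])))
    shape = rng.lognormvariate(2.55, 0.18)  # assumed: parameters of log(alpha)
    return freqs, rates, shape

def site_rate(rng, shape):
    """Rate multiplier from a mean-one Gamma: shape alpha, scale 1/alpha,
    hence variance 1/alpha."""
    return rng.gammavariate(shape, 1.0 / shape)

rng = random.Random(8)  # arbitrary seed
freqs, rates, shape = draw_gtr_gamma_parameters(rng)
```

The resulting values would then be written into the simulator's configuration for that replicate.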

Measuring accuracy on simulated datasets

To evaluate the accuracy, after computing support using Algorithm 1, we bin all 1000 replicates into seven groups based on the support value of their only internal branch and compute how often trees in each bin are correctly inferred (Fig. 2B). If support values are calibrated, the percentage of correct replicates in each bin should be close to the midpoint of that bin. However, even when support values are not calibrated, they can still be predictive (i.e., informative in separating correct and incorrect branches). We evaluate the power of support in distinguishing correct/incorrect branches using ROC curves (Fig. 2B) and drawing the support distribution for correct/incorrect trees. We also show the empirical cumulative distribution function (ECDF) of the support of correct and incorrect branches.
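These three summaries (calibration by bin, ROC points, and ECDF) are simple to compute once each branch has a support value and a correct/incorrect label. The sketch below is illustrative; the bin edges are passed in by the caller because the exact edges of the seven bins are not restated here, and all names are hypothetical.

```python
def calibration_table(supports, correct, edges):
    """Fraction of correct branches per support bin [lo, hi); the last bin
    also includes its upper edge. Empty bins are skipped."""
    rows = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        last = hi == edges[-1]
        idx = [i for i, s in enumerate(supports)
               if lo <= s < hi or (last and s == hi)]
        if idx:
            frac = sum(bool(correct[i]) for i in idx) / len(idx)
            rows.append((lo, hi, len(idx), frac))
    return rows

def roc_points(supports, correct):
    """(FPR, recall) pairs obtained by accepting branches whose support is
    at least each observed threshold, from strictest to most lenient."""
    pos = sum(bool(c) for c in correct)
    neg = len(correct) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need both correct and incorrect branches")
    points = []
    for thr in sorted(set(supports), reverse=True):
        tp = sum(1 for s, c in zip(supports, correct) if s >= thr and c)
        fp = sum(1 for s, c in zip(supports, correct) if s >= thr and not c)
        points.append((fp / neg, tp / pos))
    return points

def ecdf(values):
    """Empirical cumulative distribution function as (value, fraction) pairs."""
    xs = sorted(values)
    return [(x, (i + 1) / len(xs)) for i, x in enumerate(xs)]
```

Calibrated support would give a fraction of correct branches close to each bin's midpoint, whereas predictive power shows up as ROC points far from the diagonal.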

Real biological datasets.

We use five biological datasets, including those with assemblies, raw reads, and simulated genome skims.

Cetaceans.

We used two cetacean datasets, one for within-species and one for cross-species relationships. We used all 27 cetacean assemblies from NCBI (Table S3) to create the across-species dataset. For within-species, we used the dataset of 20 low-coverage genome skims of Gray’s beaked whale Mesoplodon grayi (Table S4) published by Westbury et al. (2021). We removed adapters, deduplicated these samples, and merged paired-end reads using BBTools (Bushnell et al., 2017) (Appendix A.1). Since contaminants can impact distances (Rachtman et al., 2020), we filtered out reads that do not align to the reference genome of Mesoplodon bidens using Bowtie2 (Langmead and Salzberg, 2012; Langmead et al., 2019) (sensitive setting; see Appendix A.1). For genome skims, we used α = 20 / 21 instead of the default 9 / 10 to accommodate the low coverage of these samples before subsampling (0.78–0.96x).

Insect datasets.

We used 19 bee reference genomes (Sun et al., 2021) (Table S5) and simulated genome skims at 1x, 2x, 4x, and 8x coverage using ART. We also used 14 Drosophila assemblies as well as their corresponding high-coverage SRAs (Table S6), which we downsampled to 200 MB using seqtk (Li, 2018) to get real genome skims. Assemblies were used without preliminary processing. For real skims, we used a pipeline similar to that used for beaked whales to remove adapters, deduplicate reads, and merge paired-end reads (Appendix A.1). We filtered out human reads using Kraken (Wood et al., 2019) and filtered out microbial contaminants by querying reads against the GTDB database using CONSULT (Rachtman et al., 2021). The rest of the procedure was identical to that for beaked whale genome skims, but we used the default α = 9 / 10.

We also used a real lice dataset by Boyd et al. (2017), for which the original paper provides an alignment-based tree, to test whether our support values can identify questionable branches. We collected 61 high-coverage lice SRAs (Table S7) and downsampled them to 400 Mbp to emulate genome skims. We removed adapters, deduplicated these samples, and merged paired-end reads (Appendix A.1), but no filtering of contaminants was performed.

Supplementary Material

Supplemental Information

NIHMS1824003-supplement-Supplemental_Information.pdf (899.6 KB, PDF)

Key resources table

REAGENT or RESOURCE	SOURCE	IDENTIFIER
Antibodies
Bacterial and virus strains
Biological samples
Whale sequencing reads	Westbury et al., 2021	Table S4
Bee sequencing reads	Sun et al., 2021	Table S5
Drosophila sequencing reads	Miller et al., 2018	Table S6
Lice sequencing reads	Boyd et al., 2017	Table S7
Chemicals, peptides, and recombinant proteins
Critical commercial assays
Deposited data
Experimental models: Cell lines
Experimental models: Organisms/strains
Oligonucleotides
Recombinant DNA
Software and algorithms
Skmer code	This paper	https://github.com/shahab-sarmashghi/Skmer
ART	Huang et al., 2012	https://www.niehs.nih.gov/research/resources/software/biostatistics/art/
Seqtk	N/A	https://github.com/lh3/seqtk
Kraken2	Wood et al., 2019	https://github.com/DerrickWood/kraken2
Bowtie2	Langmead et al., 2019	https://sourceforge.net/projects/bowtie-bio/files/bowtie2/
CONSULT	Rachtman et al., 2021	https://github.com/noraracht/CONSULT
BBTools	Bushnell et al., 2017	https://sourceforge.net/projects/bbmap/files/
INDELible	Fletcher and Yang, 2009	http://abacus.gene.ucl.ac.uk/software/indelible/
FastME	Lefort et al., 2015	http://www.atgc-montpellier.fr/fastme/
RAxML	Stamatakis, 2014	https://github.com/stamatak/standard-RAxML
RepeatMasker	N/A	http://www.repeatmasker.org/
Other
Data and summary of analyses	This paper	https://github.com/noraracht/subsample_support_scripts https://doi.org/10.5281/zenodo.6473473


Box 1: Progress and Potential.

Comparison of genomic sequences is typically performed by aligning sequences against each other. However, in many applications of next generation sequencing (NGS) data, which consist of reads from short fragments of DNA, building alignments is either costly or impossible. For example, ecologists have started to probe biodiversity of an ecological environment using whole-genome sequencing and a low-cost technique called genome skimming that only creates partial data from each organism at levels insufficient for building longer sequences required for alignment. In such scenarios, we can still compare genomic reads without reliance on alignment using special algorithms. For example, we can represent a set of sequences by dividing them into small words of length k (the so-called k-mers) and compute the distance between genomes solely based on these smaller units. While the bioinformatics community has made big strides in developing methods for alignment-free sequence comparison, less attention has been paid to quantification of the uncertainty around those estimates. Since comparison of sequence data always needs to contend with issues of noise and incompleteness in the data, the lack of robust methods for measuring uncertainty is a problem. For example, phylogenetic inference of evolutionary histories using alignment-free methods is adopted far less widely than the alignment-based methods, partially because inferred evolutionary histories need to be furnished with a notion of uncertainty before biologists can interpret them. In this paper, we introduce a new general method for measuring uncertainty of distances computed using alignment-free methods by adapting a technique well-known to statisticians to this particular domain. Our method simply consists of running a sequence comparison method on subsamples of the data and a “correction” step at the end. We show on simulated and real data that the method performs well on a range of datasets.

Highlights.

Propose to quantify uncertainty on k-mer based phylogenies by using subsampling
Develop a correction procedure to account for size of subsample replicate
Estimated distribution of distances matches the true uncertainty of the estimator
Computed support values can accurately identify ambiguous branches on phylogenies

Acknowledgements

This work was supported by the National Science Foundation (NSF) grant IIS-1815485 to ER, SS, VB, and SM, and NIH grant 1R35GM142725 to ER and SM. Computations were performed on the San Diego Supercomputer Center (SDSC) through the Extreme Science and Engineering Discovery Environment (XSEDE) supported by NSF grant number ACI-1548562.

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Declaration of Interests

The authors declare no competing interests.

References

Allman ES, Rhodes JA, and Sullivant S (2017). Statistically Consistent k-mer Methods for Phylogenetic Tree Reconstruction. Journal of Computational Biology, 24(2):153–171.
Balaban M, Bristy NA, Faisal A, Bayzid MS, and Mirarab S (2022a). Genome-wide alignment-free phylogenetic distance estimation under a no strand-bias model. bioRxiv, page 2021.11.10.468111.
Balaban M, Jiang Y, Roush D, Zhu Q, and Mirarab S (2022b). Fast and accurate distance-based phylogenetic placement using divide and conquer. Molecular Ecology Resources, 22(3):1213–1227.
Balaban M and Mirarab S (2020). Phylogenetic double placement of mixed samples. Bioinformatics, 36(Supplement_1):i335–i343.
Balaban M, Sarmashghi S, and Mirarab S (2020). APPLES: Scalable Distance-Based Phylogenetic Placement with or without Alignments. Systematic Biology, 69(3):566–578.
Bogusz M and Whelan S (2016). Phylogenetic Tree Estimation With and Without Alignment: New Distance Methods and Benchmarking. Systematic Biology, 66(2):218–231.
Bohmann K, Mirarab S, Bafna V, and Gilbert MTP (2020). Beyond DNA barcoding: The unrealized potential of genome skim data in sample identification. Molecular Ecology, 29(14):2521–2534.
Boyd BM, Allen JM, Nguyen N-P, Sweet AD, Warnow T, Shapiro MD, Villa SM, Bush SE, Clayton DH, and Johnson KP (2017). Phylogenomics using Target-Restricted Assembly Resolves Intrageneric Relationships of Parasitic Lice (Phthiraptera: Columbicola). Systematic Biology, 66(6):896–911.
Bushnell B, Rood J, and Singer E (2017). BBMerge – Accurate paired shotgun read merging via overlap. PLOS ONE, 12(10):1–15.
Chen L, Qiu Q, Jiang Y, Wang K, Lin Z, Li Z, Bibi F, Yang Y, Wang J, Nie W, Su W, Liu G, Li Q, Fu W, Pan X, Liu C, Yang J, Zhang C, Yin Y, Wang Y, Zhao Y, Zhang C, Wang Z, Qin Y, Liu W, Wang B, Ren Y, Zhang R, Zeng Y, da Fonseca RR, Wei B, Li R, Wan W, Zhao R, Zhu W, Wang Y, Duan S, Gao Y, Zhang YE, Chen C, Hvilsom C, Epps CW, Chemnick LG, Dong Y, Mirarab S, Siegismund HR, Ryder OA, Gilbert MTP, Lewin HA, Zhang G, Heller R, and Wang W (2019). Large-scale ruminant genome sequencing provides insights into their evolution and distinct traits. Science (New York, N.Y.), 364(6446).
Coissac E, Hollingsworth PM, Lavergne S, and Taberlet P (2016). From barcodes to genomes: extending the concept of DNA barcoding. Molecular Ecology, 25(7):1423–1428.
Criscuolo A (2019). A fast alignment-free bioinformatics procedure to infer accurate distance-based phylogenetic trees from genome assemblies. Research Ideas and Outcomes, 5:e36178.
Efron B (1979). Bootstrap Methods: Another Look at the Jackknife. The Annals of Statistics, 7(1):1–26.
Fan H, Ives AR, Surget-Groba Y, and Cannon CH (2015). An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data. BMC Genomics, 16(1):522.
Felsenstein J (1981). Evolutionary trees from DNA sequences: A maximum likelihood approach. Journal of Molecular Evolution, 17(6):368–376.
Felsenstein J (1985). Confidence Limits on Phylogenies: An Approach Using the Bootstrap. Evolution, 39(4):783–791.
Felsenstein J and Kishino H (1993). Is there something wrong with the bootstrap on phylogenies? A reply to Hillis and Bull. Systematic Biology, 42(2):193–200.
Fletcher W and Yang Z (2009). INDELible: A flexible simulator of biological sequence evolution. Molecular Biology and Evolution, 26(8):1879–1888.
Haubold B (2014). Alignment-free phylogenetics and population genetics. Briefings in Bioinformatics, 15(3):407–418.
Hillis DM and Bull JJ (1993). An Empirical Test of Bootstrapping as a Method for Assessing Confidence in Phylogenetic Analysis. Systematic Biology, 42(2):182–192.
Höhl M and Ragan MA (2007). Is multiple-sequence alignment required for accurate inference of phylogeny? Systematic Biology, 56(2):206–221.
Holder M and Lewis PO (2003). Phylogeny estimation: traditional and Bayesian approaches. Nature Reviews Genetics, 4(4):275–284.
Huang W, Li L, Myers JR, and Marth GT (2012). ART: A next-generation sequencing read simulator. Bioinformatics, 28(4):593–594.
Jeffroy O, Brinkmann H, Delsuc F, and Philippe H (2006). Phylogenomics: the beginning of incongruence? Trends in Genetics, 22(4):225–231.
Jiang Y, Balaban M, Zhu Q, and Mirarab S (2022). DEPP: Deep Learning Enables Extending Species Trees using Single Genes. Systematic Biology, page 2021.01.22.427808.
Jin L and Nei M (1990). Limitations of the evolutionary parsimony method of phylogenetic analysis. Molecular Biology and Evolution, 7(1):82–102.
Jukes TH and Cantor CR (1969). Evolution of protein molecules. Mammalian protein metabolism, 3:21–132.
Jun S-R, Sims GE, Wu GA, and Kim S-H (2010). Whole-proteome phylogeny of prokaryotes by feature frequency profiles: An alignment-free method with optimal feature resolution. Proceedings of the National Academy of Sciences, 107(1):133–138.
Langmead B and Salzberg SL (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods, 9(4):357–359.
Langmead B, Wilks C, Antonescu V, and Charles R (2019). Scaling read aligners to hundreds of threads on general-purpose processors. Bioinformatics, 35(3):421–432.
Lau A-K, Dörrer S, Leimeister C-A, Bleidorn C, and Morgenstern B (2019). Read-SpaM: assembly-free and alignment-free comparison of bacterial genomes with low sequencing coverage. BMC Bioinformatics, 20(S20):638. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lefort V, Desper R, and Gascuel O (2015). FastME 2.0: A Comprehensive, Accurate, and Fast Distance-Based Phylogeny Inference Program. Molecular Biology and Evolution, 32(10):2798–2800. [DOI] [PMC free article] [PubMed] [Google Scholar]
Leimeister C-A and Morgenstern B (2014). Kmacs: the k-mismatch average common substring approach to alignment-free sequence comparison. Bioinformatics, 30(14):2000–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
Leimeister C-A, Sohrabi-Jahromi S, and Morgenstern B (2017). Fast and accurate phylogeny reconstruction using filtered spaced-word matches. Bioinformatics, 33(7):btw776. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lemoine F, Domelevo Entfellner J-B, Wilkinson E, Correia D, Dávila Felipe M, De Oliveira T, and Gascuel O (2018). Renewing Felsenstein’s phylogenetic bootstrap in the era of big data. Nature, 556(7702):452–456. [DOI] [PMC free article] [PubMed] [Google Scholar]
Li H (2018). Seqtk, toolkit for processing sequences in FASTA/Q formats. [Google Scholar]
Maddison W (1989). Reconstructing character evolution on polytomous cladograms. Cladistics, 5(4):365–377. [DOI] [PubMed] [Google Scholar]
Marçais G and Kingsford C (2011). A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics, 27(6):764–770. [DOI] [PMC free article] [PubMed] [Google Scholar]
McGowen MR, Tsagkogeorga G, Álvarez-Carretero S, dos Reis M, Struebig M, Deaville R, Jepson PD, Jarman S, Polanowski A, Morin PA, and Rossiter SJ (2020). Phylogenomic Resolution of the Cetacean Tree of Life Using Target Sequence Capture. Systematic Biology, 69(3):479–501. [DOI] [PMC free article] [PubMed] [Google Scholar]
Miller DE, Staber C, Zeitlinger J, and Hawley RS (2018). Highly contiguous genome assemblies of 15 Drosophila species generated using nanopore sequencing. G3: Genes, Genomes, Genetics, 8(10):3131–3141. [DOI] [PMC free article] [PubMed] [Google Scholar]
Mirarab S, Bayzid M, Boussau B, and Warnow T (2015). Response to Comment on “Statistical binning enables an accurate coalescent-based estimation of the avian tree”. Science, 350(6257). [DOI] [PubMed] [Google Scholar]
Mirarab S, Bayzid MS, and Warnow T (2016). Evaluating Summary Methods for Multilocus Species Tree Estimation in the Presence of Incomplete Lineage Sorting. Systematic Biology, 65(3):366–380. [DOI] [PubMed] [Google Scholar]
Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, and Phillippy AM (2016). Mash: fast genome and metagenome distance estimation using MinHash. Genome Biology, 17(1):132. [DOI] [PMC free article] [PubMed] [Google Scholar]
Pham TD and Zuegg J (2004). A probabilistic measure for alignment-free sequence comparison. Bioinformatics, 20(18):3455–3461. [DOI] [PubMed] [Google Scholar]
Philippe H, de Vienne DM, Ranwez V, Roure B, Baurain D, and Delsuc F (2017). Pitfalls in supermatrix phylogenomics. European Journal of Taxonomy. [Google Scholar]
Phillips MJ, Delsuc F, and Penny D (2004). Genome-scale phylogeny and the detection of systematic biases. Molecular Biology and Evolution. [DOI] [PubMed] [Google Scholar]
Politis DN, Romano JP, and Wolf M (1999). Subsampling. Springer Science & Business Media. [Google Scholar]
Rachtman E, Bafna V, and Mirarab S (2021). CONSULT: accurate contamination removal using locality-sensitive hashing. NAR Genomics and Bioinformatics, 3(3):2631–9268. [DOI] [PMC free article] [PubMed] [Google Scholar]
Rachtman E, Balaban M, Bafna V, and Mirarab S (2020). The impact of contaminants on the accuracy of genome skimming and the effectiveness of exclusion read filters. Molecular Ecology Resources, 20(3):1755–0998. [DOI] [PubMed] [Google Scholar]
Salichos L and Rokas A (2013). Inferring ancient divergences requires genes with strong phylogenetic signals. Nature, 497(7449):327–31. [DOI] [PubMed] [Google Scholar]
Sanderson MJ, Wojciechowski MF, Hu J-M, Khan TS, and Brady SG (2000). Error, Bias, and Long-Branch Attraction in Data for Two Chloroplast Photosystem Genes in Seed Plants. Molecular Biology and Evolution, 17(5):782–797. [DOI] [PubMed] [Google Scholar]
Sarmashghi S, Bohmann K, Gilbert MTP, Bafna V, and Mirarab S (2019). Skmer: assembly-free and alignment-free sample identification using genome skims. Genome Biology, 20(1):34. [DOI] [PMC free article] [PubMed] [Google Scholar]
Simmons MP and Gatesy J (2021). Collapsing dubiously resolved gene-tree branches in phylogenomic coalescent analyses. Molecular Phylogenetics and Evolution, 158:107092. [DOI] [PubMed] [Google Scholar]
Smit A, Hubley R, and Green P (2013). RepeatMasker Open-4.0. [Google Scholar]
Stamatakis A (2014). RAxML version 8: A tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics, 30(9):1312–1313. [DOI] [PMC free article] [PubMed] [Google Scholar]
Sun C, Huang J, Wang Y, Zhao X, Su L, Thomas GWC, Zhao M, Zhang X, Jungreis I, Kellis M, Vicario S, Sharakhov IV, Bondarenko SM, Hasselmann M, Kim CN, Paten B, Penso-Dolfin L, Wang L, Chang Y, Gao Q, Ma L, Ma L, Zhang Z, Zhang H, Zhang H, Ruzzante L, Robertson HM, Zhu Y, Liu Y, Yang H, Ding L, Wang Q, Ma D, Xu W, Liang C, Itgen MW, Mee L, Cao G, Zhang Z, Sadd BM, Hahn MW, Schaack S, Barribeau SM, Williams PH, Waterhouse RM, and Mueller RL (2021). Genus-Wide Characterization of Bumblebee Genomes Provides Insights into Their Evolution and Variation in Ecological and Behavioral Traits. Molecular Biology and Evolution, 38(2):486–501. [DOI] [PMC free article] [PubMed] [Google Scholar]
Susko E (2009). Bootstrap support is not first-order correct. Systematic Biology, 58(2):211–223. [DOI] [PubMed] [Google Scholar]
Tang K, Ren J, and Sun F (2019). Afann: bias adjustment for alignment-free sequence comparison based on sequencing data using neural network regression. Genome Biology, 20(1):266. [DOI] [PMC free article] [PubMed] [Google Scholar]
Tavaré S (1986). Some Probabilistic and Statistical Problems in the Analysis of DNA Sequences. Lectures on Mathematics in the Life Sciences, 17:57–86. [Google Scholar]
Taylor DJ (2004). An Assessment of Accuracy, Error, and Conflict with Support Values from Genome-Scale Phylogenetic Data. Molecular Biology and Evolution, 21(8):1534–1537. [DOI] [PubMed] [Google Scholar]
Townsend JP, Su Z, and Tekle YI (2012). Phylogenetic Signal and Noise: Predicting the Power of a Data Set to Resolve Phylogeny. Systematic Biology, 61(5):835. [DOI] [PubMed] [Google Scholar]
van der Linde K, Houle D, Spicer GS, and Steppan SJ (2010). A supermatrix-based molecular phylogeny of the family Drosophilidae. Genetics Research, 92(1):25–38. [DOI] [PubMed] [Google Scholar]
Weitemier K, Straub SCK, Cronn RC, Fishbein M, Schmickl R, McDonnell A, and Liston A (2014). Hyb-Seq: Combining Target Enrichment and Genome Skimming for Plant Phylogenomics. Applications in Plant Sciences, 2(9):1400042. [DOI] [PMC free article] [PubMed] [Google Scholar]
Westbury MV, Thompson KF, Louis M, Cabrera AA, Skovrind M, Castruita JAS, Constantine R, Stevens JR, and Lorenzen ED (2021). Ocean-wide genomic variation in Gray’s beaked whales, Mesoplodon grayi. Royal Society Open Science, 8(3):201788. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wittler R (2020). Alignment- and reference-free phylogenomics with colored de Bruijn graphs. Algorithms for Molecular Biology, 15(1):4. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wood DE, Lu J, and Langmead B (2019). Improved metagenomic analysis with Kraken 2. Genome Biology, 20(1):257. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wu GA, Jun S-R, Sims GE, and Kim S-H (2009). Whole-proteome phylogeny of large dsDNA virus families by an alignment-free method. Proceedings of the National Academy of Sciences, 106(31):12826–12831. [DOI] [PMC free article] [PubMed] [Google Scholar]
Yi H and Jin L (2013). Co-phylog: an assembly-free phylogenomic approach for closely related organisms. Nucleic Acids Research, 41(7):e75. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zhang C, Rabiee M, Sayyari E, and Mirarab S (2018). ASTRAL-III: polynomial time species tree reconstruction from partially resolved gene trees. BMC Bioinformatics, 19(S6):153. [DOI] [PMC free article] [PubMed] [Google Scholar]
Zielezinski A, Girgis HZ, Bernard G, Leimeister C-A, Tang K, Dencker T, Lau AK, Röhling S, Choi JJ, Waterman MS, Comin M, Kim S-H, Vinga S, Almeida JS, Chan CX, James BT, Sun F, Morgenstern B, and Karlowski WM (2019). Benchmarking of alignment-free sequence comparison methods. Genome Biology, 20(1):144. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data

Supplementary Materials

Supplemental Information

NIHMS1824003-supplement-Supplemental_Information.pdf (899.6 KB, PDF)

Data Availability Statement

This paper analyzes existing, publicly available data. All original studies are referenced in the main text. Accession numbers for the datasets are available in this paper’s supplemental information.
The software is publicly available at https://github.com/shahab-sarmashghi/Skmer. Raw data and a summary of the results are deposited at https://github.com/noraracht/subsample_support_scripts. The DOI is https://doi.org/10.5281/zenodo.6473473.
Any additional information required to reanalyze the data reported in this paper is available from the lead contact upon request.
