Estimating repeat spectra and genome length from low-coverage genome skims with RESPECT

Shahab Sarmashghi; Metin Balaban; Eleonora Rachtman; Behrouz Touri; Siavash Mirarab; Vineet Bafna

doi:10.1371/journal.pcbi.1009449

2021 Nov 15;17(11):e1009449. doi: 10.1371/journal.pcbi.1009449

Editor: Nicola Segata⁴

PMCID: PMC8629397 PMID: 34780468

Abstract

The cost of sequencing the genome is dropping at a much faster rate compared to assembling and finishing the genome. The use of lightly sampled genomes (genome-skims) could be transformative for genomic ecology, and results using k-mers have shown the advantage of this approach in identification and phylogenetic placement of eukaryotic species. Here, we revisit the basic question of estimating genomic parameters such as genome length, coverage, and repeat structure, focusing specifically on estimating the k-mer repeat spectrum. We show using a mix of theoretical and empirical analysis that there are fundamental limitations to estimating the k-mer spectra due to ill-conditioned systems, and that has implications for other genomic parameters. We get around this problem using a novel constrained optimization approach (Spline Linear Programming), where the constraints are learned empirically. On reads simulated at 1X coverage from 66 genomes, our method, REPeat SPECTra Estimation (RESPECT), had 2.2% error in length estimation compared to 27% error previously achieved. In shotgun sequenced read samples with contaminants, RESPECT length estimates had median error 4%, in contrast to other methods that had median error 80%. Together, the results suggest that low-pass genomic sequencing can yield reliable estimates of the length and repeat content of the genome. The RESPECT software will be publicly available at https://github.com/shahab-sarmashghi/RESPECT.git.

Author summary

The cost of sequencing the genome is dropping at a much faster rate compared to assembling and finishing the genome. The use of lightly sampled genomes (genome skims) could be transformative for genomic ecology. Analyzing genome skims, mostly based on statistics of small oligomers, remains challenging, but recent results have shown the advantage of this approach for the identification and phylogenetic placement of eukaryotic species. In this paper, we present a method, RESPECT, to estimate genomic properties such as genome length and repetitiveness from low-coverage genome skims. We trained RESPECT using assembled genomes and tested it on low-coverage simulated and real reads. Benchmarking results reveal that RESPECT has excellent accuracy in estimating the genome length compared to other methods, and can provide critical information regarding the repeat structure of the genome.

This is a PLOS Computational Biology Methods paper.

Introduction

Anthropogenic pressure and other natural causes have resulted in severe disruption of global ecosystems in recent years, including loss of biodiversity [1]. In North America alone, the bird population has declined by over a quarter since 1970 [2]. Simply understanding the scope and extent of bio-diversity changes remains a challenging problem. Genomic sequence based biodiversity sampling provides an attractive alternative to physical sampling and cataloging, as falling costs have made it possible to shotgun sequence a reference specimen sample for at most $10 per Gb (with another $60 for sample prep). However, the analysis typically requires assembling and finishing a reference genome, which can still be prohibitively costly. Despite the many projects aimed at high quality genome sequencing of eukaryotic species [3], it could be many decades before we have acquired high-quality data so that biodiversity measurements for each population can be acquired on an ongoing, routine basis.

While (meta)barcoding [4–6] methods can be used for species identification and biodiversity measurements, they have many drawbacks including limited phylogenetic resolution [7, 8]. Organelle assembly based methods [9–11] similarly cannot be used for populations and often require whole genome sequences but discard the nuclear reads (the vast majority of data). Therefore, there is renewed interest in the development of methods that use all nuclear DNA from genome-skims–low-coverage (0.5–2Gb) sequencing, providing 0.2–4× coverage [12]. The low coverage of skims makes them cost-effective, but insufficient for assembling, and calls for assembly-free methods. Such methods, based on analysis of k-mers are being actively developed [13], and have been used for species identification (Skmer [14]); for phylogenetic placement of a new species not in the library (APPLES [15]), and contaminant filtering (CONSULT [16]). While k-mer analysis works well for species identification, it cannot be applied easily for the analysis of populations (individuals from the same species) using genome-skims, a key component of genomic ecology. Specifically, it ignores the effect of repeats, and uses heuristics to estimate sequencing error and coverage, neither of which is known.

In this manuscript, we revisit the problem of estimating genomic parameters from genome-skim data: specifically, genome length L, sequence-coverage c, and repeat content. From genome-skim data, we have as input, abundance values of k-mers denoted by o, where o_h denotes the number of distinct k-mers of multiplicity h. A key latent variable is the k-mer-repeat-spectrum (denoted hereafter as the k-mer spectrum) of the genome described by r, where r_h denotes the number of distinct k-mers that appear exactly h times in the genome. As the value of o_h depends upon r, c, L, and also on sequencing error, we consider the inverse problem of estimating genomic parameters given o as input. The problem was studied in a seminal paper by Li and Waterman [17] who mostly considered the case of high coverage and no sequencing errors. Williams et al. [18] improved upon this model by ignoring o₁ assuming that a large proportion of unique k-mers can be attributed to sequencing errors. This assumption works better for high coverage because at low coverage, many informative k-mers are also seen only once. Hozza et al. [19] point this out, and focus attention on k-mer spectra. Their method, CovEst, models spectra using a geometric distribution of unknown parameters, uses that parameterized model to estimate both parameters and r₁, r₂, r₃, and improves estimates even for low coverage and high error.

A distinct but related line of research relates to estimating o itself by sub-sampling or streaming reads. Melsted and colleagues [20, 21] describe streaming algorithms to estimate o₁ as well as moments F_k = ∑_i i^k o_i. Interestingly, these moments can also be used to estimate genome parameters. For example, $E [F_{1}] = λ L$ , where λ = (1 − (k − 1)/ℓ)c denotes the k-mer coverage, or the average number of k-mers covering a position derived from reads of length ℓ. We note that streaming is akin to low-coverage sampling and consider the case of estimating parameters over a range of λ.
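The relations in this paragraph can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names and the toy counts are hypothetical, and the length estimate assumes error-free reads so that every observed k-mer comes from the genome.

```python
def kmer_coverage(c, k, read_len):
    """k-mer coverage: lambda = (1 - (k - 1) / read_len) * c."""
    return (1.0 - (k - 1) / read_len) * c


def moment(o, power):
    """F_power = sum_i i**power * o_i, where o maps multiplicity i -> o_i."""
    return sum((i ** power) * o_i for i, o_i in o.items())


def length_from_first_moment(o, lam):
    """Rough length estimate from E[F_1] = lambda * L (assumes no sequencing error)."""
    return moment(o, 1) / lam


# Toy observed counts (hypothetical): o_1 = 600, o_2 = 150, o_3 = 20.
toy_o = {1: 600, 2: 150, 3: 20}
toy_lam = kmer_coverage(c=1.0, k=31, read_len=100)
toy_length = length_from_first_moment(toy_o, toy_lam)
```

With sequencing errors, F_1 also counts erroneous k-mers, so this simple ratio would overestimate L; it is shown only to make the moment identity concrete.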

Estimating genome repetitiveness and other parameters using k-mers

While previous research has emphasized the estimation of genome length and coverage, we focus specifically on estimating the k-mer spectrum r, defined below. Consider a genome of length L. Decompose the genome into a collection of all fixed-length (overlapping) sequences of length k, called k-mers. Let variables r_j (j ≥ 1) denote the number of k-mers that occur exactly j times in the genome. When k is large enough (k ≥ log₄ L), high values of r_j, for j ≥ 2, can be attributed to the repetitive structures in the genome rather than chance similarities. Therefore, we define r = [r₁, r₂, ⋯] as the (k-mer)-repeat-spectrum of the genome.

While the repetitive sequences occur in a variety of arrangements in terms of their multiplicity, complexity and the size of repeating unit, the repeat spectrum provides a valuable summary of the extent of repetition in the genome as well as other parameters. For example, the genome length can be estimated as L = k − 1 + ∑_j jr_j ≃ ∑_j jr_j. Define the uniqueness ratio of a genome as r₁/L, or the ratio of the number of k-mers seen only once to the genome length (which is the total number of k-mers in the genome). We computed the uniqueness ratio for 622 eukaryotic genomes in RefSeq using k = 31 (S1 Fig). The ratio revealed a broad spectrum of values, ranging from 0.287 for A. tauschii (Tausch’s goatgrass) to 0.995 for a mite species, V. jacobsoni (Fig 1A). Expectedly, there is some phylogenetic correlation and the variation of uniqueness ratio within a genus (intra-generic) is significantly lower than inter-generic variation of uniqueness ratios (S2 Fig). At higher taxonomic ranks, we observed that plants had a significantly lower uniqueness ratio compared to other groups (Fig 1B), consistent with a prevalence of whole genome duplication (WGD) events (see Methods). Nevertheless, the correlation is not strong enough to predict uniqueness ratios solely from taxonomy. For example, rice species O. sativa and O. brachyantha have different ratios 0.91 and 0.75, respectively.
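As a concrete illustration of these definitions, the following sketch computes the repeat spectrum, the length identity L = k − 1 + ∑_j jr_j, and the uniqueness ratio for a toy sequence. It is a minimal example under stated assumptions: the function names are hypothetical, k-mers are counted on a single strand (no reverse-complement canonicalization), and the toy string is far too short for the k ≥ log₄ L condition to be meaningful.

```python
from collections import Counter


def repeat_spectrum(genome, k):
    """Return {j: r_j}, the number of distinct k-mers occurring exactly j times."""
    kmer_counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    return dict(Counter(kmer_counts.values()))


def length_from_spectrum(r, k):
    """Genome length via L = k - 1 + sum_j j * r_j."""
    return k - 1 + sum(j * r_j for j, r_j in r.items())


def uniqueness_ratio(r, k):
    """r_1 / L, the fraction of the genome covered by single-copy k-mers."""
    return r.get(1, 0) / length_from_spectrum(r, k)


toy_genome = "ACGTACGTTT"
toy_r = repeat_spectrum(toy_genome, k=3)
toy_L = length_from_spectrum(toy_r, k=3)
toy_ratio = uniqueness_ratio(toy_r, k=3)
```

In the toy sequence, ACG and CGT each occur twice and four other 3-mers occur once, so the identity recovers the sequence length exactly.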

Fig 1 — A: RefSeq plant taxonomy. The species are color-coded based on the uniqueness ratio, from red (highly repetitive) to blue (non-repetitive). B: Uniqueness ratio distribution among four major taxonomic groups of eukaryotes in RefSeq. Plants (green) have significantly lower r₁/L compared to invertebrates (pink), mammals (yellow), and other vertebrates (blue). P-values shown on the figure are the result of statistical tests that the uniqueness ratio is lower among plants compared to other groups. Also, to understand the extent of difference, we tested if the ratios are lower among plants by an X% margin. The results are 5% p-value = 1.1 × 10⁻⁶, 10% p-value = 4.3 × 10⁻⁶, and 10% p-value = 4.2 × 10⁻⁶ when comparing plants against invertebrates, mammals, and other vertebrates, respectively. C: Dot-plot of *V. jacobsoni* genome’s (self)alignment with very few off-diagonal points, and a rapidly decaying repeat spectrum (r₁/L = 0.99). D: Dot-plot of *D. citri*’s highly-repetitive genome marked by many off-diagonal elements and a smoothly decreasing repeat spectrum (r₁/L = 0.51).

The repeat spectrum provides other insights. In genomes composed largely of unique sequences, r₁/L ≃ 1 and r_j values decrease rapidly for j ≥ 2 with log(r₁/r₅) ≥ 4.5 (Fig 1C). On the other hand, genomes with higher repetitive content have a smoother decrease of r_j values (Fig 1D) with log(r₁/r₅) ≤ 2.5 (S3 Fig). Additionally, a genome that has duplicated very recently will have r₁ ≃ 0 and a very high value of r₂. Over time, however, r₁ increases due to the accumulation of mutations. Similarly, r_j > 0 for large values of j suggests the presence of interspersed repeats.

Our method RESPECT (Repeat Spectrum identification) derives genomic length and coverage from low-coverage genome skims, while also providing insight into the repeat structure. We showed, through a mix of theoretical reasoning and empirical evidence, that the k-mer repeat spectra estimation problem is fundamentally difficult because of severe ill-conditioning of the system. In fact, the spectra are hard to estimate even when the coverage and sequencing error rate are known. We resolve this problem for the case of known coverage and sequencing error by imposing constraints on r_h and solving a constrained optimization problem. This approach provides greatly improved estimates of r, which in turn lead to even better estimation of coverage, genome length and sequencing error through a stochastic iteration method. Results on genomes sampled from different parts of the tree of life and with differing repeat structures illustrate the validity of our approach.

Results

A simple model for estimating repeat spectra from unassembled data performs poorly

Assume that reads in the genome-skim are sequenced with a fixed mean error rate of ϵ per bp, and that the read start positions follow a Poisson distribution with a mean coverage of λ per bp. Denote the observed k-mer data as the vector o = [o₁, o₂, ⋯], where o_h denotes the number of k-mers observed exactly h times in the genome-skim input. The value o_h is the outcome of a random variable O_h that depends upon the parameter set Φ = {λ, ϵ, r} (See Methods: ‘Modeling genomic parameters’). Specifically, we assume that each k-mer with copy number j in the genome is sampled h times according to a Poisson distribution with rate dependent upon k, Φ. Let P_hj represent the probability of h observances of a k-mer with copy number j. Then, in expectation,

$$E [O] = r P^{T} + 1_{h = 1} E \qquad (1)$$

where E is the expected number of erroneous k-mers that in turn depends upon Φ. Φ could be estimated using:

$$\Phi = \arg\min_{\Phi} \| o - E [O] \| = \arg\min_{\Phi} \| o - (r P^{T} + 1_{h = 1} E) \| \qquad (2)$$

In principle, an iterative procedure could be used to solve the optimization; we start with initial estimates of λ and ϵ, and use them to compute P and E. Then, we can use the least-square (LS) method to find r which minimizes ‖o − (rP^T + 1_h=1 E)‖ (Eq 2) (See Methods: ‘Least-squares estimate of repeat spectrum’).
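The forward model in Eq 1 can be sketched as follows. This is an illustration rather than the RESPECT implementation: it assumes a k-mer with copy number j is observed with Poisson rate jλ (the error-free case; the exact rate used by the authors is given in Methods), and the names `expected_counts` and `err_kmers` are hypothetical.

```python
import math


def poisson_pmf(h, rate):
    """Probability of exactly h events under a Poisson distribution."""
    return math.exp(-rate) * rate ** h / math.factorial(h)


def expected_counts(r, lam, n_obs, err_kmers=0.0):
    """E[O_h] = sum_j r_j * P_hj + 1_{h=1} * E, for h = 1..n_obs.

    r is the list [r_1, r_2, ...]; P_hj is assumed to be Poisson(h; j * lam);
    err_kmers plays the role of E, the expected number of erroneous k-mers,
    which only contributes to multiplicity h = 1.
    """
    expected = []
    for h in range(1, n_obs + 1):
        total = sum(r_j * poisson_pmf(h, j * lam) for j, r_j in enumerate(r, start=1))
        if h == 1:
            total += err_kmers
        expected.append(total)
    return expected


# Toy spectrum (hypothetical values): mostly unique k-mers, a few repeats.
toy_expected = expected_counts([1000.0, 100.0, 10.0], lam=0.7, n_obs=5)
```

Estimating Φ then amounts to searching for the λ, ϵ, and r whose predicted counts are closest to the observed vector o, as in Eq 2.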

To study the accuracy of this model for repeat spectra estimation, we simulated genome skims at 1X coverage with no sequencing errors (E = 0) for all 622 genomes in RefSeq in four major taxonomic groups of eukaryotes. A subset of 66 species was selected as the test set. The test genomes were sampled such that their uniqueness-ratio (r₁/L) values matched the distribution of uniqueness-ratios of all 622 RefSeq genomes (S4 Fig, see Methods: ‘Comparing r1/L distribution over different sets’). In the following text, all parameters were trained on the 556 training genomes, and all test results are shown on the 66 test genomes.

For a baseline test, we assumed that the coverage λ was known, so that r could be estimated by minimizing ‖o − rP^T‖₂ (Eq 2). Using an LS solver (see Methods: ‘Least-squares estimate of repeat spectrum’), we obtained highly accurate estimates of r₁ on the test data (Fig 2A; LS method). However, even in this simple case with perfect knowledge of coverage and no sequencing error, the error in estimating r_j increased rapidly with increasing j, as the LS solution was often sparse and set r_j = 0 for many values of j, contrary to the true values in the genome.

Fig 2 — A: The relative error in estimating repeat spectra using Least-Squares (LS), constrained Linear Programming (LP), and Spline Linear Programming (SLP). The genome-skims are simulated at 1X with no sequencing error. B: Correlation between true r₂/r₁ ratios, and our estimates of r₁/∑_i=1 r_i for each genome. C: Similar correlation plot between true r₃/r₂ and estimated r₂/∑_i=2 r_i. In both B and C, true spectral ratios on Y axis are computed from the assemblies, and the estimated indices on X axis are obtained by applying the LP method to the simulated skims described in A.

Empirical and theoretical results showed that the poor performance could be attributed to severe ill-conditioning. We proved that the condition number of P grows exponentially with the number of spectra (see S1 Appendix). Therefore, small changes in o relative to $E [O]$ (Eq 1), for example due to the sampling variability or the simplifying assumptions of the model, led to very large errors in estimates of r.
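The growth of the condition number can be observed numerically. The sketch below is illustrative only: it assumes P_hj is the Poisson probability of h observations at rate jλ with λ = 1 (an arbitrary choice), uses NumPy's `numpy.linalg.cond`, and the helper name `sampling_matrix` is hypothetical. The formal statement and proof are in S1 Appendix.

```python
import math

import numpy as np


def sampling_matrix(lam, n):
    """Square matrix with P[h-1, j-1] = Poisson(h; j * lam) for h, j = 1..n."""
    P = np.empty((n, n))
    for h in range(1, n + 1):
        for j in range(1, n + 1):
            rate = j * lam
            P[h - 1, j - 1] = math.exp(-rate) * rate ** h / math.factorial(h)
    return P


sizes = [2, 4, 6, 8]
conds = [float(np.linalg.cond(sampling_matrix(1.0, n))) for n in sizes]
for n, cond in zip(sizes, conds):
    # The 2-norm condition number bounds how much relative noise in o
    # can be amplified in the least-squares solution for r.
    print(f"n = {n}: cond(P) = {cond:.3e}")
```

A large condition number means that sampling noise of a fraction of a percent in o can translate into errors of many multiples of the true value in the higher-order r_j.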

Overview of RESPECT algorithm

The negative result suggested a fundamental limitation to the use of k-mer based methods for estimating repeat spectra. Regularization is a proposed remedy for ill-conditioned matrices. However, most regularization methods enforce sparsity and r is known to be not sparse. A second challenge is that both observed counts and k-mer spectra are very skewed towards lower indices. Thus, a small (even 1%) relative error in r₁ could lead to a larger error in r_j for j > 1. To get around the ill-conditioning problem, we focused on constraining possible values of r. We observed empirically that the ratio of consecutive spectral values, r_{j+1}/r_j, was tightly constrained. Fig 2B traces r₂/r₁ as a function of $\frac{r_{1}}{\sum_{i \geq 1} r_{i}}$ on the training data and shows the tight correlation across all taxonomic groups. A similar, albeit less tight, constraint was observed for r₃/r₂ (Fig 2C) and other values as well (S5, S6 and S7 Figs).

These ideas provided the basis of a constrained linear-program for estimating r. As a first step, we added the constraint that $L_{j} \leq \frac{r_{j}}{r_{j + 1}} \leq U_{j}$ for each j, where $L_{j}$ and $U_{j}$ are the smallest and the largest $\frac{r_{j}}{r_{j + 1}}$ ratios over the training genomes, and solved the following LP to find r (see Methods: ‘Linear programming for constrained optimization based estimates’)

$$r = \arg\min_{r} E = \arg\min_{r} \sum_{h = 2}^{n} \left| o_{h} - \sum_{j = 1}^{n} P_{h j} r_{j} \right| \qquad (3)$$

This approach significantly improved the average error in estimating the spectra at multiplicity j = 3 and higher (Fig 2A; LP method), and resulted in small improvement at j = 1, 2 as well.
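For readers unfamiliar with how an absolute-value objective such as Eq 3 becomes a linear program, one standard reformulation is sketched below. The auxiliary variables $t_{h}$ are introduced here for illustration and are not part of the original notation; the ratio constraints are multiplied through by $r_{j+1}$, which assumes $r_{j+1} > 0$. The exact formulation used by RESPECT is described in Methods.

```latex
\begin{aligned}
\min_{r,\,t}\quad & \sum_{h=2}^{n} t_{h} \\
\text{s.t.}\quad
  & -t_{h} \;\le\; o_{h} - \sum_{j=1}^{n} P_{hj}\, r_{j} \;\le\; t_{h},
  && h = 2, \dots, n, \\
  & L_{j}\, r_{j+1} \;\le\; r_{j} \;\le\; U_{j}\, r_{j+1},
  && j = 1, \dots, n-1, \\
  & r_{j} \ge 0, \quad t_{h} \ge 0 .
\end{aligned}
```

At the optimum each $t_{h}$ equals $| o_{h} - \sum_{j} P_{hj} r_{j} |$, so the two problems have the same minimizer in r, and every constraint is linear in the unknowns.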

Using the repeat spectra from 556 training genomes, we observed a strong correlation between r₂/r₁ and r₁/∑_i≥1 r_i (Fig 2B). Therefore, we estimated r₂/r₁ by using the LP estimate of r₁/∑_i≥1 r_i and a spline fitted on the training data based on a generalized additive model [22, 23] (see Methods: ‘Spline Linear programming’). The estimated r₂/r₁ value and the LP estimated r₁ value provided a new estimate (named SLP) of r₂. In a similar fashion, we computed SLP estimates of r_j+1 from the LP estimates of r_j and r_j/∑_i≥j r_i for j = 2, 3, 4, 5 (Fig 2C, S5, S6 and S7 Figs, and Methods: ‘Spline Linear programming’). Using the additional information learned from the training genomes captured by the fitted splines, we obtained a significant reduction in the average error of repeat spectra estimation (SLP vs. LP in Fig 2A). To solve the full optimization problem in Eq 2, we used a simulated annealing procedure. Specifically, starting with initial estimates of parameters obtained under a no-repeat assumption, at each iteration a new value for λ is proposed, and the SLP method is used to estimate r. If a candidate λ results in a reduction in error, the algorithm accepts the move. Moreover, to avoid getting stuck at local minima, moves to states with higher error are also occasionally accepted. Lastly, the initial estimate of ϵ is corrected for the repetitiveness of the genome using a regression learned over a subset of training genomes (S8 Fig). The algorithm is outlined below (also see Methods: ‘RESPECT algorithm’ for a detailed description).

1. Generate initial estimates of λ, ϵ, and r.
2. Compute the initial values of P and error function $E$.
3. For t = 1, ⋯, N repeat:
   3.1. Choose λ_next randomly within a neighborhood of current λ, and compute P_next.
   3.2. Solve for r_next using SLP method.
   3.3. Use P_next and r_next to compute $E_{next}$.
   3.4. Set λ ← λ_next, $E \leftarrow E_{next}$, and r ← r_next with probability $\min \{1, \exp (- (E_{next} - E) t / N)\}$.
4. Correct the initial estimate of ϵ, and update λ.
5. Output c = λℓ/(ℓ − k + 1), r, L = B/c, and ϵ at the end of iterations (B is the total amount of nucleotides sequenced).
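The acceptance rule and the final conversion from λ to coverage and length can be sketched as below. This is a simplified illustration, not the RESPECT code: the SLP solve and error computation are abstracted into a caller-supplied `error_fn(lam)`, the neighborhood is assumed to be a uniform window of half-width `step`, the ϵ correction step is omitted, and all function and parameter names are hypothetical.

```python
import math
import random


def anneal_lambda(error_fn, lam0, n_iter, step, rng):
    """Search over lambda using the acceptance rule min{1, exp(-(E_next - E) t / N)}."""
    lam, err = lam0, error_fn(lam0)
    for t in range(1, n_iter + 1):
        lam_next = lam + rng.uniform(-step, step)  # assumed neighborhood shape
        if lam_next <= 0:
            continue  # coverage must stay positive
        err_next = error_fn(lam_next)  # stands in for: compute P_next, solve SLP, score
        delta = err_next - err
        # Improvements are always accepted; worse states are accepted with a
        # probability that shrinks as t approaches N.
        if delta <= 0 or rng.random() < math.exp(-delta * t / n_iter):
            lam, err = lam_next, err_next
    return lam, err


def finalize(lam, k, read_len, total_bases):
    """Convert k-mer coverage to c = lam * l / (l - k + 1) and L = B / c."""
    coverage = lam * read_len / (read_len - k + 1)
    return coverage, total_bases / coverage


# Toy error surface (hypothetical) with its minimum at lambda = 1.3.
toy_error = lambda lam: 1e6 * (lam - 1.3) ** 2
best_lam, best_err = anneal_lambda(toy_error, lam0=0.5, n_iter=2000, step=0.1,
                                   rng=random.Random(0))
coverage, length = finalize(lam=0.7, k=31, read_len=100, total_bases=1_000_000_000)
```

Because the exponent is scaled by t/N, the search tolerates uphill moves early on and becomes close to greedy near the end. The scale of the error function matters: with very small error values almost every move would be accepted.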

Estimating genome lengths

We applied RESPECT and CovEst to simulated genome-skims–Illumina reads sampled from the 66 test genomes skimmed at 1X coverage with 1% sequencing-error rate–and compared their relative error in the estimation of r₁ through r₅ and genome length (Fig 3), after their convergence (see S9–S14 Figs for the convergence of RESPECT’s estimates). The median RESPECT error in estimating r₁ was less than 1.5% (average: 2.9%), while the median error of CovEst was 15% (average: 34%). The error profile extended to higher multiplicities, where, as noted earlier, CovEst used a parametric model. The tight relation between r₁ and r₂ and the large absolute differences between the two values implied that a small error in r₁ would translate into a large relative error for r₂, which is what we observed. Similarly, the RESPECT estimates of genome length were highly accurate with median error 2.2% (average: 4.1%), in contrast to 27% (average: 40%) for CovEst (Fig 3B). RESPECT estimates were better than CovEst in 62 out of 66 species, often by considerable margins (Fig 3C). For example, in 54/66 species, RESPECT error was less than 5%, while CovEst error exceeded 50% in one third of test genomes. In fact, CovEst severely underestimates the length for these genomes (S15 Fig). For 18/66 test genomes, the CovEst estimate was less than the true length by a factor of 4 or higher (S16 Fig). RESPECT relies on models trained using available assemblies. We tested if the performance depended on the amount of training data and the taxonomic composition of the training data. RESPECT performance remained robust in these scenarios (S17(A) Fig). Moreover, its performance improved slightly (had fewer outliers) with additional training data (S17(B) Fig).

Fig 3 — A: Comparing the error of RESPECT and CovEst in estimating the repeat spectrum. The first 5 spectra are shown. B: The distribution of error in CovEst and RESPECT. The absolute value of relative error in genome length estimation is used (in logarithmic scale). C: Per-genome error of RESPECT and CovEst in estimating the genome length of 66 species with genomes skimmed at 1X coverage.

We repeated the same experiment at sequence level coverage of 0.5X, 2X, and 4X (S18 Fig). At 0.5X coverage, the median error of RESPECT was 16% (average: 18%), while CovEst had 88% median error (average: 75%) and underestimated the length by a factor of 8 or more in half of the species (S19 Fig). CovEst performance improved at higher coverage but RESPECT continued to have lower error (S20 Fig). At 4X, CovEst had median error 3.3% (average: 7.6%), while RESPECT median error was < 1% (average: 1.9%). Moreover, CovEst error exceeded 10% in a third of species, while RESPECT had < 10% error in 64/66 species (S21 Fig).

We also compared the performance of RESPECT among different taxonomic groups. In general, plants and invertebrates had higher error rates compared to both vertebrate groups (S22 Fig), consistent with their lower uniqueness ratios (Fig 1B). In fact, we observed a statistically significant negative correlation between the estimation error and the uniqueness ratio (S23 Fig). We additionally tested RESPECT on simulated genome-skims at 1X coverage from 10 bacterial genomes, and the results did not suggest any bias against prokaryotic genomes (S24 Fig), despite the fact that we trained our model on eukaryotic genomes.

Estimating genome length using sequenced short reads

A key difference between sequenced reads versus simulated reads is the presence of ‘contaminants’ or reads from non-target species. Differences may also include presence of adapter sequences, duplications of reads from the sequencing platform, lower or higher sequencing error rates due to DNA quality, and length variation of reads. Therefore, we tested RESPECT in genome-skims obtained from NCBI’s Sequence Read Archive (SRA) database [24]. We downloaded high-coverage raw reads from 29 test species (from all four major taxonomic groups of eukaryotes in RefSeq) including highly repetitive plant genomes, and compared the results with the corresponding genome assemblies of the same data. After preprocessing the raw reads using BBTools [25] to remove adapter sequences and duplicate reads, we used Kraken [26] to remove contaminant reads with microbial or human origin (see Methods: ‘SRA preprocessing and contamination filtering’). We note that this is an imperfect process as these tools work only when the contaminating organisms have a highly related member in the reference databases [27]. We discarded 10 samples because > 40% of reads (after removing adapters) were either duplicates of other reads, or came from external DNA sources (Table A in S1 Appendix). For the remaining 19 samples, duplicates and reads classified as contaminant were removed, and unclassified reads were sub-sampled to 1X coverage. In 16 out of 19 samples, RESPECT error was less than 11% (median: 4%), including highly repetitive genomes such as A. tauschii (r₁/L = 0.29), Z. mays (maize) (r₁/L = 0.32), S. salar (salmon) (r₁/L = 0.48), and N. tabacum (r₁/L = 0.57), where the abundance of repeats made the length estimation challenging (Fig 4, Table 1). In contrast, CovEst had less than 30% error in only 4 samples (median error 80%) (Fig 4). 
For the highly repetitive genomes, CovEst length estimates ranged from 1/11 to 1/7 of the assembled sequence lengths or 10 to 30 times larger error compared to RESPECT (see Table 1). In 3 samples, RESPECT had relatively high errors. For SRR085103 (domestic ferret), 99.9% of the reads did not in fact map to the available reference assembly of the domestic ferret M. putorius. Together with the relatively low percentage of duplication (9%), the data suggest a mislabeling of the sample species. For Coquerel’s sifaka (P. coquereli), we observed a large gap between the total sequence length (2.8 Gbp) and the total ungapped length (2.1 Gbp) of the assembly, suggesting some challenges with the assembly. Cape elephant shrew (E. edwardii) was the last sample, where the RESPECT length estimate of 4.5 Gbp exceeded the RefSeq (GCF_000299155.1) assembly length (3.8 Gbp) by over 10%. Interestingly, the uniqueness ratio of the assembly was r₁/L = 0.72, which contrasted with the RESPECT estimated uniqueness ratio of r₁/L = 0.65 from the short-read data. Upon investigation, we found that a more recent assembly for E. edwardii (GCA_004027355.1), not yet in RefSeq, had an assembled length equal to 4.3 Gbp, with r₁/L = 0.66, matching the RESPECT estimates (4.5 Gbp, 0.65, respectively). The difference between total sequence length and ungapped length in GCA_004027355.1 was only 1 Mbp, in contrast to > 500 Mbp for GCF_000299155.1. Together, these data suggest that GCA_004027355.1 better assembles repetitive regions, and the RESPECT length estimation error was < 5%, despite using only 1X coverage.

Fig 4 — Comparing the error of CovEst and RESPECT. High coverage SRA were preprocessed and later downsampled to 1X coverage. Both methods are applied to genome skims (after preprocessing) and the absolute values of the relative error in estimating the genome lengths are compared.

Table 1. Comparing RESPECT and CovEst accuracy on SRA’s of highly repetitive genomes.

The numbers in parentheses are the percentage errors.

Species	A. tauschii (goat grass)	Z. mays (maize)	S. salar (salmon)	N. tabacum (tobacco)
r₁/L	0.29	0.32	0.48	0.57
Assembly length (Gbp)	4.3	2.1	3.0	3.6
RESPECT	3.9 (-10.7%)	2.0 (-8.2%)	2.8 (-4.9%)	3.7 (2.6%)
CovEst	0.4 (-90%)	0.2 (-90%)	0.3 (-90%)	0.5 (-86%)

The role of WGD versus high copy repeat elements in shaping genome repeat structure

Predicting polyploidy and recent WGD is challenging because mutation and gene loss after a WGD event can reduce the polyploidy signal. Specifically, a WGD event results in the uniqueness ratio (r₁/L) becoming 0. Subsequently, as mutations accumulate, r₁/L ratio moves gradually towards 1 in a process that may be specific to the species, and hard to predict. Nevertheless, it should be skewed toward smaller values for recent WGD events. Independently, the presence of high copy repeats due to DNA transposons and retrotransposons can lead to very high copy numbers of a small set of oligomers. To capture the contribution of high copy repeat elements, we defined the ‘High Copy Repeats per Million (HCRM)’ value as the average count (per million base-pairs) of the 10 most highly repetitive k-mers. HCRM values varied across the species, ranging from 2 to 3738 among our set of 622 RefSeq genomes (S25 Fig). We observed some correlation between HCRM values of species of the same genus, especially among vertebrates (S26 Fig). However, similar to the case of uniqueness ratios, the phylogenetic signal was not pronounced enough to predict HCRM based on the taxonomy.
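The HCRM definition can be made concrete with a short sketch. It reflects one reading of the definition above: the mean count of the 10 most abundant k-mers, divided by the genome length in millions of base pairs. The function name and the toy counts are hypothetical.

```python
def hcrm(kmer_counts, genome_length, top=10):
    """High Copy Repeats per Million.

    kmer_counts maps each k-mer to its copy number in the genome;
    genome_length is in base pairs.
    """
    top_counts = sorted(kmer_counts.values(), reverse=True)[:top]
    if not top_counts:
        raise ValueError("kmer_counts must not be empty")
    mean_count = sum(top_counts) / len(top_counts)
    return mean_count / (genome_length / 1_000_000)


# Toy example: 12 distinct k-mers with copy numbers 1..12 in a 2 Mbp genome.
toy_counts = {f"kmer{i}": i for i in range(1, 13)}
toy_hcrm = hcrm(toy_counts, genome_length=2_000_000)
```

Here the ten largest counts are 3 through 12, with mean 7.5, so the value is 7.5 per 2 Mbp, or 3.75 per million base pairs.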

Analytical calculations showed that the probability of high HCRM values ≥200 in a genome with a random set of k-mers was negligibly small (P ≤ 10⁻¹⁰⁰) (See Methods: ‘Statistical analysis of the repeat structure’), suggesting that high HCRM values could not be explained solely by WGD events, and were likely due to high copy (transposon) repeats. Fig 5 shows the (r₁/L, HCRM) value of 622 genome-skims, which tightly matched the true values computed from assembled genomes (S27 Fig). To analyze the r₁/L and HCRM values of genomes with recent WGD, we compiled a partial list of species with known WGD events within the last 150M years based on the available literature [28–30] (See Methods: ‘Selecting species with known recent WGD events’ and Table B in S1 Appendix).

Fig 5 — Most genomes with known recent WGD events had r₁/L < 0.8 and HCRM < 200. The y-axis is in a logarithmic scale. HCRM values are computed from genome-skims simulated at 1X coverage with no sequencing error. Some of the species with a recent WGD are labeled by their common names.

Species with known recent WGD events had expectedly low r₁/L. For example, only 14% of species with recent WGD had r₁/L values ≥ 0.8, in contrast with 64% of all species that had r₁/L values higher than 0.8. Surprisingly, 93% of species with recent WGD also had low HCRM values (≤ 200) (Fig 5), and there was a strong association between the occurrence of recent WGD events and the (r₁/L, HCRM) values (p-value: 1.8 × 10⁻²³; See Methods: ‘Statistical analysis of the repeat structure’). Our results suggest that genomes with low HCRM and r₁/L are strong candidates for WGD events.

Discussion

In this paper, we revisited the problem of estimating genomic parameters (length, sequence coverage, k-mer spectra) based on low coverage shotgun sequencing data. The problem has been studied previously and was considered challenging due to the need for simultaneous inference of coverage and sequencing errors along with the k-mer spectra. However, our results suggest that the problem remains challenging even when there is no error and the coverage is known. This is due to two factors: (a) the linear system is ill-conditioned, so that a small change in the k-mer counts due to random sampling can lead to large changes in the estimated k-mer spectra; and (b) values in the k-mer spectra show a skewed and non-sparse distribution, where r₁ dominates; r₁ is important for length estimation, but controlling for small errors in r₁ leads to larger errors in the other r_h values. We provide evidence of both, but future work will clarify the importance of each facet of the identification.

Proposed solutions for ill-conditioning use regularization but those methods generally enforce sparse solutions. However, the true k-mer distribution is not sparse. Our work resolved this issue through an empirical estimation of k-mer ratios based on finished genomes. This approach is viable given the many finished genomes with different repeat characteristics. Our study, with 622 genomes of which around 10% were isolated for testing, is the largest empirical study of its kind.

As expected, accurately estimated k-mer spectra led to better estimation of genomic parameters such as length, with RESPECT performing significantly better than the previous best method, sometimes by orders of magnitude. Our results also have lower variance than those of other methods.

As coverage increases, all methods perform well. However, at coverage 8X and higher, partial assemblies are possible and small contigs can start to be assembled. In those cases, alternative methods to estimate genome lengths may be possible, but our methods work well even for 0.5X coverage.

We used every genome for which the assembled sequence and the raw reads were available at the time of submission. New data have since been released, and we tested our method on 10 additional samples with very similar performance (S28 Fig).

The presence of contaminants is a significant barrier to accurate estimation, and in fact is challenging even for assembling the data. As data sampling and DNA extraction methods improve, this problem will likely become less severe. In parallel, we are also working to improve computational approaches to removing contamination.

While most k-mer based statistics were developed as a first step prior to deep sequencing and assembly, they may have an important role to play in independent analysis of genomes. Many genomes are 1 Gb or smaller. Therefore, acquiring genome-skims for a majority of organisms and even multiple individuals in a population is a feasible goal. Methods that work on these reduced representations can be transformative for studying dramatic and short-term changes in bio-ecology. We can envision technologies where a sampled individual’s genome-skim can be used to quickly estimate its genome-length, repeat structure, remove contaminating reads, identify the organism or place it confidently in the tree of life, and finally, identify the robustness of a population through analysis of heterozygosity. Our paper contributes to the first step of this vision.

Methods

Comparing r1/L distribution over different sets

To compare two sets of values and see if the values in one set are greater than those in the other set, we used the Mann–Whitney U test. Formally, if X and Y are random samples from two populations, the test statistic, U, is given by the number of pairs $(x, y) \in X \times Y$ for which x is greater than y. The Mann–Whitney U test is non-parametric and does not restrict the samples to be from a certain family of distributions. The test also allows the user to specify a location shift μ and examine the alternative hypothesis that X − Y > μ. By gradually increasing μ and computing the p-value, we can understand the extent of difference between X and Y.

To test if two sets of numbers are drawn from the same distribution, we used the two-sample Kolmogorov–Smirnov (KS) test. The test statistic is a distance between the empirical distribution functions of the samples from the two sets. We used the R ‘stats’ package [31] to compute the p-values for both tests.
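Both tests are also available outside R. The following is a minimal Python sketch using SciPy in place of the R ‘stats’ package; the helper name `shifted_mannwhitney` and the sample values are illustrative assumptions, not data from this study. The location shift μ is applied by subtracting it from the first sample before running the one-sided test.

```python
import numpy as np
from scipy.stats import ks_2samp, mannwhitneyu


def shifted_mannwhitney(x, y, mu=0.0):
    """One-sided Mann-Whitney U test of the alternative X - Y > mu."""
    x = np.asarray(x, dtype=float) - mu  # apply the location shift to X
    return mannwhitneyu(x, np.asarray(y, dtype=float), alternative="greater")


# Made-up samples for illustration only.
rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=0.2, size=200)
y = rng.normal(loc=0.5, scale=0.2, size=200)

p_no_shift = shifted_mannwhitney(x, y, mu=0.0).pvalue   # is X greater than Y?
p_big_shift = shifted_mannwhitney(x, y, mu=1.0).pvalue  # is X greater than Y + 1?
ks_pvalue = ks_2samp(x, y).pvalue                       # same distribution?
```

Scanning μ over a grid and recording where the p-value crosses the chosen threshold gives a rough measure of how far apart the two sets are.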

Modeling genomic parameters

We consider k-mers in a genome of length L and assume that k ≫ log₄ L so that any k-mer is unlikely to appear more than once, unless it is part of a repeated sequence. Denote the (unknown) k-mer spectrum of a genome that contains repeats using r, where r_j describes the number of distinct k-mers that appear exactly j times in the genome.

The genome is shotgun sequenced using reads of length ℓ with average sequencing depth c. The total number of nucleotides sequenced is given by B = cL. As there are ℓ − k + 1 k-mers in each read, the k-mer coverage is given by

\begin{matrix} λ = (1 - (k - 1) / ℓ) c = \frac{(1 - (k - 1) / ℓ) B}{L} . \end{matrix}

(4)

Let o denote the histogram of observed k-mer counts. The observed number of k-mers of abundance h, o_h, can be thought of as a sample allocation to random variable O_h, whose expected value, $m_{h} = E [O_{h}]$ , depends upon r, λ, L, and also on sequencing error. We assume that any base-pair is sequenced erroneously with probability ϵ, and sequencing errors only result in novel k-mers. We further assume that the number of times a unique k-mer repeated j times is sampled follows a Poisson distribution with rate λj(1 − ϵ)^k. Therefore

\begin{matrix} m = r P^{T} + 1_{h = 1} E, \end{matrix}

(5)

where $P_{h j} = e^{- j λ {(1 - ϵ)}^{k}} \frac{{(j λ {(1 - ϵ)}^{k})}^{h}}{h!}$ denotes the probability that a k-mer repeated j times in the genome is observed with count h in the genome skim, 1_h=1 = [1, 0, 0, …], and E = Lλ(1 − (1 − ϵ)^k) is the expected number of erroneous k-mers. As λ and L are connected through Eq 4, we choose λ as the independent variable and consider L as a function of λ. Under this model, we would like to estimate $(r, λ, ϵ) = arg {min}_{r, λ, ϵ} E (P, r, ϵ, o)$ , where $E$ is a weighted p-norm of the difference between expected and observed counts

E_{w, p} (P, r, ϵ, o) = {(\sum_{h} w_{h} | m_{h} - o_{h} |^{p})}^{1 / p} = {(\sum_{h} w_{h} | {(r P^{T} + 1_{h = 1} E)}_{h} - o_{h} |^{p})}^{1 / p} .

(6)

Note that the optimization is non-trivial because P and E are functions of (r, λ, ϵ), and must be simultaneously estimated.
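The model in Eqs 4–6 can be written down compactly. The sketch below is an illustrative Python/NumPy rendering, not the RESPECT implementation; the function names and the example spectrum are assumptions made for this example, and the genome length is taken as the sum of j·r_j over the spectrum.

```python
import numpy as np
from scipy.stats import poisson


def kmer_coverage(c, read_len, k):
    """Eq 4: k-mer coverage from base coverage c and read length."""
    return (1.0 - (k - 1) / read_len) * c


def poisson_matrix(lam, eps, k, n_obs, n_rep):
    """P[h-1, j-1]: probability that a k-mer with j genomic copies is seen h times."""
    h = np.arange(1, n_obs + 1)[:, None]
    j = np.arange(1, n_rep + 1)[None, :]
    return poisson.pmf(h, j * lam * (1.0 - eps) ** k)


def expected_counts(r, lam, eps, k, genome_len, n_obs):
    """Eq 5: m = r P^T, plus the erroneous k-mers added to the h = 1 bin."""
    P = poisson_matrix(lam, eps, k, n_obs, len(r))
    m = P @ np.asarray(r, dtype=float)
    m[0] += genome_len * lam * (1.0 - (1.0 - eps) ** k)
    return m


def weighted_error(m, o, w, p=1):
    """Eq 6: weighted p-norm of expected minus observed counts."""
    return float(np.sum(w * np.abs(m - o) ** p) ** (1.0 / p))


# Hypothetical three-bin repeat spectrum, for illustration only.
r = np.array([900.0, 40.0, 5.0])
genome_len = float(np.sum(np.arange(1, 4) * r))
lam = kmer_coverage(c=1.0, read_len=100, k=31)
m = expected_counts(r, lam, eps=0.01, k=31, genome_len=genome_len, n_obs=10)
w = np.r_[0.0, np.ones(9)]          # w_1 = 0 ignores the error-dominated first bin
err = weighted_error(m, m, w)       # zero when observed equals expected
```

Because P depends on λ and ϵ, any change in those parameters requires rebuilding P before r can be re-estimated, which is what makes the joint problem non-trivial.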

A generic iterative optimization for parameter estimation

The dimensions of o and r in Eq 5 are determined entirely by data and are not necessarily identical. However, we truncated both to a common dimension n = 50 for computational expediency. A generic optimization method could be described as below.

1. Generate initial estimates of λ, ϵ, L.
2. Solve for r using Eq 6.
3. Use estimated r and grid-search to re-estimate λ, ϵ.
4. Repeat step 2 onwards until the error has converged.

Step 2 is the key step in this procedure, and we devised a number of approaches to solve it.

Least-squares estimate of repeat spectrum

Choosing p = 2 (Euclidean norm) and w_h = 1, ∀h in Eq 6, the problem is turned into a Least-Squares (LS) optimization. To test an LS method for estimating r, we considered the simplest sequencing-error-free case (ϵ = 0), where coverage λ was known. Therefore, $E [O] = m = r P^{T}$ , where P is an n × n matrix with

P_{h j} = e^{- j λ} \frac{{(j λ)}^{h}}{h!} .

We showed (S1 Appendix) that P is non-singular and in the error-free case, it should be possible to use the estimate r^(est) = oP^−T. However, we observed that its effective rank was very small as Λ, E each have rapidly diminishing eigenvalues. Therefore, instead of decomposing P and explicitly computing P⁻¹, we used the non-negative least squares (NNLS) method [32] to solve

r^{(est)} = arg min_{r} {‖ o - r P^{T} ‖}_{2} .

We used the nnls function from the optimize module of SciPy [33]. Unfortunately, the LS estimates were very unreliable and showed high error. In fact, we proved, for λ = 1 (see S1 Appendix), that

\mathrm{cond} (P) \geq \frac{2^{n}}{n} .

The condition number grows exponentially with n, suggesting a highly ill-conditioned matrix P where small changes in o from the expected values m would lead to large errors in the estimate of r. For these reasons, we adopted constrained optimization methods to solve for r.
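The ill-conditioning is easy to observe numerically. The following sketch, under the error-free assumptions above (ϵ = 0, λ = 1), builds a small P, computes its condition number with NumPy, and runs SciPy's `nnls` on noise-free counts. The dimension n = 6 and the spectrum `r_true` are hypothetical values chosen only for illustration.

```python
import numpy as np
from scipy.optimize import nnls
from scipy.stats import poisson


def error_free_matrix(lam, n):
    """n x n matrix with P[h-1, j-1] = exp(-j*lam) * (j*lam)**h / h!."""
    h = np.arange(1, n + 1)[:, None]
    j = np.arange(1, n + 1)[None, :]
    return poisson.pmf(h, j * lam)


n = 6
P = error_free_matrix(lam=1.0, n=n)
cond_number = np.linalg.cond(P)     # 2-norm condition number

# Hypothetical repeat spectrum, used only to generate noise-free counts.
r_true = np.array([1e6, 5e4, 1e4, 5e3, 2e3, 1e3])
o = P @ r_true

# Solve min_r || o - P r ||_2 subject to r >= 0.
r_est, residual = nnls(P, o, maxiter=1000)
```

Replacing `o` with a Poisson-perturbed copy and re-running `nnls` is a simple way to see how sampling noise propagates into the estimate of r as n grows.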

Linear programming for constrained optimization based estimates

We used Eq 6 with w = [0, 1, 1, …, 1] and p = 1 to design a Linear programming estimate of r as:

\begin{matrix} min_{r} \sum_{h = 2}^{n} | o_{h} - \sum_{j = 1}^{n} P_{h j} r_{j} |, \end{matrix}

(7)

such that

L_{h} \leq \frac{r_{h}}{r_{h + 1}} \leq U_{h}, h = 1, 2, \dots, n - 1

The rationale behind setting w₁ = 0 was that o₁ contains a large number of erroneous k-mers, so we exclude it from the objective function and use the rest of the bins to estimate r. As ϵ is not known in general, o₁ was used to estimate the (average) sequencing error rate, and subsequently the k-mer coverage λ.

The lower and upper bounds on $\frac{r_{j}}{r_{j + 1}}$ were determined based on the distribution R_j of spectral ratios in 556 training genomes, and therefore we only search for candidate solutions r that satisfy the constraints. Specifically, we profiled the repeat spectra of the training genomes and set $[L_{j}, U_{j}]$ equal to the empirical support of R_j distribution, i.e., $L_{j}$ and $U_{j}$ are the smallest and the largest samples observed from R_j over the training genomes. We use Gurobi Optimizer [34] to solve the constrained optimization problem formulated in Eq 7.
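The constrained problem in Eq 7 is a linear program once each absolute value is replaced by a slack variable and each ratio bound is multiplied through by r_{h+1} ≥ 0. The sketch below uses SciPy's `linprog` (HiGHS) rather than Gurobi, which is what this work used; the function name, the toy spectrum, and the bounds derived from it are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.stats import poisson


def lp_repeat_spectrum(o, P, lower, upper):
    """Minimise sum_{h>=2} |o_h - (P r)_h| s.t. lower_h <= r_h / r_{h+1} <= upper_h."""
    n_obs, n = P.shape
    n_abs = n_obs - 1                      # one slack t_h per bin h = 2..n_obs
    # Decision vector x = [r_1..r_n, t_2..t_{n_obs}]; only the slacks are costed.
    cost = np.r_[np.zeros(n), np.ones(n_abs)]

    A_fit, b_fit = P[1:], np.asarray(o, dtype=float)[1:]   # skip h = 1 (w_1 = 0)
    eye = np.eye(n_abs)
    # |o_h - (P r)_h| <= t_h, written as two linear inequalities.
    A_abs = np.block([[A_fit, -eye], [-A_fit, -eye]])
    b_abs = np.r_[b_fit, -b_fit]

    # Ratio bounds, linearised using r >= 0:
    #   lower_h * r_{h+1} - r_h <= 0   and   r_h - upper_h * r_{h+1} <= 0
    A_ratio = np.zeros((2 * (n - 1), n + n_abs))
    for h in range(n - 1):
        A_ratio[2 * h, [h, h + 1]] = [-1.0, lower[h]]
        A_ratio[2 * h + 1, [h, h + 1]] = [1.0, -upper[h]]

    res = linprog(cost,
                  A_ub=np.vstack([A_abs, A_ratio]),
                  b_ub=np.r_[b_abs, np.zeros(2 * (n - 1))],
                  bounds=(0, None), method="highs")
    return res.x[:n], res


# Toy instance: error-free counts at lambda = 1 from a hypothetical spectrum.
n = 5
h = np.arange(1, n + 1)[:, None]
j = np.arange(1, n + 1)[None, :]
P = poisson.pmf(h, j * 1.0)
r_true = np.array([1000.0, 200.0, 50.0, 20.0, 10.0])
ratios = r_true[:-1] / r_true[1:]
r_lp, res = lp_repeat_spectrum(P @ r_true, P, 0.5 * ratios, 2.0 * ratios)
```

In practice the bounds would come from the empirical support of the spectral ratios over the training genomes rather than from the true spectrum, as in this toy example.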

Spline Linear programming

The final method of estimating r is based on the LP estimate of r and the splines fitted on spectral ratios r_j/r_{j+1} as functions of $\frac{r_{j}}{\sum_{i \geq j} r_{i}}$ . Formally, let $r_{j}^{LP}$ denote the LP estimate of r_j obtained by constraining the spectral ratios to be within the support of R_j among the training genomes, as discussed above. For each j ∈ {1, 2, 3, 4, 5}, we used a generalized additive model (GAM), learned from 556 training genomes, to predict r_j/r_{j+1} based on $\frac{r_{j}^{LP}}{\sum_{i \geq j} r_{i}^{LP}}$ . Specifically, we model y_j = r_j/r_{j+1} for different genomes as samples drawn from a dependent random variable Y_j, which follows a gamma distribution whose mean is determined by

\begin{matrix} g_{j} (E [Y_{j}]) = s_{j} (\frac{r_{j}}{\sum_{i \geq j} r_{i}}), \end{matrix}

(8)

where g_j is called the link function, and s_j is the smoothing spline. These functions allow us to capture nonlinear dependencies between the variables in our model. For j = 1, 2, we use a logarithmic link function to account for the large dynamic range of r_j/r_j+1 over the training set, and use identity link for j = 3, 4, 5. For each fitted GAM, we empirically set the smoothing parameter to balance the over-fitting against the goodness of fit. We used R ‘mgcv’ package [35] for GAM fitting.

Using the LP estimates of r_j’s and plugging them into Eq 8, we predict the spectral ratios. Let $y_{j}^{SLP}$ denote the estimate of y_j using Eq 8 on previous estimates of r. We recursively re-estimate r_j for j ∈ {2, 3, 4, 5, 6} and call them $r_{j}^{SLP}$ :

\begin{matrix} r_{j}^{SLP} = \begin{cases} r_{j}^{LP} & j = 1 \text{ or } j > 6 \\ r_{j - 1}^{SLP} / y_{j - 1}^{SLP} & 2 \leq j \leq 6 \end{cases} \end{matrix}

(9)
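The recursion in Eq 9 can be sketched in a few lines of Python. Here the predicted ratios y_j are passed in as plain numbers; in the method they come from the fitted GAMs, which are not reproduced in this example. The function name and the input values are illustrative assumptions.

```python
import numpy as np


def spline_lp_update(r_lp, predicted_ratios):
    """Apply Eq 9: keep r_1 and the tail from LP, rebuild r_2..r_{J+1} from ratios.

    predicted_ratios[j-1] is y_j = r_j / r_{j+1} for j = 1..J (J = 5 in the text).
    """
    r_slp = np.asarray(r_lp, dtype=float).copy()
    for j, y in enumerate(predicted_ratios, start=2):   # j = 2..J+1
        r_slp[j - 1] = r_slp[j - 2] / y                 # r_j = r_{j-1} / y_{j-1}
    return r_slp


# Hypothetical LP estimate and predicted ratios.
r_lp = [100.0, 40.0, 30.0, 20.0, 10.0, 5.0, 2.0, 1.0]
r_slp = spline_lp_update(r_lp, predicted_ratios=[2.0, 2.0, 2.0, 2.0, 2.0])
```

Each re-estimated r_j depends on the already updated r_{j−1}, so the entries must be processed in increasing order of j.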

RESPECT algorithm

For the RESPECT algorithm, we replaced the basic iterative method described above with a simulated annealing procedure, outlined in Algorithm 1, to speed up the computations. To initialize the algorithm, we started with the assumptions that the genome has no repeats, i.e., r = [L, 0, 0, …], and that the error-free k-mer counts follow a Poisson distribution (Eq 5). Defining λ_ef = λ(1 − ϵ)^k as the error-free k-mer coverage, we estimate its initial value from the ratio of observed counts

λ_{ef} = \frac{(h^{*} + 1) o_{h^{*} + 1}}{o_{h^{*}}}, where h^{*} = \underset{h > 1}{arg max} o_{h},

and set

λ = e^{- λ_{ef}} \frac{λ_{ef}^{h^{*}}}{h^{*}!} \frac{o_{1}}{o_{h^{*}}} + λ_{ef} (1 - e^{- λ_{ef}}), ϵ = 1 - {(λ_{ef} / λ)}^{1 / k}

(see S1 Appendix). The above estimate of ϵ is used throughout the algorithm, but is corrected at the end based on the estimated uniqueness ratio (described below). Using the estimate of λ_ef, we compute P, and thus the error function $E$ at the start of the algorithm. For $E$ , we chose w = [0, 1, 1, …, 1] and p = 1 in Eq 6, so

E = \sum_{h = 2}^{n} | o_{h} - \sum_{j = 1}^{n} P_{h j} r_{j} |
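The initialization described above can be sketched as follows. This is an illustrative Python version of the closed-form initial estimates of λ_ef, λ, and ϵ under the no-repeat assumption; the function name and the synthetic histogram are assumptions for this example.

```python
import math
import numpy as np


def initial_estimates(o, k):
    """Initial lambda_ef, lambda and epsilon under the no-repeat assumption.

    o[h-1] is the number of distinct k-mers observed exactly h times.
    """
    o = np.asarray(o, dtype=float)
    # h* = argmax over h > 1; the last bin is excluded so that o_{h*+1} exists.
    h_star = int(np.argmax(o[1:-1])) + 2
    lam_ef = (h_star + 1) * o[h_star] / o[h_star - 1]
    poisson_h_star = math.exp(-lam_ef) * lam_ef ** h_star / math.factorial(h_star)
    lam = poisson_h_star * o[0] / o[h_star - 1] + lam_ef * (1.0 - math.exp(-lam_ef))
    eps = 1.0 - (lam_ef / lam) ** (1.0 / k)
    return lam_ef, lam, eps


# Synthetic repeat-free histogram with known parameters (illustration only).
k, genome_len, lam_true, eps_true = 31, 1e6, 1.0, 0.01
lam_ef_true = lam_true * (1.0 - eps_true) ** k
o = [genome_len * math.exp(-lam_ef_true) * lam_ef_true ** h / math.factorial(h)
     for h in range(1, 11)]
o[0] += genome_len * (lam_true - lam_ef_true)   # erroneous k-mers land in o_1
lam_ef_hat, lam_hat, eps_hat = initial_estimates(o, k)
```

On a repeat-free histogram generated from the model, these formulas return the generating parameters; on real data they only provide a starting point for the search.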

With the initial values of the parameters known, RESPECT runs a simulated annealing optimization until the error converges. At each iteration, a candidate λ_next in $[\frac{1}{2} λ, 3 λ]$ is selected uniformly at random, and P_next is computed from λ_next(1 − ϵ)^k. Next, we run the SLP method on (o, P_next) to get r_next. Throughout the algorithm, we used truncated o_1×m, r_1×n, and P_m×n, where the number of spectra is fixed at n = 50 (a reasonable compromise between accuracy and speed), and the number of observed counts m = n ⋅ max(1, λ_ef) scales proportionally with the initial estimate of error-free k-mer coverage. Using (o, P_next, r_next), the error function for the candidate state, $E_{next}$ , is calculated. If moving to the candidate state results in a reduction in the error ( $E_{next} < E$ ), the algorithm accepts the move and updates the current estimate of parameters. In addition, to help the algorithm deal with local minima and find better solutions, a simulated annealing scheme is implemented such that the algorithm probabilistically decides to move to states with higher error. Specifically, at iteration t, even if $E_{next} > E$ , the algorithm accepts the move with probability $exp (- (E_{next} - E) t / N)$ .

At the end of iterations, the initial estimate of ϵ (obtained under no-repeats assumption) is corrected based on the estimated value of r₁/L. The correction was learned over 120 genomes randomly selected from the training set, and applied if the estimated coverage is smaller than 1.5X. Then, λ is re-computed based on the corrected ϵ, and is used to compute the final estimates of coverage and genome length. The estimated sequencing error rate and repeat spectrum are also provided by the algorithm.

Algorithm 1: The RESPECT method

Start with $λ_{ef} = λ^{(0)} {(1 - ϵ)}^{k} = \frac{(h^{*} + 1) o_{h^{*} + 1}}{o_{h^{*}}}$ , where h* = arg max_h>1 o_h;

Compute P⁽⁰⁾, $E^{(0)} = {min}_{r} E (P^{(0)}, r^{(0)}, o)$ , and $r^{(0)} = arg {min}_{r} E (P^{(0)}, r^{(0)}, o)$ ;

Find $E = o_{1} - \sum_{j} P_{1 j}^{(0)} r_{j}^{(0)}$ ;

Set $λ^{(0)} = e^{- λ_{ef}} \frac{λ_{ef}^{h^{*}}}{h^{*}!} \frac{o_{1}}{o_{h^{*}}} + λ_{ef} (1 - e^{- λ_{ef}})$ , and compute ϵ from λ_ef and λ⁽⁰⁾;

for 1 ≤ t ≤ N do

$λ^{(t)} \leftarrow U [\frac{1}{2} \cdot λ^{(t - 1)}, 3 \cdot λ^{(t - 1)}]$ ;

Use λ^(t) and ϵ to compute P^(t), $r^{(t)} = arg {min}_{r} E (P^{(t)}, r^{(t)}, o)$ , and

$E^{(t)} = {min}_{r} E (P^{(t)}, r^{(t)}, o)$ ;

Move to λ^(t) with probability $min \{ 1, exp (\frac{E^{(t - 1)} - E^{(t)}}{N - t + 1}) \}$ ;

end

Correct ϵ and set λ = λ^(N)(1 − ϵ)^k/(1 − ϵ_corrected)^k;

Output $c = \frac{ℓ}{ℓ - k + 1} λ$ , L = B/c, ϵ_corrected, and r^(N)
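The main loop of Algorithm 1 can be sketched as below. This is a simplified Python outline, not the RESPECT code: the SLP step is replaced by a pluggable `solve_r` callable, and the example uses a deliberately crude stand-in (`single_copy_solver`, a hypothetical helper that fits only r₁). The acceptance rule follows the form written in Algorithm 1, with temperature N − t + 1; the final correction of ϵ is omitted.

```python
import math
import numpy as np
from scipy.stats import poisson


def l1_error(o, P, r):
    """Error with w = [0, 1, ..., 1] and p = 1 (the h = 1 bin is ignored)."""
    return float(np.abs(o[1:] - (P @ r)[1:]).sum())


def single_copy_solver(o, P):
    """Hypothetical stand-in for the SLP step: fit r_1 only, assuming no repeats."""
    col = P[1:, 0]
    r = np.zeros(P.shape[1])
    r[0] = max(0.0, float(col @ o[1:]) / float(col @ col))
    return r


def anneal_coverage(o, lam0, eps, k, solve_r, n_iter=200, n_rep=50, seed=0):
    """Simulated annealing over the k-mer coverage lambda."""
    o = np.asarray(o, dtype=float)
    rng = np.random.default_rng(seed)
    h = np.arange(1, len(o) + 1)[:, None]
    j = np.arange(1, n_rep + 1)[None, :]

    def evaluate(lam):
        P = poisson.pmf(h, j * lam * (1.0 - eps) ** k)
        r = solve_r(o, P)
        return r, l1_error(o, P, r)

    lam = lam0
    r, err = evaluate(lam)
    for t in range(1, n_iter + 1):
        lam_next = rng.uniform(0.5 * lam, 3.0 * lam)
        r_next, err_next = evaluate(lam_next)
        # min{1, exp((E_prev - E_next) / (N - t + 1))}; the temperature cools with t.
        accept_prob = math.exp(min(0.0, (err - err_next) / (n_iter - t + 1)))
        if rng.random() < accept_prob:
            lam, r, err = lam_next, r_next, err_next
    return lam, r, err


# Synthetic repeat-free, error-free histogram with true lambda = 1 (illustration only).
o = 1e6 * poisson.pmf(np.arange(1, 21), 1.0)
lam_hat, r_hat, err_hat = anneal_coverage(o, lam0=2.0, eps=0.0, k=31,
                                          solve_r=single_copy_solver)
```

Passing the solver as an argument keeps the search loop independent of how r is estimated, so an LP or SLP solver could be substituted without changing the annealing logic.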

SRA preprocessing and contamination filtering

After downloading SRA accessions and converting them to FASTQ using SRA Toolkit [36], we used BBDuk and Dedupe from the BBTools package to trim adapter sequences and remove duplicate reads. We then ran Kraken2 to remove contamination of prokaryotic or human origin. For plant and invertebrate samples, we filtered out any read that was classified to the Kraken database at 0 confidence level (very sensitive, a single matched k-mer is enough for the classification). For vertebrates, due to their smaller evolutionary distance to Homo sapiens, we required 0.5 confidence level (more specific, half of the read’s k-mers should match) for human classification, and 0 confidence level for everything else in the database.

Implementation details and running time

We use the ‘count’ and ‘histo’ commands from the Jellyfish [37] command line tool to compute the k-mer histogram of input genome-skims. In each iteration of the RESPECT algorithm, we solve a constrained optimization problem using the tools provided by the Gurobi Python interface in the ‘gurobipy’ package. The running time of RESPECT slowly increases with the coverage, as the size of P (and hence the size of the optimization problem at each iteration) scales with the (initial) estimate of coverage. On average, for a typical 0.5X-4X coverage of genome-skims, it takes about 2 hours for the RESPECT algorithm to converge and produce the final estimate of the parameters.
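A k-mer histogram can be turned into the observed-count vector o with a few lines of Python. The sketch below assumes the histogram is available as text lines of the form "count frequency" (two whitespace-separated integers per line), which is the layout assumed here for the ‘histo’ output; the function name and the example lines are illustrative.

```python
import numpy as np


def read_kmer_histogram(lines, n_obs):
    """Parse 'count frequency' pairs into o, where o[h-1] is the number of
    distinct k-mers seen exactly h times; bins above n_obs are dropped."""
    o = np.zeros(n_obs)
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue                      # skip blank or malformed lines
        h, freq = int(fields[0]), int(fields[1])
        if 1 <= h <= n_obs:
            o[h - 1] = freq
    return o


# Made-up histogram lines for illustration.
example_lines = ["1 100", "2 30", "", "5 2", "60 1"]
o = read_kmer_histogram(example_lines, n_obs=5)
```

The truncation length `n_obs` plays the role of m in the description of the algorithm, so it would normally be chosen from the initial coverage estimate.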

Selecting species with known recent WGD events

From the total of 83 RefSeq genomes in our database, we obtained the WGD annotation (with estimated age) for 44 plant species [29]. WGD annotations for the remaining 32 plant species in our database were based on the data provided by the 1000 plants project [30], where either the exact same species or a species from the same genus is identified to have undergone a WGD event using transcriptomic data. We also have 7 Salmonid genomes where their common ancestor is thought to have had a WGD event about 80My ago [28].

Statistical analysis of the repeat structure

In a random genome with length L, there are L − k + 1 ≃ L k-mers, and assuming the random selection of k-mers is uniform over the space of all 4^k possible k-mers, the probability distribution for the copy number (CN) of each k-mer is

\begin{matrix} Prob [CN = x] = \binom{L}{x} {(\frac{1}{4^{k}})}^{x} {(1 - \frac{1}{4^{k}})}^{L - x} . \end{matrix}

For typical values of L ∼ 100 − 1000 Mbp and k = 31, the conditions to use a Poisson distribution to approximate a Binomial (see e.g., Section 5.4 of [38]) are met, i.e., L ≫ 1 and 4^−k ≪ 1, hence we have

\begin{matrix} Prob [CN = x] = e^{- L / 4^{k}} \frac{{(L / 4^{k})}^{x}}{x!} . \end{matrix}

If the genome subsequently undergoes n_w whole genome duplication events, the genome length is multiplied by $2^{n_{w}}$ . However, the multiplicity of each k-mer increases by a factor of at most $2^{n_{w}}$ , as mutations reduce the copy number of k-mers. Therefore, to have an HCRM value of H, there should exist at least one k-mer with copy number x ≥ HL in the original random genome. Now, considering that under the random-genome model the selection of any k-mer is equally likely, we can use the union bound (see e.g., Section 1.5 of [38]) and have

\begin{matrix} Prob [HCRM \geq H] & < \sum_{\text{all } k\text{-mers}} \sum_{x \geq H L} e^{- L / 4^{k}} \frac{{(L / 4^{k})}^{x}}{x!} \\ & = 4^{k} \sum_{x \geq H L} e^{- L / 4^{k}} \frac{{(L / 4^{k})}^{x}}{x!} . \end{matrix}

(10)

We used WolframAlpha [39] to compute the bound in (10) for several values of H. For H = 200 and L ∈ [100 − 1000] Mbp, the resulting p-values were less than 10⁻¹⁰⁰.
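The bound in (10) underflows ordinary floating point, so it is convenient to evaluate it in log space. The following Python sketch is one way to do this and is not the calculation used in this work; it assumes that L in the threshold HL is expressed in Mbp (HCRM being a per-million quantity), and it bounds the Poisson tail by its leading term times a geometric factor. The function name is hypothetical.

```python
import math


def log10_hcrm_bound(H, genome_len_bp, k=31):
    """log10 of an upper bound on the right-hand side of Eq 10."""
    mu = genome_len_bp / 4.0 ** k                 # Poisson mean L / 4^k
    x0 = math.ceil(H * genome_len_bp / 1e6)       # threshold H * L, with L in Mbp (assumption)
    # log of the leading Poisson term exp(-mu) * mu**x0 / x0!
    log_term = -mu + x0 * math.log(mu) - math.lgamma(x0 + 1)
    # Successive terms shrink by at least mu / (x0 + 1), so the tail is at most
    # the leading term times 1 / (1 - mu / (x0 + 1)) whenever mu < x0 + 1.
    log_tail = log_term - math.log1p(-mu / (x0 + 1))
    return (k * math.log(4.0) + log_tail) / math.log(10.0)


bound_100mbp = log10_hcrm_bound(H=200, genome_len_bp=100e6)
bound_1000mbp = log10_hcrm_bound(H=200, genome_len_bp=1000e6)
```

Working with logarithms avoids evaluating factorials of very large copy numbers directly, and the returned value can be compared against −100 to check the statement above.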

To test the association between WGD events and the values of r₁/L and HCRM, we used the assembled genomes of 622 RefSeq species and constructed a two by two contingency table where columns represent the species with or without an identified recent WGD, and the rows specify whether or not the genome has r₁/L and HCRM values less than 0.8 and 200, respectively. We filled the table with the count of genomes that satisfied each of these four conditions, performed a Fisher’s exact test (using the R ‘stats’ package [31]), and obtained a p-value of 1.8 × 10⁻²³ for the association between the rows and columns of the table.
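The same kind of test can be run in Python with SciPy's `fisher_exact`, shown here in place of the R ‘stats’ package. The counts in the table below are hypothetical and chosen only to illustrate the layout; they are not the counts from this study.

```python
from scipy.stats import fisher_exact

# HYPOTHETICAL counts for illustration only; not the table from this study.
#         recent WGD   no recent WGD
table = [[60, 100],   # r1/L < 0.8 and HCRM < 200
         [10, 400]]   # otherwise
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
```

A small p-value indicates that the row condition and the column condition are unlikely to be independent.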

Supporting information

S1 Appendix. Supplementary methods and data.

Detailed mathematical derivations and supplementary tables. Table A: SRA preprocessing results. Table B: List of species with recent WGD events.

(PDF)

S1 Fig. Whole RefSeq taxonomy with r₁/L annotation.

A: Plants, B: Invertebrates, C: Mammals, D: Other vertebrates.

(TIF)

S2 Fig. Distributions of intra-generic versus inter-generic differences in r1/L for pairs of RefSeq species.

A: Plants, B: Invertebrates, C: Mammals, D: Other vertebrates.

(TIF)

S3 Fig. Correlation of r₁/L with spectral ratios.

A: r₁/r₃ versus r₁/L, B: r₁/r₅ versus r₁/L.

(TIF)

S4 Fig. Comparing the distributions of r1/L among test and all RefSeq genomes.

The p-value for the hypothesis that the distributions are different using two-sided Kolmogorov–Smirnov test is 0.93. Highly-repetitive genomes are slightly over-represented in the test set.

(TIF)

S5 Fig. Correlation between true r₄/r₃ and estimated r₃/∑_i=3 r_i.

(TIF)

S6 Fig. Correlation between true r₅/r₄ and estimated r₄/∑_i=4 r_i.

(TIF)

S7 Fig. Correlation between true r₆/r₅ and estimated r₅/∑_i=5 r_i.

(TIF)

S8 Fig. Correlation between the relative error in the estimated sequencing error and the uniqueness ratio.

A subset of 120 training genomes were selected as the cross-validation set, and genome-skims were simulated at 1X coverage with 1% sequencing error rate. There is a strong correlation (R = −0.995) between the error in estimating ϵ and r₁/L ratio. We capped the correction at 20% (red dashed line).

(TIF)

S9 Fig. r₁ estimation convergence with time.

(TIF)

S10 Fig. r₂ estimation convergence with time.

(TIF)

S11 Fig. r₃ estimation convergence with time.

(TIF)

S12 Fig. r₄ estimation convergence with time.

(TIF)

S13 Fig. r₅ estimation convergence with time.

(TIF)

S14 Fig. Genome length convergence with time.

(TIF)

S15 Fig. Genome length estimation error of RESPECT and CovEst.

The coverage is 1X, and the y-axis is in square-root scale. The sign of the error indicates overestimation or underestimation. The dashed lines mark the region in which the absolute value of the error is less than 5%.

(TIF)

S16 Fig. Estimated to true genome length ratio.

Comparing RESPECT and CovEst over 66 test species with genomes skimmed at 1X coverage. The y-axis is plotted in log scale, and the red dashed line at y = 1 is the ground truth (no error). Two genomes (A. tauschii (0.002) and Z. mays (0.003)) for which CovEst had extremely low estimated-to-true ratios were removed to improve readability.

(TIF)

S17 Fig. Impact of training data on length estimation accuracy.

RESPECT was trained on a subset of genomes (50 of 129 mammalian genomes and 50 of 195 invertebrate genomes were removed), and the error plotted (circles) along with the error on the original training set (triangles). A: The error per genome is plotted in log scale on the y-axis. B: The distribution of error values with RESPECT trained on the subset (blue) and the entire data set (red).

(TIF)

S18 Fig. Length estimation error on simulated data at different coverages.

The distribution of error made by RESPECT and CovEst in estimating the length of 66 test genomes skimmed at 0.5X, 1X, 2X, and 4X coverage. The y-axis is plotted in log scale.

(TIF)

S19 Fig. Estimated to true genome length ratio.

Comparing RESPECT and CovEst over 66 test species with genomes skimmed at 0.5X coverage. The y-axis is plotted in log scale, and the red dashed line at y = 1 is the ground truth (no error). Four genomes (D. grimshawi (0.0004), S. salar (0.0006), A. tauschii (0.0012), and Z. mays (0.0016)) for which CovEst had extremely low estimated-to-true ratios were removed to improve readability.

(TIF)

S20 Fig. Estimated to true genome length ratio.

Comparing RESPECT and CovEst over 66 test species with genomes skimmed at 2X coverage. The y-axis is plotted in log scale, and the red dashed line at y = 1 is the ground truth (no error).

(TIF)

S21 Fig. Estimated to true genome length ratio.

Comparing RESPECT and CovEst over 66 test species with genomes skimmed at 4X coverage. The y-axis is plotted in log scale, and the red dashed line at y = 1 is the ground truth (no error). Four genomes (D. grimshawi (0.0004), S. salar (0.0006), A. tauschii (0.0012), and Z. mays (0.0016)) for which CovEst had extremely low estimated-to-true ratios were removed to improve readability.

(TIF)

S22 Fig. Distribution of length estimation error over four major taxonomic groups.

Significant p-values (0.05 threshold) computed using the Mann–Whitney U test are added to the plot. Plants and invertebrates have higher error rates compared to vertebrate species in our test dataset.

(TIF)

S23 Fig. Length estimation error vs. uniqueness ratio.

Negative correlation between RESPECT’s error and uniqueness ratio of the genome.

(TIF)

S24 Fig. Length estimation error for 10 bacterial genomes.

The 10 bacterial genomes were selected at random from RefSeq and genome-skims were simulated at 1X coverage. The relative error of the estimated length is plotted in log scale on the y-axis.

(TIF)

S25 Fig. Whole RefSeq taxonomy with HCRM annotation.

Colors are based on logarithm of HCRM values for each genome. A: Plants, B: Invertebrates, C: Mammals, D: Other vertebrates.

(TIF)

S26 Fig. Distributions of intra-generic versus inter-generic differences in HCRM for pairs of RefSeq species.

A: Plants, B: Invertebrates, C: Mammals, D: Other vertebrates.

(TIF)

S27 Fig. High copy repeats per million versus uniqueness ratio among genomes with and without known recent WGD events.

HCRM values are computed directly from the genome assemblies.

(TIF)

S28 Fig. Estimating genome length using SRA data.

RESPECT was tested on 10 new samples (chosen at random) made available since the original submission of the manuscript. One of the samples was removed during the preprocessing due to a high duplication rate. The results for the remaining 9 samples are plotted along with the original test species. Two newly added samples with high error are Z. cesonia and V. riparia. RESPECT overestimates their genome length by 28%. It could be the case that the assemblies are missing some repetitive sequences (especially V. riparia, which has a highly repetitive genome), considering that for both species there is a gap between the reported total sequence length and the total ungapped length.

(TIF)

Data Availability

RESPECT software and the trained models are publicly available on https://github.com/shahab-sarmashghi/RESPECT under a BSD 3-Clause license. Run accessions of SRA files used in the tests are provided in Table A in S1 Appendix.

Funding Statement

VB and SS were supported in part by grants from the NSF (IIS-1815485) and from the NIH (1R01GM114362). MB, ER, and SM were supported in part by a grant from the NSF (IIS-1815485). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

1. Brondizio E, Settele J, Diaz S, Ngo H. Global assessment report on biodiversity and ecosystem services of the Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services. IPBES Secretariat, Bonn. 2019.
2. Rosenberg KV, Dokter AM, Blancher PJ, Sauer JR, Smith AC, Smith PA, et al. Decline of the North American avifauna. Science. 2019; p. eaaw1313.
3. Lewin HA, Robinson GE, Kress WJ, Baker WJ, Coddington J, Crandall KA, et al. Earth BioGenome Project: Sequencing life for the future of life. Proceedings of the National Academy of Sciences. 2018;115(17):4325–4333. doi: 10.1073/pnas.1720115115
4. Hebert PDN, Cywinska A, Ball SL, deWaard JR. Biological identifications through DNA barcodes. Proceedings of the Royal Society B: Biological Sciences. 2003;270(1512):313–321. doi: 10.1098/rspb.2002.2218
5. Savolainen V, Cowan RS, Vogler AP, Roderick GK, Lane R. Towards writing the encyclopaedia of life: an introduction to DNA barcoding. Philosophical Transactions of the Royal Society B: Biological Sciences. 2005;360(1462):1805–1811. doi: 10.1098/rstb.2005.1730
6. Taberlet P, Coissac E, Pompanon F, Brochmann C, Willerslev E. Towards next-generation biodiversity assessment using DNA metabarcoding. Molecular Ecology. 2012;21(8):2045–2050. doi: 10.1111/j.1365-294X.2012.05470.x
7. Hickerson MJ, Meyer CP, Moritz C, Hedin M. DNA Barcoding Will Often Fail to Discover New Animal Species over Broad Parameter Space. Systematic Biology. 2006;55(5):729–739. doi: 10.1080/10635150600969898
8. Quicke DLJ, Alex Smith M, Janzen DH, Hallwachs W, Fernandez-Triana J, Laurenne NM, et al. Utility of the DNA barcoding gene fragment for parasitic wasp phylogeny (Hymenoptera: Ichneumonoidea): Data release and new measure of taxonomic congruence. Molecular Ecology Resources. 2012;12(4):676–685. doi: 10.1111/j.1755-0998.2012.03143.x
9. Liu S, Li Y, Lu J, Su X, Tang M, Zhang R, et al. SOAP Barcode: revealing arthropod biodiversity through assembly of Illumina shotgun sequences of PCR amplicons. Methods in Ecology and Evolution. 2013;4(12):1142–1150. doi: 10.1111/2041-210X.12120
10. DNAmark. http://dnamark.ku.dk/english/
11. France Génomique—Mutualisation des compétences et des équipements français pour l’analyse génomique et la bio-informatique. https://www.france-genomique.org/
12. Coissac E, Hollingsworth PM, Lavergne S, Taberlet P. From barcodes to genomes: Extending the concept of DNA barcoding; 2016.
13. Bohmann K, Mirarab S, Bafna V, Gilbert MTP. Beyond DNA barcoding: The unrealized potential of genome skim data in sample identification. Molecular Ecology. 2020;29(14):2521–2534. doi: 10.1111/mec.15507
14. Sarmashghi S, Bohmann K, Gilbert PMT, Bafna V, Mirarab S. Skmer: assembly-free and alignment-free sample identification using genome skims. Genome Biol. 2019;20(1):34. doi: 10.1186/s13059-019-1632-4
15. Balaban M, Sarmashghi S, Mirarab S. APPLES: Scalable Distance-based Phylogenetic Placement with or without Alignments. Systematic Biology. 2019.
16. Rachtman E, Bafna V, Mirarab S. CONSULT: accurate contamination removal using locality-sensitive hashing. NAR Genomics and Bioinformatics. 2021;3(3). doi: 10.1093/nargab/lqab071
17. Li X, Waterman MS. Estimating the repeat structure and length of DNA sequences using L-tuples. Genome research. 2003;13(8):1916–22. doi: 10.1101/gr.1251803
18. Williams D, Trimble WL, Shilts M, Meyer F, Ochman H. Rapid quantification of sequence repeats to resolve the size, structure and contents of bacterial genomes. BMC Genomics. 2013. doi: 10.1186/1471-2164-14-537
19. Hozza M, Vinař T, Brejová B. How Big is that Genome? Estimating Genome Size and Coverage from k-mer Abundance Spectra. In: String Processing and Information Retrieval. Cham: Springer International Publishing; 2015. p. 199–209.
20. Melsted P, Pritchard JK. Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinformatics. 2011. doi: 10.1186/1471-2105-12-333
21. Melsted P, Halldórsson BV. KmerStream: Streaming algorithms for k-mer abundance estimation. Bioinformatics. 2014. doi: 10.1093/bioinformatics/btu713
22. Wahba G. Spline models for observational data. SIAM; 1990.
23. Hastie TJ, Tibshirani RJ. Generalized additive models. vol. 43. CRC press; 1990.
24. Leinonen R, Sugawara H, Shumway M, Collaboration INSD. The sequence read archive. Nucleic acids research. 2010;39(suppl_1):D19–D21. doi: 10.1093/nar/gkq1019
25. Bushnell B. BBMap. https://sourceforge.net/projects/bbmap/
26. Wood DE, Lu J, Langmead B. Improved metagenomic analysis with Kraken 2. Genome biology. 2019;20(1):257. doi: 10.1186/s13059-019-1891-0
27. Rachtman E, Balaban M, Bafna V, Mirarab S. The impact of contaminants on the accuracy of genome skimming and the effectiveness of exclusion read filters. Molecular Ecology Resources. 2020;20(3):1755–0998.13135. doi: 10.1111/1755-0998.13135
28. Lien S, Koop BF, Sandve SR, Miller JR, Kent MP, Nome T, et al. The Atlantic salmon genome provides insights into rediploidization. Nature. 2016;533(7602):200–205. doi: 10.1038/nature17164
29. Van de Peer Y, Mizrachi E, Marchal K. The evolutionary significance of polyploidy. Nature Reviews Genetics. 2017;18(7):411. doi: 10.1038/nrg.2017.26
30. Initiative OTPT, et al. One thousand plant transcriptomes and the phylogenomics of green plants. Nature. 2019;574(7780):679. doi: 10.1038/s41586-019-1693-2
31. R Core Team. R: A Language and Environment for Statistical Computing; 2019. Available from: https://www.R-project.org/
32. Lawson CL, Hanson RJ. Solving least squares problems. SIAM; 1995.
33. Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, et al. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods. 2020;17:261–272. doi: 10.1038/s41592-019-0686-2
34.Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual; 2020. https://urldefense.proofpoint.com/v2/url?u=http-3A__www.gurobi.com&d=DwIGAw&c=-35OiAkTchMrZOngvJPOeA&r=ZozViWvD1E8PorCkfwYKYQMVKFoEcqLFm4Tg49XnPcA&m=f-xS8GMHKckknkc7Xpp8FJYw_ltUwz5frOw1a5pJ81EpdTOK8xhbYmrN4ZxniM96&s=C1GiSoqoq4vgbUiZw5Nfxx4IQ_LwAUsssTIgH041GBo&e=.
35. Wood SN. Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society (B). 2011;73(1):3–36. doi: 10.1111/j.1467-9868.2010.00749.x [DOI] [Google Scholar]
36.SRA Toolkit Development Team. SRA-Tools;. https://urldefense.proofpoint.com/v2/url?u=http-3A__ncbi.github.io_sra-2Dtools_&d=DwIGAw&c=-35OiAkTchMrZOngvJPOeA&r=ZozViWvD1E8PorCkfwYKYQMVKFoEcqLFm4Tg49XnPcA&m=f-xS8GMHKckknkc7Xpp8FJYw_ltUwz5frOw1a5pJ81EpdTOK8xhbYmrN4ZxniM96&s=rWyVMENufclEbfQE9Tiwjfo_jkVRcVm43kgcguo4hfI&e=.
37. Marçais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011;27(6):764–770. doi: 10.1093/bioinformatics/btr011 [DOI] [PMC free article] [PubMed] [Google Scholar]
38. DeGroot MH, Schervish MJ. Probability and statistics. Pearson Education; 2012. [Google Scholar]
39.Wolfram Alpha LLC. Wolfram|Alpha;. https://urldefense.proofpoint.com/v2/url?u=https-3A__www.wolframalpha.com_widgets_view.jsp-3Fid-3D74e8bb60ad4e38d6a1b0dc865d7197ff&d=DwIGAw&c=-35OiAkTchMrZOngvJPOeA&r=ZozViWvD1E8PorCkfwYKYQMVKFoEcqLFm4Tg49XnPcA&m=f-xS8GMHKckknkc7Xpp8FJYw_ltUwz5frOw1a5pJ81EpdTOK8xhbYmrN4ZxniM96&s=9cFzZ5HZsLK7ML6fRuCQqu7cakKiK5mvW9czOHOTXXM&e=.

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009449.r001

Decision Letter 0

Nicola Segata, Sushmita Roy

17 Jun 2021

Dear Mr. Sarmashghi,

Thank you very much for submitting your manuscript "Estimating repeat spectra and genome length from low-coverage genome skims with RESPECT" for consideration at PLOS Computational Biology. As with all papers reviewed by the journal, your manuscript was reviewed by members of the editorial board and by several independent reviewers. The reviewers appreciated the attention to an important topic. Based on the reviews, we are likely to accept this manuscript for publication, providing that you modify the manuscript according to the review recommendations.

Please prepare and submit your revised manuscript within 30 days. If you anticipate any delay, please let us know the expected resubmission date by replying to this email.

When you are ready to resubmit, please upload the following:

[1] A letter containing a detailed list of your responses to all review comments, and a description of the changes you have made in the manuscript. Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

[2] Two versions of the revised manuscript: one with either highlights or tracked changes denoting where the text has been changed; the other a clean version (uploaded as the manuscript file).

Important additional instructions are given below your reviewer comments.

Thank you again for your submission to our journal. We hope that our editorial process has been constructive so far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Nicola Segata

Associate Editor

PLOS Computational Biology

Sushmita Roy

Deputy Editor

PLOS Computational Biology

***********************


Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: The authors present a new tool called RESPECT to perform estimation of genome length, coverage and repeat content from high-throughput shotgun sequencing reads. They show that their tool provides improved performance compared to other similar alternatives (namely CovEst). Although this is a tool with a highly specific utility, the authors have done a good job in describing the rationale behind their approach and providing sufficient details on how their method works. I think this will be a valuable contribution to the field. I have a few comments that I believe would make the tool even more useful and convincing.

1) In the manuscript, RESPECT is only evaluated against eukaryotic genomes. Given that the majority of sequence repositories are dominated by prokaryotic sequences, it would be very useful to understand how their tool behaves with bacterial genomes as well, at least in predicting genome length and coverage. In principle, I see no obvious reason why it should not work, aside from the fact that bacterial genomes are much less repetitive. This would also provide a more diverse sequence set to understand if the accuracy of their method is biased to particular taxonomic groups.

2) For Fig. 3 it would be useful to evaluate the length error by taxonomy. There seems to be a big range in the accuracy of their method (from 0.1% to 50%). Is their algorithm better at predicting the length of certain taxa as opposed to others?

3) I wonder if it would be possible to further improve their genome length predictions by taking into account prior knowledge on the corresponding taxa. For instance, the authors could perform a first-pass k-mer assignment of the sequence data to a database like RefSeq to determine which taxonomic group dominates the data. Based on this match, they would deduce that the predicted genome length would likely fall within a certain range determined by the known genomes (e.g., in RefSeq) within that taxon. This could potentially resolve some cases where the error rates are >5-10%.

4) I have some concerns regarding the test/training datasets used to evaluate their tool. I understand the authors used 66 species as the test dataset, representing the diversity of RefSeq. However, how phylogenetically balanced were the remaining 556 genomes used for training? I imagine there were clear biases for certain taxonomic groups. How did the authors account for this?

5) The number of public, sequenced datasets used for benchmarking is relatively small (29 species). Why did the authors not test a larger and more diverse set of public data? This raises the question of how representative the results shown are of what users will actually encounter in their own datasets.

6) The analysis presented in the manuscript is strictly focused on short read data (both simulated and publicly available datasets). Given the rise of long-read sequencing, it would be important to assess how their tool copes with PacBio/nanopore sequence data. At the very least, the authors should further evaluate the accuracy of RESPECT with higher error rates typical of long-read data (it seems they have only tested a 1% sequencing error rate).

7) The font size of panels C and D of Figure 1 is extremely small. Suggest increasing to improve readability.

Reviewer #2: While I didn't have any trouble installing the dependencies, it would be useful to provide a conda package or Docker container prior to final publication. While not required for publication, it would also be useful to test on sequencing platforms aside from Illumina, specifically to check whether it also works on PacBio CCS data.

Reviewer #3: The paper presents an interesting technique for leveraging low-coverage sequencing data to estimate important characteristics of the underlying genome, namely the genome’s length and its k-mer repeat spectrum. The problem is well-motivated as the k-mer repeat spectrum is an informative statistic that can yield, for example, the genomic diversity across individuals in a population, and the ability to determine the spectrum with a low-computational cost makes it an attractive alternative to computationally expensive methods such as complete genome assembly. The presented approach is shown to improve on previous approaches to this problem.

An additional contribution of the paper is in providing both theoretical and empirical justification for why an initial optimization approach fails to accurately estimate the k-mer repeat spectrum. The paper points out that the initial approach fails because its estimate of the spectrum is sensitive to small differences in the initial sequencing data. The novel optimization approach introduced in the paper seeks to address this difficulty by imposing constraints on the estimate of the spectrum.

The paper is well-written, and the presented method works well even at <1X coverage. As such, I recommend accepting the paper.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data and code underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data and code should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data or code —e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

Figure Files:

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org.

Data Requirements:

Please note that, as a condition of publication, PLOS' data policy requires that you make available all data used to draw the conclusions outlined in your manuscript. Data must be deposited in an appropriate repository, included within the body of the manuscript, or uploaded as supporting information. This includes all numerical values that were used to generate graphs, histograms, etc. For an example in PLOS Biology, see http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5.

Reproducibility:

To enhance the reproducibility of your results, we recommend that you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option to publish peer-reviewed clinical study protocols. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

References:

Review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript.

If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

PLoS Comput Biol. 2021 Nov 15;17(11):e1009449. doi: 10.1371/journal.pcbi.1009449.r002

Author response to Decision Letter 0

26 Aug 2021

Attachment

Submitted filename: RESPECT-reviews-response.docx (DOCX, 483.6KB)

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009449.r003

Decision Letter 1

Nicola Segata, Sushmita Roy

13 Sep 2021

Dear Mr. Sarmashghi,

We are pleased to inform you that your manuscript 'Estimating repeat spectra and genome length from low-coverage genome skims with RESPECT' has been provisionally accepted for publication in PLOS Computational Biology.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow up email. A member of our team will be in touch with a set of requests.

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us now if you or your institution is planning to press release the article. All press must be co-ordinated with PLOS.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Computational Biology.

Best regards,

Nicola Segata

Associate Editor

PLOS Computational Biology

Sushmita Roy

Deputy Editor

PLOS Computational Biology

***********************************************************

Reviewer's Responses to Questions

Comments to the Authors:

Please note here if the review is uploaded as an attachment.

Reviewer #1: I thank the authors for addressing all my comments and I have no further concerns.

**********

Have the authors made all data and (if applicable) computational code underlying the findings in their manuscript fully available?

Reviewer #1: Yes

**********

PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

PLoS Comput Biol. doi: 10.1371/journal.pcbi.1009449.r004

Acceptance letter

Nicola Segata, Sushmita Roy

9 Nov 2021

PCOMPBIOL-D-21-00449R1

Estimating repeat spectra and genome length from low-coverage genome skims with RESPECT

Dear Dr Bafna,

I am pleased to inform you that your manuscript has been formally accepted for publication in PLOS Computational Biology. Your manuscript is now with our production department and you will be notified of the publication date in due course.

The corresponding author will soon be receiving a typeset proof for review, to ensure errors have not been introduced during production. Please review the PDF proof of your manuscript carefully, as this is the last chance to correct any errors. Please note that major changes, or those which affect the scientific understanding of the work, will likely cause delays to the publication date of your manuscript.

Soon after your final files are uploaded, unless you have opted out, the early version of your manuscript will be published online. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

Thank you again for supporting PLOS Computational Biology and open-access publishing. We are looking forward to publishing your work!

With kind regards,

Olena Szabo

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

S1 Appendix. Supplementary methods and data.

Detailed mathematical derivations and supplementary tables. Table A: SRA preprocessing results. Table B: List of species with recent WGD events.

(PDF, 427.8KB)

S1 Fig. Whole RefSeq taxonomy with r₁/L annotation.

A: Plants, B: Invertebrates, C: Mammals, D: Other vertebrates.

(TIF, 4.3MB)

S2 Fig. Distributions of intra-generic versus inter-generic differences in r₁/L for pairs of RefSeq species.

A: Plants, B: Invertebrates, C: Mammals, D: Other vertebrates.

(TIF, 492.3KB)

S3 Fig. Correlation of r₁/L with spectral ratios.

A: r₁/r₃ versus r₁/L, B: r₁/r₅ versus r₁/L.

(TIF, 379.3KB)

S4 Fig. Comparing the distributions of r₁/L among test and all RefSeq genomes.

A two-sided Kolmogorov–Smirnov test of whether the two distributions differ gives a p-value of 0.93, so there is no evidence of a difference. Highly repetitive genomes are slightly over-represented in the test set.

(TIF, 111.4KB)

S5 Fig. Correlation between true r₄/r₃ and estimated r₃/∑_{i≥3} r_i.

(TIF, 310.5KB)

S6 Fig. Correlation between true r₅/r₄ and estimated r₄/∑_{i≥4} r_i.

(TIF, 307.5KB)

S7 Fig. Correlation between true r₆/r₅ and estimated r₅/∑_{i≥5} r_i.

(TIF, 304.7KB)

S8 Fig. Correlation between the relative error in the estimated sequencing error and the uniqueness ratio.

(TIF, 206.5KB)

S9 Fig. r₁ estimation convergence with time.

(TIF, 161.1KB)

S10 Fig. r₂ estimation convergence with time.

(TIF, 171.3KB)

S11 Fig. r₃ estimation convergence with time.

(TIF, 166.5KB)

S12 Fig. r₄ estimation convergence with time.

(TIF, 166.1KB)

S13 Fig. r₅ estimation convergence with time.

(TIF, 166.7KB)

S14 Fig. Genome length convergence with time.

(TIF, 154.5KB)

S15 Fig. Genome length estimation error of RESPECT and CovEst.

(TIF, 659.7KB)

S16 Fig. Estimated to true genome length ratio.

(TIF, 648.2KB)

S17 Fig. Impact of training data on length estimation accuracy.

(TIF, 1.1MB)

S18 Fig. Length estimation error on simulated data at different coverages.

The distribution of error made by RESPECT and CovEst in estimating the length of 66 test genomes skimmed at 0.5X, 1X, 2X, and 4X coverage. The y-axis is plotted in log scale.

(TIF, 296.5KB)

S19 Fig. Estimated to true genome length ratio.

(TIF, 653.1KB)

S20 Fig. Estimated to true genome length ratio.

Comparing RESPECT and CovEst over 66 test species with genomes skimmed at 2X coverage. The y-axis is plotted in log scale, and the red dashed line at y = 1 is the ground truth (no error).

(TIF, 641.1KB)

S21 Fig. Estimated to true genome length ratio.

(TIF, 639.1KB)

S22 Fig. Distribution of length estimation error over four major taxonomic groups.

Significant p-values (0.05 threshold), computed using the Mann-Whitney U test, are added to the plot. Plants and invertebrates have higher error rates than vertebrate species in our test dataset.

(TIF, 281.4KB)

S23 Fig. Length estimation error vs. uniqueness ratio.

Negative correlation between RESPECT’s error and uniqueness ratio of the genome.

(TIF, 166.2KB)

S24 Fig. Length estimation error for 10 bacterial genomes.

The 10 bacterial genomes were selected at random from RefSeq and genome-skims were simulated at 1X coverage. The relative error of the estimated length is plotted in log scale on the y-axis.

(TIF, 261.2KB)

S25 Fig. Whole RefSeq taxonomy with HCRM annotation.

Colors are based on logarithm of HCRM values for each genome. A: Plants, B: Invertebrates, C: Mammals, D: Other vertebrates.

(TIF, 4.6MB)

S26 Fig. Distributions of intra-generic versus inter-generic differences in HCRM for pairs of RefSeq species.

A: Plants, B: Invertebrates, C: Mammals, D: Other vertebrates.

(TIF, 498.3KB)

S27 Fig. High copy repeats per million versus uniqueness ratio among genomes with and without known recent WGD events.

HCRM values are computed directly from the genome assemblies.

(TIF, 386.4KB)

S28 Fig. Estimating genome length using SRA data.

(TIF, 382.6KB)


Data Availability Statement

[pcbi.1009449.ref001] 1.Brondizio E, Settele J, Diaz S, Ngo H. Global assessment report on biodiversity and ecosystem services of the Intergovernmental Science-Policy Platform on Biodiversity and Ecosystem Services. IPBES Secretariat, Bonn. 2019.

[pcbi.1009449.ref002] 2. Rosenberg KV, Dokter AM, Blancher PJ, Sauer JR, Smith AC, Smith PA, et al. Decline of the North American avifauna. Science. 2019; p. eaaw1313. [DOI] [PubMed] [Google Scholar]

[pcbi.1009449.ref003] 3. Lewin HA, Robinson GE, Kress WJ, Baker WJ, Coddington J, Crandall KA, et al. Earth BioGenome Project: Sequencing life for the future of life. Proceedings of the National Academy of Sciences. 2018;115(17):4325–4333. doi: 10.1073/pnas.1720115115 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009449.ref004] 4. Hebert PDN, Cywinska A, Ball SL, deWaard JR. Biological identifications through DNA barcodes. Proceedings of the Royal Society B: Biological Sciences. 2003;270(1512):313–321. doi: 10.1098/rspb.2002.2218 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009449.ref005] 5. Savolainen V, Cowan RS, Vogler AP, Roderick GK, Lane R. Towards writing the encyclopaedia of life: an introduction to DNA barcoding. Philosophical Transactions of the Royal Society B: Biological Sciences. 2005;360(1462):1805–1811. doi: 10.1098/rstb.2005.1730 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009449.ref006] 6. TABERLET P, COISSAC E, POMPANON F, BROCHMANN C, WILLERSLEV E. Towards next-generation biodiversity assessment using DNA metabarcoding. Molecular Ecology. 2012;21(8):2045–2050. doi: 10.1111/j.1365-294X.2012.05470.x [DOI] [PubMed] [Google Scholar]

[pcbi.1009449.ref007] 7. Hickerson MJ, Meyer CP, Moritz C, Hedin M. DNA Barcoding Will Often Fail to Discover New Animal Species over Broad Parameter Space. Systematic Biology. 2006;55(5):729–739. doi: 10.1080/10635150600969898 [DOI] [PubMed] [Google Scholar]

[pcbi.1009449.ref008] 8. Quicke DLJ, Alex Smith M, Janzen DH, Hallwachs W, Fernandez-Triana J, Laurenne NM, et al. Utility of the DNA barcoding gene fragment for parasitic wasp phylogeny (Hymenoptera: Ichneumonoidea): Data release and new measure of taxonomic congruence. Molecular Ecology Resources. 2012;12(4):676–685. doi: 10.1111/j.1755-0998.2012.03143.x [DOI] [PubMed] [Google Scholar]

[pcbi.1009449.ref009] 9. Liu S, Li Y, Lu J, Su X, Tang M, Zhang R, et al. SOAP Barcode: revealing arthropod biodiversity through assembly of Illumina shotgun sequences of PCR amplicons. Methods in Ecology and Evolution. 2013;4(12):1142–1150. doi: 10.1111/2041-210X.12120 [DOI] [Google Scholar]

[pcbi.1009449.ref010] 10.DNAmark;. https://urldefense.proofpoint.com/v2/url?u=http-3A__dnamark.ku.dk_english_&d=DwIGAw&c=-35OiAkTchMrZOngvJPOeA&r=ZozViWvD1E8PorCkfwYKYQMVKFoEcqLFm4Tg49XnPcA&m=f-xS8GMHKckknkc7Xpp8FJYw_ltUwz5frOw1a5pJ81EpdTOK8xhbYmrN4ZxniM96&s=bLrJY2bZOaMwX7-wgqHMUFPmdwlC8mzmM_cfTqV6iYQ&e=.

[pcbi.1009449.ref011] 11.France Génomique—Mutualisation des compétences et des équipements français pour l’analyse génomique et la bio-informatique;. https://urldefense.proofpoint.com/v2/url?u=https-3A__www.france-2Dgenomique.org_&d=DwIGAw&c=-35OiAkTchMrZOngvJPOeA&r=ZozViWvD1E8PorCkfwYKYQMVKFoEcqLFm4Tg49XnPcA&m=f-xS8GMHKckknkc7Xpp8FJYw_ltUwz5frOw1a5pJ81EpdTOK8xhbYmrN4ZxniM96&s=qorpdKH7FcNJOO57GkUOQqRqoG8DOPSdBw9t9POHRLM&e=.

[pcbi.1009449.ref012] 12.Coissac E, Hollingsworth PM, Lavergne S, Taberlet P. From barcodes to genomes: Extending the concept of DNA barcoding; 2016. [DOI] [PubMed]

[pcbi.1009449.ref013] 13. Bohmann K, Mirarab S, Bafna V, Gilbert MTP. Beyond DNA barcoding: The unrealized potential of genome skim data in sample identification. Molecular Ecology. 2020;29(14):2521–2534. doi: 10.1111/mec.15507 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009449.ref014] 14. Sarmashghi S, Bohmann K, Gilbert PMT, Bafna V, Mirarab S. Skmer: assembly-free and alignment-free sample identification using genome skims. Genome Biol. 2019;20(1):34. doi: 10.1186/s13059-019-1632-4 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009449.ref015] 15. Balaban M, Sarmashghi S, Mirarab S. APPLES: Scalable Distance-based Phylogenetic Placement with or without Alignments. Systematic Biology. 2019. [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009449.ref016] 16. Rachtman E, Bafna V, Mirarab S. CONSULT: accurate contamination removal using locality-sensitive hashing. NAR Genomics and Bioinformatics. 2021;3(3). doi: 10.1093/nargab/lqab071 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009449.ref017] 17. Li X, Waterman MS. Estimating the repeat structure and length of DNA sequences using L-tuples. Genome research. 2003;13(8):1916–22. doi: 10.1101/gr.1251803 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009449.ref018] 18. Williams D, Trimble WL, Shilts M, Meyer F, Ochman H. Rapid quantification of sequence repeats to resolve the size, structure and contents of bacterial genomes. BMC Genomics. 2013. doi: 10.1186/1471-2164-14-537 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009449.ref019] 19.Hozza M, Vinař T, Brejová B. How Big is that Genome? Estimating Genome Size and Coverage from k-mer Abundance Spectra. In: String Processing and Information Retrieval. Cham: Springer International Publishing; 2015. p. 199–209.

[pcbi.1009449.ref020] 20. Melsted P, Pritchard JK. Efficient counting of k-mers in DNA sequences using a bloom filter. BMC Bioinformatics. 2011. doi: 10.1186/1471-2105-12-333 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009449.ref021] 21. Melsted P, Halldórsson BV. KmerStream: Streaming algorithms for k-mer abundance estimation. Bioinformatics. 2014. doi: 10.1093/bioinformatics/btu713 [DOI] [PubMed] [Google Scholar]

[pcbi.1009449.ref022] 22. Wahba G. Spline models for observational data. SIAM; 1990. [Google Scholar]

[pcbi.1009449.ref023] 23. Hastie TJ, Tibshirani RJ. Generalized additive models. vol. 43. CRC press; 1990. [Google Scholar]

[pcbi.1009449.ref024] 24. Leinonen R, Sugawara H, Shumway M, Collaboration INSD. The sequence read archive. Nucleic acids research. 2010;39(suppl_1):D19–D21. doi: 10.1093/nar/gkq1019 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009449.ref025] 25.Bushnell B. BBMap;. https://urldefense.proofpoint.com/v2/url?u=https-3A__sourceforge.net_projects_bbmap_&d=DwIGAw&c=-35OiAkTchMrZOngvJPOeA&r=ZozViWvD1E8PorCkfwYKYQMVKFoEcqLFm4Tg49XnPcA&m=f-xS8GMHKckknkc7Xpp8FJYw_ltUwz5frOw1a5pJ81EpdTOK8xhbYmrN4ZxniM96&s=wHMG_abosIk1qjWX1pSjNSge27HY8IrvhOxQ-rQlbDA&e=.

[pcbi.1009449.ref026] 26. Wood DE, Lu J, Langmead B. Improved metagenomic analysis with Kraken 2. Genome biology. 2019;20(1):257. doi: 10.1186/s13059-019-1891-0 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009449.ref027] 27. Rachtman E, Balaban M, Bafna V, Mirarab S. The impact of contaminants on the accuracy of genome skimming and the effectiveness of exclusion read filters. Molecular Ecology Resources. 2020;20(3):1755–0998.13135. doi: 10.1111/1755-0998.13135 [DOI] [PubMed] [Google Scholar]

[pcbi.1009449.ref028] 28. Lien S, Koop BF, Sandve SR, Miller JR, Kent MP, Nome T, et al. The Atlantic salmon genome provides insights into rediploidization. Nature. 2016;533(7602):200–205. doi: 10.1038/nature17164 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009449.ref029] 29. Van de Peer Y, Mizrachi E, Marchal K. The evolutionary significance of polyploidy. Nature Reviews Genetics. 2017;18(7):411. doi: 10.1038/nrg.2017.26 [DOI] [PubMed] [Google Scholar]

[pcbi.1009449.ref030] 30. Initiative OTPT, et al. One thousand plant transcriptomes and the phylogenomics of green plants. Nature. 2019;574(7780):679. doi: 10.1038/s41586-019-1693-2 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pcbi.1009449.ref031] 31.R Core Team. R: A Language and Environment for Statistical Computing; 2019. Available from: https://urldefense.proofpoint.com/v2/url?u=https-3A__www.R-2Dproject.org_&d=DwIGAw&c=-35OiAkTchMrZOngvJPOeA&r=ZozViWvD1E8PorCkfwYKYQMVKFoEcqLFm4Tg49XnPcA&m=f-xS8GMHKckknkc7Xpp8FJYw_ltUwz5frOw1a5pJ81EpdTOK8xhbYmrN4ZxniM96&s=Cn5NMJYc-_vmoyFtIIR3uzMmsnMwX_mfKBxC8g0JxpE&e=.

[pcbi.1009449.ref032] 32. Lawson CL, Hanson RJ. Solving least squares problems. SIAM; 1995.

[pcbi.1009449.ref033] 33. Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, et al. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods. 2020;17:261–272. doi: 10.1038/s41592-019-0686-2

[pcbi.1009449.ref034] 34. Gurobi Optimization, LLC. Gurobi Optimizer Reference Manual; 2020. http://www.gurobi.com

[pcbi.1009449.ref035] 35. Wood SN. Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society (B). 2011;73(1):3–36. doi: 10.1111/j.1467-9868.2010.00749.x

[pcbi.1009449.ref036] 36. SRA Toolkit Development Team. SRA-Tools. http://ncbi.github.io/sra-tools/

[pcbi.1009449.ref037] 37. Marçais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011;27(6):764–770. doi: 10.1093/bioinformatics/btr011

[pcbi.1009449.ref038] 38. DeGroot MH, Schervish MJ. Probability and statistics. Pearson Education; 2012.

[pcbi.1009449.ref039] 39. Wolfram Alpha LLC. Wolfram|Alpha. https://www.wolframalpha.com/widgets/view.jsp?id=74e8bb60ad4e38d6a1b0dc865d7197ff
