BMC Genomics. 2025 Sep 29;26:878. doi: 10.1186/s12864-025-12031-9

A closed-loop method for precise genome size estimation using HiFi reads

Jianfeng Xing 1,2,3, Jiangshan Hao 5, Chaorong Tang 1,2,3, Shangqian Xie 4, Kaiye Liu 1,2,3
PMCID: PMC12482813  PMID: 41023799

Abstract

Background

Super pangenomes, as complete genome sequencing at the genus level, have provided new insights into the speciation and evolution of functional genes. Genome size (GS) estimation is a critical first step. Although K-mer-based GS evaluators are applied extensively to guide genome assembly process and quality assessment, the results vary substantially with the tools and parameters used, presenting challenges for genus-level genome studies.

Results

Here, we investigated K-mer spectra from datasets of species with and without whole genome duplication, revealing that the trade-off in K-mer length amplified the signal of genomic characteristics related to either repeat content or heterozygosity. Moreover, GS predictions were influenced by genomic heterozygosity and sequencing accuracy when different K-mer lengths were employed. In contrast, consistent GS predictions were obtained across all HiFi-based evaluations, demonstrating the high accuracy of the limiting values derived from the regions where GS evaluations converge as K varies continuously. Unlike traditional methods that rely on single predictions, we introduced a closed-loop GS-estimating framework that incorporates steady-value calculations, leveraging the continuity and accuracy of HiFi reads. Finally, we developed a high-performance pipeline, LVgs (https://github.com/xingjianfeng100/LVgs), by integrating FastK and GenomeScope 2.0.

Conclusions

The robustness and applicability of LVgs for genus-level species were demonstrated through its application to various diploid and polyploid species.

Supplementary Information

The online version contains supplementary material available at 10.1186/s12864-025-12031-9.

Keywords: Genome size estimation, HiFi reads, K-mer lengths, LVgs

Background

Genome size (GS) varies dramatically across eukaryotes, spanning at least five orders of magnitude, providing insights into the evolution of large-scale genomic properties [1–6]. GS variation in closely related Caenorhabditis nematodes is associated with different modes of sexual reproduction [7, 8]. In plants, GS differences among species with the same genetic background have been associated with homoploid hybridization and gene flow in Anacyclus [9, 10]. A similar phenomenon has been documented in economically important crops such as Oryza sativa, where genome evolution driven by transposable elements (TEs) has led to nearly 100 Mb of GS variation between O. sativa and its wild relatives [11]. Notably, ploidy evolution at the genus level, driven by whole genome duplication (WGD) or interspecies hybrids, has resulted in severalfold differences in total nuclear DNA content among genus-level species, despite their similar monoploid genome sizes (1Cx) [12]. Therefore, assessing GS in closely related species could enhance our understanding of genome evolution among related species and facilitate the use of more distantly related species as genetic resources. However, the underrepresentation of most species, particularly in plants, has limited our genomic understanding, primarily due to the scarcity of whole-genome sequence information [13]. Conversely, a growing number of genomic studies that sample multiple representative species at the genus level as super pangenomes have substantially advanced our understanding of key issues in plant genetic diversity, evolution, domestication, and molecular breeding [14]. For genera with significant variation in GS, prior genomic data from only a few species are insufficient for constructing a genus-level representative super pangenome. Therefore, a robust and precise method that considers factors such as heterozygosity, repetitive sequences and polyploidy within the same genetic background is essential for effective GS evaluation.
This method should also serve as prior information to guide genome assembly and act as a benchmark for assessing genome completeness and redundancy in super pangenome research.

Several laboratory procedures, including Feulgen image analysis densitometry and flow cytometry (the current gold standard), measure DNA amounts directly against specific genomes used as internal size standards, and have been extensively employed to provide precise and reliable GS estimates [1, 15]. On the other hand, the emergence of next generation sequencing (NGS) has enabled bioinformatic estimates of GS based on sequencing data [16]. A key subset of these approaches is based on sequencing depth and depends on reference mapping, such as ModEst [17] and LocoGSE, which depend on single-copy gene datasets [18], as well as MGSE, which can leverage long-read sequencing technologies [19]. A significant complementary category of de novo methods, which can be reference-free [20], has been developed to infer GS by analyzing K-mer frequencies and plotting K-mer spectra. A widely adopted calculation method involves dividing the total number of K-mers by the K-mer coverage [21]. The final GS evaluation for the same species can vary because of genomic differences between individuals or specific properties of a sequencing library, such as sub-optimal sample conditions, DNA isolation and library preparation; beyond this variation, however, K-mer-based strategies frequently produce biased GS estimates. This bias is evident when comparing different K-mer evaluators using the same sequencing datasets [22]. Such discrepancies are likely due to inaccuracies in K-mer coverage or total K-mer counts in complex genomes, particularly those with WGD. These factors can obscure the source location of original K-mers, as substrings from similar genomic regions may be identical. Furthermore, this issue is exacerbated by genomic heterozygosity and sequencing errors.
Employing long K-mers from highly accurate sequence reads, while constraining their length to avoid critically low K-mer abundance and suboptimal coverage, may enable precise positioning and improve K-mer coverage estimation in complex genomes. However, most high-throughput, high-accuracy NGS platforms generate short reads, typically between 50 and 300 bp long [23], which limits the use of long K-mers. Additionally, at the same sequencing depth, shorter reads yield fewer K-mers for a given K-mer length, which further reduces the accuracy of the statistical analyses [21].
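
The coverage-division estimate and the read-length effect described above can be illustrated with a minimal Python sketch. The function names, the error-bin cutoff and the toy histogram are illustrative assumptions and are not taken from any of the cited tools.

```python
def kmers_per_read(read_len: int, k: int) -> int:
    """Number of K-mers in one read: L - K + 1, or 0 if the read is shorter than K."""
    return max(read_len - k + 1, 0)


def naive_genome_size(histogram: dict, min_cov: int = 2) -> float:
    """Total number of K-mers divided by the main-peak coverage.

    histogram maps coverage -> number of distinct K-mers seen at that coverage.
    Bins below min_cov are skipped as presumed sequencing errors (an assumption).
    """
    kept = {c: n for c, n in histogram.items() if c >= min_cov}
    total_kmers = sum(c * n for c, n in kept.items())
    peak_cov = max(kept, key=kept.get)  # coverage bin holding the most distinct K-mers
    return total_kmers / peak_cov


# Toy histogram (made-up numbers): an error bin at 1x and a single peak at 10x.
toy_hist = {1: 500, 9: 100, 10: 300, 11: 100}
size = naive_genome_size(toy_hist)

# The same 15,000 sequenced bases, as 100 short reads or as one long read, with K = 21.
short_total = 100 * kmers_per_read(150, 21)
long_total = kmers_per_read(15_000, 21)
print(size, short_total, long_total)
```

In this toy case the short reads yield fewer K-mers than the single long read, and a K larger than the read length yields none at all, which is the constraint on long K-mers noted above.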

The recently developed third-generation sequencing (TGS) technology from PacBio offers long, high-fidelity (HiFi) reads, enabling the generation of robust K-mers across a wider length range [24, 25]. This makes HiFi reads suitable for complex genomes [26]. Additionally, ONT sequencing offers ultra-long reads that facilitate telomere-to-telomere (T2T) assembly. Recent technological advancements in ONT sequencing have significantly improved its accuracy and throughput, promising future applications in genome size estimation [27]. In this study, we compared the K-mer spectra between low and high K values, as well as between species with and without WGD. The results reveal a sensitivity trade-off of the K value in evaluating genomic features, where short K-mers enhance signals of genomic repeat characteristics, while long K-mers amplify signals of genomic heterozygosity. These results indicate that single K-mer-based estimation is insufficiently reliable, necessitating a comprehensive approach for accurate GS evaluation. Here we constructed a pipeline named LVgs (https://github.com/xingjianfeng100/LVgs), incorporating the K-mer counter FastK and GenomeScope 2.0. The robustness and accuracy of LVgs were confirmed using ten HiFi datasets with consensus reads or re-corrected reads, encompassing heterozygous diploid, homozygous diploid, autotriploid, autotetraploid and allotetraploid samples. Moreover, we demonstrated the applicability of this tool through a practical example with Allium species, highlighting its effectiveness for genus-level GS estimation.

Results

The sensitivity trade-off of K value between repeat and heterozygosity detection

An anomalous phenomenon was observed in the 15-mer distribution pattern from diploid human HiFi datasets, where a polyploid-like feature appeared instead of the expected diploid pattern (Supplementary Table 1a-1b, Supplementary Fig. 1). We initially presumed that this anomaly was due to the insufficient length of the K-mers, as smaller K values increase the likelihood of random K-mer similarity. For a 3 Gb human genome, K should ideally be greater than 19 [21, 28]. However, K-mer distribution patterns are also considered reflections of inherent genomic characteristics [20], and repetitive components may contribute to the polyploidy illusion observed in K-mer spectra. Thus, we speculated that multi-copy regions occurring at least twice in the genome, which account for approximately 2% of the human genome [29], are sensitively reflected in the 15-mer spectra. Next, we examined the K-mer histograms of the human genome at low K values. By using multiple K values for prediction, we could easily distinguish the heterozygous and homozygous peaks (Fig. 1A, Supplementary Fig. 2, Supplementary Tables 2–5). Notably, the peaks exhibiting twice the coverage of the homozygous peak remained sufficiently pronounced to be detectable even at K = 27. This observation led us to speculate that a substantial portion of two-copy homozygous K-mers from WGD regions, rather than short K-mer effects, produced the polyploidy features at low K levels. The human genome is known to have undergone two rounds of WGD in the ancestral vertebrate [30–33]; however, these events are too ancient to retain long syntenic fragments (Supplementary Fig. 3A). Only minor traces were observed: we detected 114 intra-genomic synteny blocks in the human genome (Supplementary Fig. 3B, Supplementary Table 6), suggesting that some remnants of these ancient events remain. To further validate the potential link between the low-K spectrum and WGD, we analyzed K-mer histograms based on HiFi reads of A. thaliana Col-0 (Supplementary Table 1c), which has experienced three ancient WGDs known as the alpha-beta-gamma series [34, 35]. A peak corresponding to two-copy homozygous K-mers was also observed in the low-K histogram of A. thaliana Col-0 (Fig. 1D, Supplementary Fig. 4, Supplementary Tables 8 and 9). Moreover, we extracted 1,123,110,433 bp (~ 37%) and 92,670,702 bp (~ 67%) of collinearity block sequences from the CHM13 and Col-0 genomes, respectively (Supplementary Tables 6–7). After surveying the frequency histograms of 15-mers and 17-mers, we found that the ratios of K-mers shared between collinearity blocks and HiFi reads in the two-copy homozygous K-mer peaks were 63–64% and 89–95%, significantly higher than those in the homozygous K-mer peaks (Fig. 1B, C, E and F). This result demonstrates that two-copy homozygous K-mers are enriched in collinearity block regions. These findings indicate that short K-mers are more effective in detecting genomic characteristics associated with repeat components.

Fig. 1

K-mer spectra with varying K values revealed the trade-off between repeat and heterozygosity detection. (A and D) Distribution patterns of K-mers with different lengths from corrected HiFi reads in two diploid WGD-experienced genomes, including human HG00733 (A) and Arabidopsis thaliana Col-0 (D); (B-C and E-F) Comparison of K-mer spectra between WGD-derived segments and HiFi read datasets, (B) 15-mers in human CHM13, (C) 17-mers in human CHM13, (E) 15-mers in A. thaliana Col-0, (F) 17-mers in A. thaliana Col-0; (G-H) Distribution patterns of K-mers with different lengths from Illumina reads in two diploid WGD-unexperienced genomes, including Anthoceros agrestis (G) and Chloropicon primus (H); lines or dots with different colours represent different K values, with the detailed legends on the right of each diagram; peaks are indicated by grey arrows, ‘Het peak’ means heterozygous K-mer peak, ‘Hom peak’ means homozygous K-mer peak, ‘2× Hom peak’ means K-mer peak with twice the coverage of the ‘Hom peak’. (I) A balance scale as an analogy for the trade-off of the K value between repeat detection and heterozygosity detection. In the left tray, the principle of genomic repeat detection as affected by changing K is illustrated by a concise schematic diagram with two simulated WGD sites and a simulated 4× dataset generating different proportions of 8× K-mers and 4× K-mers in two scenarios of large and small K; as K decreases from 16 to 8, the site with a base mutation generates two types of 8-mers with 4× coverage, while the site without a base mutation generates only one type of 8-mers with 8× coverage; the diagram demonstrates that K-mers with high coverage increase as K decreases, amplifying the characteristic signal of repeats. In the right tray, the principle of genomic heterozygosity detection as affected by changing K is illustrated by a concise schematic diagram with two simulated heterozygous or homozygous sites and a simulated 4× dataset generating different proportions of 4× K-mers and 2× K-mers in two scenarios of large and small K; as K increases from 8 to 16, the homozygous site generates only one type of 16-mers with 4× coverage, while the heterozygous site generates two types of 16-mers with 2

Fig. 2

Asymptotic genome size estimation of HG00733 via K-mer length progression. A Genome size predictions with rising K-mer length; the brown and pink dashed horizontal lines indicate the limiting value of genome size predictions and the reported assembly size [42], with the values shown on the right; the P-value of the Dickey-Fuller test is shown at the top left of the chart. B The model matching rates with the K-mer spectrum with rising K-mer length; the model is the 2 × p negative binomial distributions used by GenomeScope to predict genome size [25], where p is the ploidy. C The inferred average K-mer coverages for the diploid genome with rising K-mer length, which were used to calculate the genome size. D-F The inferred unique lengths, the inferred read error rates and the inferred repeat lengths with rising K-mer length. G-H Untransformed linear K-mer spectra of 417-mers and 427-mers plotted by GenomeScope2

Fig. 3

Genome size evaluation of HG00733 with corrected HiFi reads and datasets of different depths. A Genome size predictions with rising K-mer length using corrected HiFi reads; the brown and pink dashed horizontal lines indicate the limiting value of genome size predictions and the reported assembly size [42], with the values shown on the right; the P-value of the Dickey-Fuller test is shown at the top left of the chart. B The inferred read error rates with rising K-mer length. C-D Untransformed linear K-mer spectra of 417-mers and 427-mers plotted by GenomeScope2 using corrected HiFi reads. E Collapse of genome size estimation within the K range from 297 to 507 using datasets at three depth levels; the orange line with empty circles represents ~ 26× sequence depth, the grey line with solid squares represents ~ 30× sequence depth and the yellow line with hollow rhombi represents ~ 35× sequence depth

Fig. 4

Asymptotic genome size evaluation of the human homozygous genome CHM13 without evaluation collapse. A-F, G-L and M-R Genome size evaluation of CHM13 by searching for the limiting value of genome size predictions as K-mer length rises continuously, using the parameter “-l 29”, corrected HiFi reads and the parameter “-p 1”, respectively. Among these: (A), (G) and (M) Genome size predictions with rising K-mer length; the brown and pink dashed horizontal lines indicate the limiting value of genome size predictions and the reported assembly size [47], with the values shown on the right; the P-value of the Dickey-Fuller test is shown at the top left of the chart; (B), (H) and (N) The inferred average K-mer coverages for the diploid genome with rising K-mer length, which were used to calculate the genome size; (C-F), (I-L) and (O-R) Untransformed linear K-mer spectra of 45-mers, 50-mers, 130-mers and 135-mers plotted by GenomeScope2

In addition, to verify the connection between shorter K-mer spectra and WGD from a different angle, Illumina datasets (Supplementary Table 1k-1l) from Anthoceros agrestis [36] and Chloropicon primus [37], which currently lack evidence of WGD, were collected to compare their low-K spectra (Supplementary Tables 10 and 11, Fig. 1G and H). As expected, the peak corresponding to two-copy homozygous K-mers was absent in A. agrestis and C. primus. To rule out artifacts of the sequencing technology, we validated the two-copy homozygous peaks (Supplementary Fig. 5, Supplementary Tables 12–13) in K-mer spectra using Illumina datasets (Supplementary Table 1m-1n) from human and A. thaliana Col-0. These results imply that, alongside synonymous substitution, gene tree, and synteny-based methods [38–40], K-mer spectrum analysis provides a theoretical framework for detecting WGD. We noticed that the peaks with lower coverage were strengthened with increasing K, a phenomenon particularly evident in the high-K spectra. This trend resembles the amplification of heterozygous peaks by long K-mers (see section two of the Results) rather than a general coverage dilution effect of long K-mers. This pattern was more clearly visualized in K-mer histograms from corrected HiFi datasets, where no leftward shift of peaks was observed (Fig. 1A and D). Based on the above results, it can be concluded that decreasing K in the K-mer spectrum is more effective for evaluating the repeat component of genomic characteristics, while increasing K is preferable for assessing heterozygosity components. This trade-off in the effect of the K value can be likened to a balance scale, where the K value serves as the balance rider, leading the mechanism to tilt towards either repeat detection or heterozygosity detection (Fig. 1I). Our findings also indicated that K-mer spectra with different K values are affected by the repeat and heterozygosity features.
Therefore, it is necessary to conduct whole-scale K predictions rather than relying on single K prediction, as the latter is insufficiently reliable for accurate GS evaluation.
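
The trade-off can be reproduced on a toy example in the spirit of the schematic in Fig. 1I. The sketch below is only an illustration under simplifying assumptions: the two sequences are made-up, differ at a single base, and stand in either for two WGD-derived copies or for two haplotypes; the function names are hypothetical.

```python
def kmer_set(seq: str, k: int) -> set:
    """All distinct K-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_and_private(copy_a: str, copy_b: str, k: int) -> tuple:
    """Count distinct K-mers present in both copies and in exactly one copy."""
    a, b = kmer_set(copy_a, k), kmer_set(copy_b, k)
    return len(a & b), len(a ^ b)


# Two made-up 32-base copies that differ by one substitution at index 16.
copy_a = "ACGTTGCAAGCTTAGGCATCGATTCAGGTCAA"
copy_b = copy_a[:16] + "T" + copy_a[17:]

shared_8, private_8 = shared_and_private(copy_a, copy_b, 8)
shared_16, private_16 = shared_and_private(copy_a, copy_b, 16)
print("K=8 :", shared_8, "shared,", private_8, "private")
print("K=16:", shared_16, "shared,", private_16, "private")
```

Shared K-mers are sequenced from both copies and pile up at doubled coverage, whereas private K-mers stay at single-copy coverage. Shortening K increases the shared fraction (a stronger repeat signal), and lengthening K increases the private fraction (a stronger heterozygosity signal).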

Fig. 5

Asymptotic genome size estimation of complex genomes via K-mer length progression. A-C Genome size evaluation for large genomes, including Allium cepa with HiFi reads (A), A. fistulosum with HiFi reads (B) and A. sativum with corrected HiFi reads (C). D-F GS evaluation for polyploid genomes with corrected HiFi reads, including autotriploid Musa acuminata (D), autotetraploid Solanum tuberosum (E) and allotetraploid Agapanthus africanus with asymmetrical chromosome numbers between subgenomes (F). The brown and pink dashed horizontal lines indicate the limiting value of genome size predictions and the reported assembly size [22, 57, 58], with the values shown on the right; the P-value of the Dickey-Fuller test is shown at the top left of the chart. “1Cx” means monoploid genome size, that is, the amount of DNA in one chromosome set [64]

Table 1.

Multi-source and multi-method genome size evaluations for four species of the family Amaryllidaceae

Method                          K     Allium cepa   Allium fistulosum   Allium sativum   Agapanthus africanus (1C)
LVgs (Gb)                             15.61         10.82               15.64            10.34
Published in other study (Gb)
  Assembly size [22]                  15.78         11.17               15.52            10.48
  Flow cytometry [22, 61–63]          15.88         11.47               15.43            11.93
Based on K-mer from Illumina [22]
  Kmergenie                           13.40         9.80                20.73            9.80
  GenomeScope                   19    10.68         7.04                19.40            4.28
  GenomeScope                   21    11.32         7.42                20.73            4.72
  GenomeScope                   23    11.80         7.72                21.67            5.10
  GenomeScope                   25    12.20         7.96                22.47            5.46
  GenomeScope                   27    12.54         8.18                23.32            5.78
  GenomeScope                   29    12.83         8.37                23.96            6.09
  GenomeScope                   31    13.09         8.54                24.53            6.36
  GenomeScope 2.0               19    5.30          3.46                8.93             4.23
  GenomeScope 2.0               21    5.62          7.43                9.35             4.67
  GenomeScope 2.0               23    5.86          7.73                9.54             5.06
  GenomeScope 2.0               25    12.21         7.97                9.69             5.41
  GenomeScope 2.0               27    12.54         8.19                9.78             5.72
  GenomeScope 2.0               29    12.84         8.38                9.83             6.03
  GenomeScope 2.0               31    13.09         8.55                10.19            6.30
  GenomeScope 2.0               41    \             9.20                \                \
  GenomeScope 2.0               43    \             \                   \                7.52
  GenomeScope 2.0               55    14.71         \                   \                \
  GCE                           19    6.91          8.79                27.30            4.87
  GCE                           21    7.88          9.93                30.00            5.61
  GCE                           23    8.55          10.60               32.70            6.31
  GCE                           25    9.12          11.30               34.40            6.90
  GCE                           27    9.62          11.80               36.10            7.56
  GCE                           29    9.98          12.50               37.50            8.21
  GCE                           31    10.40         12.80               39.80            8.72

For the allotetraploid Agapanthus africanus with asymmetrical chromosome numbers between subgenomes, the 1C value was previously reported in another study. For consistency, the average 1Cx value output by GenomeScope 2.0 in this study was converted to a 1C value by multiplying it by 2. “1C” represents holoploid genome size, defined as the total amount of DNA in the unreplicated haploid nucleus [64]; and “1Cx” denotes monoploid genome size, corresponding to the DNA content of a single chromosome set [64]

Accurate evaluation and sudden collapse coexisted during loop predictions

A GS prediction loop was constructed using the published HiFi reads dataset [41] of human HG00733 (Supplementary Table 1a). As the K-mer length was increased incrementally from 17 to 577, short K-mers produced highly variable GS estimates, whereas the estimates stabilized once the K-mer length exceeded 77, as shown by the trend curve of GS values (Dickey-Fuller test; P-value = 0.014). After determining the limiting value, the final GS estimate for HG00733 was 2,938 Mb, closely matching the reported assembly size [42] (Fig. 2A and Supplementary Table 14). Apart from the overall trend of the GS curve, the optimal value can also be determined by evaluating the trend of model matching rates as an auxiliary index, which serves to assess the reliability of the GS predictions (Fig. 2B). As expected, the total number of K-mers decreased as the K-mer length increased (Supplementary Fig. 6A). In a hypothetical scenario with a non-repetitive genome, each original K-mer occurs only once, and the theoretical number of K-mer types would be nearly equal to the total number of K-mers divided by the sequencing depth [21]. Under these conditions, the trend in the number of K-mer types should closely follow the trend in the total number of K-mers. However, the opposite trend was observed in the number of K-mer types as K-mer length increased (Supplementary Fig. 6B). This result suggests that longer K-mers, which are more likely to span repetitive segments, identified an increasing number of repeat-associated K-mers as unique. This phenomenon might also be attributed to the higher probability of incorporating sequence errors in longer K-mers, resulting in an increased number of erroneous K-mers [21]. However, the opposite trend persisted for K-mer sizes up to 200, even after minimizing the impact of sequence errors by recounting only the K-mers with frequencies greater than eight (Supplementary Fig. 6C and 6D).
Accordingly, the inferred average K-mer coverage (Fig. 2C) and the inferred repeat length (Fig. 2F) decreased, while the inferred unique length (Fig. 2D) increased as K-mer size increased. These results suggest that some short K-mers may have been incorrectly classified as repeat K-mers, inflating the inferred average K-mer coverage, until the K-mer size exceeded a certain threshold, possibly 200 bp in this case.

Two cyclical patterns were observed in the inferred read error rates, each characterized by an initial spike followed by a gradual decrease (Fig. 2E). For a dataset with a fixed sequence error rate, increasing the K-mer size results in a higher probability of generating erroneous K-mers [21], which logically accounts for the initial spike. However, the subsequent decrease was unexpected and remains unexplained. After examining the K-mer spectra at five critical K-mer sizes (Supplementary Fig. 6E-6H and Fig. 2G), ranging from 37-mer to 417-mer, we observed a notable influence of K-mer length on the heterozygous peak. As the K-mer length increased, the heterozygous peak became more pronounced, while its coverage decreased. In addition, considering the pivotal role of error K-mers in the error rate calculation of GenomeScope 2.0 (https://github.com/tbenavi1/genomescope2.0/blob/master/R/report_results.R, line 245), we examined the trend in the count of error K-mers with increasing K. Moreover, we rearranged that equation to express the number of error K-mers in terms of the error rate and total_kmers, and calculated the error K-mers at three steady error rates: 2.2‰, 1.7‰ and 1.3‰. Due to a higher probability of incorporating sequence errors, longer K-mers tend to generate more erroneous K-mers. Consequently, an overall increasing trend is observed in all four types of “total_error_kmers” as K values rise (Supplementary Table 15). Nevertheless, the growth of error K-mers from the HiFi reads decelerates progressively compared with that at the three steady error rates (Supplementary Fig. 7). This divergence in the growth trajectories resulted in a reduction of the computed read error rate, which should have remained constant. We hypothesize that the rise in error K-mers due to increasing K reduces the number of non-erroneous K-mers, leading to a subsequent decrease in K-mer coverage. This is evident across K-mer spectra, where peaks shift leftward with increasing K and the corresponding inferred coverage gradually declines (Fig. 1G and H, S2A, S2C, S4, S5, Table S14, S19, S21, S23). This trend decelerates or remains indistinct in K-mer spectra from corrected reads (Fig. 1A and D, S2B, Table S15, S18), further supporting our hypothesis. Consequently, the boundary between error K-mers and non-erroneous K-mers blurs with increasing K, making it increasingly challenging for the GenomeScope mixture model to distinguish error K-mers from low-frequency K-mers.
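
A minimal sketch of the relation between read error rate and error K-mers follows. It assumes that a K-mer is error-free only when all K of its bases are correct, which gives error_rate = 1 − (1 − total_error_kmers/total_kmers)^(1/K) and, rearranged, total_error_kmers = (1 − (1 − error_rate)^K) × total_kmers. Whether this is exactly the expression on the cited GenomeScope 2.0 line is an assumption; the function names and the constant TOTAL_KMERS are illustrative.

```python
def error_rate_from_kmers(total_error_kmers: float, total_kmers: float, k: int) -> float:
    """Per-base error rate implied by the fraction of erroneous K-mers (assumed model)."""
    return 1.0 - (1.0 - total_error_kmers / total_kmers) ** (1.0 / k)


def expected_error_kmers(error_rate: float, total_kmers: float, k: int) -> float:
    """Rearranged form: erroneous K-mers expected at a fixed per-base error rate."""
    return (1.0 - (1.0 - error_rate) ** k) * total_kmers


# Illustrative constant; in real data the total number of K-mers also changes with K.
TOTAL_KMERS = 1_000_000_000

# The three steady error rates named in the text, evaluated at a few K values.
for rate in (0.0022, 0.0017, 0.0013):
    counts = [round(expected_error_kmers(rate, TOTAL_KMERS, k)) for k in (17, 77, 417)]
    print(rate, counts)
```

Under this model the expected number of error K-mers at a fixed error rate grows monotonically with K, so an observed count that grows more slowly translates into a lower computed error rate.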

The second cycle began with a spike at 427-mer, accompanied by dramatic fluctuations in other parameters (Fig. 2A and F), including a sharp increase in average K-mer coverage and a steep decline in GS, unique length and repeat length. By comparing the K-mer spectra at two critical K-mer sizes (Fig. 2G and H), we found that the heterozygous peak with low coverage was in close proximity to the erroneous K-mer set. As a result, the first peak was excessively adjusted by GenomeScope2 after four rounds of model fitting, leading to an expanded coverage range of candidate error K-mers, with many K-mers from the original heterozygous peak being assigned as errors by the model. This not only caused a significant loss of unique K-mers but also led to the complete exclusion of the real heterozygous peak. Consequently, the coverage of the homozygous peak was incorrectly used by GenomeScope2 as the average K-mer coverage for GS calculation, which accounts for the observed reduction in GS by more than half and the near doubling of average K-mer coverage. The result suggests that longer K-mers can effectively distinguish unique K-mers from repetitive or erroneous ones, providing more accurate predictions closer to the actual assembly size. However, once K exceeds a critical threshold, the GS estimate collapses because the actual heterozygous peak is abandoned.

Robust framework of iterative prediction tolerant of collapse under multi-factor influences

Excessively long K-mers resulted in overlapping coverages between erroneous and heterozygous K-mers, triggering a collapse in GS estimation. This phenomenon was evident in the K-mer spectra, where the heterozygous peak progressively approached the erroneous K-mer set. To mitigate this issue, we reduced the proportion of erroneous K-mers and increased sequencing depth, effectively separating the heterozygous peak from the erroneous K-mer set and delaying the collapse of GS estimation. After further re-correcting the consensus reads of the HiFi dataset using hifiasm [43], we re-evaluated the GS of HG00733 through iterative loop-based predictions (Fig. 3A and Supplementary Table 16). The GS trend converged to a limiting value closer to the actual assembly size, and the curve exhibited increased stationarity (Dickey-Fuller test; P-value < 0.01). Notably, no secondary spike-ebb patterns or dramatic fluctuations were observed in any other items (Fig. 3B and Supplementary Fig. 8), even at a K value of 1000. The 417-mer and 427-mer spectra displayed more distinct bimodal distributions (Fig. 3C and D) compared to those generated from consensus-only (termed “uncorrected” in this study) HiFi reads (Fig. 2G and H). The K-mers derived from more accurate reads not only reduced erroneous K-mers but also enhanced the coverage of the heterozygous peak (Fig. 3C and D). In addition, we investigated the impact of sequence depth on the collapse of GS predictions, generating low-depth datasets at 30× and 26× by randomly selecting six and five sub-datasets, respectively, from the total of seven datasets (Supplementary Table 1a) for iterative prediction (Fig. 3E). The results revealed that collapse occurred in the low-depth datasets, with the threshold K shifting leftward from 417 to 407. Additionally, the limiting value of GS exhibited a slight deviation from the actual assembly size (Supplementary Table 17).
These findings suggest that increased sequence depth could raise the collapse-tolerance threshold K and bring the limiting value closer to the true GS. Furthermore, we enhanced the robustness of GenomeScope against low K-mer coverage by minimizing the number of model fitting rounds in each single-pass evaluation, which maintained convergent GS evaluation for HG00733 until K exceeded 587, where the coverage of the heterozygous peak became extremely low (Supplementary Fig. 9 and Supplementary Table 18). These results support the notion that using longer K-mers for GS predictions leads to increased accuracy until the K-mer length exceeds the threshold value. However, assigning a single universal threshold for all scenarios is scientifically indefensible, as the optimum K is affected by multiple factors, including sequence errors, sequencing depth and complex genomic features [44]. Our strategy bypasses this challenge by identifying the threshold K value through the analysis of GS trends in a series of loop predictions, ultimately determining the optimal evaluation by searching for the limiting value. Based on this strategy, we developed a custom pipeline named “LVgs” (https://github.com/xingjianfeng100/LVgs) to facilitate GS evaluation through iterative prediction with increasing K-mer lengths. This pipeline integrates the K-mer counter FastK (https://github.com/thegenemyers/FASTK) and the widely used K-mer-based GS estimation tool GenomeScope 2.0 [25], with an optional step for re-correcting HiFi reads.
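
The idea of searching for a limiting value while tolerating collapse can be sketched as follows. This is not the LVgs implementation: instead of a Dickey-Fuller stationarity test, the sketch swaps in a simple relative-change threshold to find the longest stable run of predictions and takes its median. The function name, the tolerance, the minimum run length and the toy series are all assumptions.

```python
from statistics import median


def limiting_value(ks, gs, rel_tol=0.01, min_run=5):
    """Return (k_start, k_end, limit) for the longest stable run of GS predictions.

    A run is broken whenever the relative change between consecutive predictions
    exceeds rel_tol, so both the noisy low-K region and a sudden collapse split
    the series. The limit is the median of the longest run.
    """
    best_lo, best_hi = 0, 0
    start = 0
    for i in range(1, len(gs)):
        if abs(gs[i] - gs[i - 1]) / gs[i - 1] > rel_tol:
            start = i  # a jump starts a new candidate run
        if i - start > best_hi - best_lo:
            best_lo, best_hi = start, i
    if best_hi - best_lo + 1 < min_run:
        raise ValueError("no stable plateau found")
    return ks[best_lo], ks[best_hi], median(gs[best_lo:best_hi + 1])


# Toy series in arbitrary units (made-up, not from the paper): a noisy rise,
# a plateau near 100, then a collapse to about half.
toy_ks = [20, 40, 60, 80, 100, 120, 140, 160, 180, 200, 220, 240]
toy_gs = [80, 88, 94, 97, 99.0, 99.6, 100.1, 100.0, 99.9, 100.2, 50.3, 50.1]
k_lo, k_hi, limit = limiting_value(toy_ks, toy_gs)
print(k_lo, k_hi, limit)
```

Because the collapsed tail forms its own short run, it is ignored as long as the plateau before it is longer, which mirrors the tolerance to collapse described above.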

Performance of LVgs in homozygous diploid organisms

The extent of genome heterozygosity varies among species within a genus, particularly in crops, that are more influenced by artificial domestication. During this process, the heterozygosity of crops often diverges from that of their wild relatives, exhibiting two distinct patterns: (1) an increase in heterozygosity to exploit heterosis, or (2) a decrease in heterozygosity, potentially leading to the development of homozygous inbred lines to stabilize desirable traits [45, 46]. However, LVgs utilized the high-performing evaluator GenomeScope2 for each prediction [25], reporting only half the 1Cx size for homozygous genome [21]. To assess the performance of the LVgs in homozygous genomes, we used the HiFi datasets (Supplementary Table 1b) from the complete androgenetic hydatidiform mole CHM13hTERT cell line (CHM13) [47], to assess the GS of the nearly homozygous human genome (Supplementary Fig. 10 and Supplementary Table 19). Our strategy consistently adhered to the standard that the limiting value approximates the true Telomere-to-Telomere (T2T) assembly size (Fig. 4A). Despite the extremely low‌ heterozygosity of CHM13, which only includes a few thousand heterozygous variants and a megabase-scale heterozygous deletion within the rDNA array on chromosome 15 [47], GenomeScope2 sensitively detected the heterozygous peak in most K values (15–45, 135–250, Supplementary Table 19). Nevertheless, collapses persisted in the loop predictions, notably occurring in the GS curve before the convergence at the limiting value, particularly at low K level. Similar to the pattern seen in the heterozygous HG00733 genome, the observed collapse to the half ‌ of the limiting value in homozygous CHM13 was further supported by corresponding changes in other auxiliary metrics (Supplementary Fig. 10B and Supplementary Fig. 11). 
Notably, these included a doubling of the inferred average K-mer coverage, a concurrent decline in both the inferred unique length and repeat length, and an increase in inferred read error rates. These indications led us to speculate that the loss of the heterozygous peak also contributed to this collapse. Further validation through K-mer spectrum analysis at four critical K values (Supplementary Fig. 10C-F) supported this speculation. As the K-mer length increased, the decrease in coverage further diluted the originally weak peak. When K exceeded 45, the remaining heterozygous K-mers were discarded as error K-mers by the GenomeScope 2.0 model, leading to the loss of the heterozygous peak. Only when K exceeded 135 did longer K-mers generate more heterozygous K-mers. At this point, the impact of the heterozygous K-mers on coverage outweighed the dilution effect of error K-mers, enabling accurate GS prediction. To address the collapse of the heterozygous HG00733 genome, we first manually specified the K-mer coverage at 29, yielding asymptotically convergent evaluations (Fig. 4A and F, Supplementary Table 20). In addition, re-corrected HiFi reads were used instead of consensus reads in each prediction, effectively reducing error K-mers (Fig. 4G and L, Supplementary Fig. 12 and Supplementary Table 21). This improvement in read accuracy also proved effective for the homozygous genome. Comparisons between K-mer spectra derived from consensus HiFi reads (Fig. 4C and F) and those from corrected HiFi reads (Fig. 4I and L) illustrated that reducing error K-mers enhanced sensitivity for detecting extremely weak heterozygous peaks.
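The dilution of peak coverage with increasing K can be made concrete with the textbook approximation for error-free K-mer coverage. The sketch below assumes a uniform read length and independent per-base errors. The base coverage, read length and error rate are hypothetical values, not figures from this study.

```python
def expected_kmer_coverage(base_cov: float, read_len: int, k: int, err: float = 0.0) -> float:
    """Approximate coverage of error-free K-mers.

    base_cov * (L - K + 1) / L reflects that a read of length L yields
    L - K + 1 K-mers; (1 - err) ** K is the chance that all K bases of
    a K-mer are free of sequencing errors.
    """
    if k > read_len:
        return 0.0
    return base_cov * (read_len - k + 1) / read_len * (1.0 - err) ** k


# Hypothetical HiFi-like settings: 30x base coverage, 15 kb reads, 0.1% error.
for k in (21, 135, 587):
    cov = expected_kmer_coverage(30.0, 15_000, k, err=0.001)
    print(f"K={k}: expected error-free K-mer coverage ~ {cov:.1f}x")
```

Under these assumptions the error term dominates for long reads, which is consistent with read re-correction helping most at large K.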

Given the barely detectable heterozygous peak, the approach for handling homozygous or low-heterozygosity genomes should be adjusted relative to that for other diploids. Furthermore, upon reviewing the code of GenomeScope 2.0, we found that the equation for the final GS calculation was [inline equation, not preserved] (https://github.com/tbenavi1/genomescope2.0/blob/master/R/report_results.R, line 250), where “p” represented ploidy. Consequently, the loss of the heterozygous peak, which was replaced by the homozygous peak in the diploid GS prediction, reduced the GS to exactly half of the accurate value. To test this hypothesis, we treated CHM13 as monoploid by setting the parameter “-p 1” in GenomeScope 2.0 for each prediction. This adjustment allowed effective identification of the homozygous peak across all K values. As a result, although the average K-mer coverage remained twice that of the heterozygous peak, the GS predictions converged to a limiting value close to the T2T assembly size, without any collapses (Fig. 4M and R, Supplementary Fig. 13 and Supplementary Table 22). However, the heterozygous peaks were entirely overlooked, and some heterozygous K-mers were likely misclassified as error K-mers (Fig. 4O and R), leading to a reduction in the total number of unique K-mers. Consequently, the deviation (70–90 Mb) of this limiting value from those of the previous two treatments (Fig. 4A and G) was unsatisfactory. Similarly, the deviation (~100 Mb) between the limiting value and the actual T2T assembly size was also concerning (Fig. 4M). Additionally, two HiFi diploid datasets (Supplementary Table 1c and 1d) from homozygous plants, Arabidopsis thaliana Col-0 [48] and Oryza sativa ssp. japonica cv. Nipponbare [49], were used to evaluate the GS using this homozygote-specific method. In both cases, the method successfully avoided collapse and achieved reasonable limiting values (Supplementary Fig. 14I-14P, Supplementary Fig. 15I-15P, Supplementary Tables 24 and 26).
Although this monoploid-faking method appears effective at preventing collapse in homozygous or low-heterozygosity genomes, it does not meet the growing need to study organisms with unknown genomic backgrounds, and determining the heterozygosity threshold below which the method is suitable remains challenging. Fortunately, our strategy of wholescale-K predictions has demonstrated robust capability in GS evaluation for diploids with even slight heterozygosity, as evidenced by its application to human (Fig. 4A and L), A. thaliana (Supplementary Fig. 14A-14H and Supplementary Table 23) and O. sativa (Supplementary Fig. 15A-15H and Supplementary Table 25). In fact, organisms with complete homozygosity are rare in nature; thus, LVgs can serve as a reliable tool for diploids, whether their genomes are homozygous or heterozygous.
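The halving of GS when the homozygous peak is mistaken for the heterozygous peak, and its recovery under “-p 1”, can be reproduced with simple arithmetic. The sketch below assumes a simplified form of the calculation, namely non-error K-mers divided by the product of ploidy and fitted monoploid coverage. This form is an assumption for illustration and should be checked against the GenomeScope 2.0 source. The genome size and coverage values are hypothetical.

```python
def genome_size(total_kmers: float, error_kmers: float, ploidy: int, kcov: float) -> float:
    """Assumed simplified form: (total - error K-mers) / (ploidy * fitted coverage)."""
    return (total_kmers - error_kmers) / (ploidy * kcov)


true_gs = 3.0e9   # hypothetical haploid genome size (bp)
lam = 15.0        # hypothetical coverage of the heterozygous (per-haplotype) peak

# A diploid genome sampled at 2 * lam per haploid position, errors ignored.
total = 2 * lam * true_gs

# Heterozygous peak fitted correctly: coverage = lam, ploidy = 2.
correct = genome_size(total, 0.0, ploidy=2, kcov=lam)

# Heterozygous peak lost: the homozygous peak (2 * lam) is fitted as the
# monoploid coverage while ploidy stays 2, so the estimate halves.
collapsed = genome_size(total, 0.0, ploidy=2, kcov=2 * lam)

# Monoploid setting (-p 1): the doubled coverage is offset by ploidy = 1.
monoploid = genome_size(total, 0.0, ploidy=1, kcov=2 * lam)

print(correct, collapsed, monoploid)
```

The same arithmetic explains why the doubled inferred coverage and the halved GS appear together in the auxiliary metrics.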

Performance of LVgs for complex genomes

The genomes of non-model species with higher ploidy or large sizes are increasingly becoming a focus in molecular genetics research, offering new insight into species evolution, variation mechanisms and inheritance patterns [22, 50–58]. However, a significant portion of these genomes remains undeciphered [59, 60]. Thus, we also assessed the robustness of LVgs in surveying complex genomes. We specifically evaluated the GSs of three Allium species (Fig. 5A and C, Supplementary Fig. 16–18, Supplementary Tables 27–29), each exceeding 10 Gb [22], using uncorrected or re-corrected HiFi reads (Supplementary Table 1e-1g). The predicted GS values from LVgs were compared with those reported in previous studies [22, 61–63]. The deviations from assembly size ranged from 0.8 to 3%, while those from flow cytometry results ranged from 1 to 6% (Table 1). In contrast, larger deviations were observed with other K-mer-based methods. For the same evaluator, GenomeScope 2.0, the short read length of Illumina datasets precluded the use of large K-mers, thereby limiting prediction accuracy (Table 1). These results not only demonstrate the reliability of LVgs for large genomes but also highlight its superiority over other strategies. Notably, GenomeScope 2.0, as implemented in LVgs, was designed for estimating GS in polyploid organisms, and its performance has been validated with both simulated and real datasets [25]. This further strengthens our confidence in the applicability of LVgs to polyploid genomes. However, the reliability of LVgs for autopolyploids had yet to be verified. Hence, HiFi datasets of autotriploid banana [57] (Musa acuminata) and autotetraploid potato [58] (Solanum tuberosum) were collected (Supplementary Table 1h-1i) to assess the performance of LVgs. The deviations of the 1C values from assembly size were 4% and 6%, respectively (Fig. 5D and E, Supplementary Fig. 19–20, Supplementary Tables 30–31), demonstrating that LVgs can accurately estimate the GS of autopolyploids. Additionally, the uneven chromosome numbers among haplotype sets in polyploids may pose challenges to conventional GS prediction methods, which rely on average K-mer coverage derived from peaks in bell-shaped K-mer spectra. We evaluated the ability of LVgs to handle these complexities using the HiFi dataset (Supplementary Table 1j) of Agapanthus africanus (2n = 4x = 30), a complex allotetraploid with asymmetrical chromosome numbers between subgenomes [22]. LVgs estimated the average 1Cx value of A. africanus to be 5,172 Mb (Supplementary Fig. 21, Supplementary Table 32), with a deviation of only 1% from the assembly size (Fig. 5F). By multiplying the 1Cx value by 2, the GS was converted to the 1C value and compared with results from other strategies (Table 1). The LVgs estimate for A. africanus aligned more closely with the published assembly size of the high-quality genome [22] than did the flow cytometry result, highlighting its superiority over other K-mer-based methods. In conclusion, LVgs has demonstrated reliability in handling complex genomes, and the example of Allium species further illustrates its suitability for effective GS evaluation at the genus level.
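The 1Cx-to-1C conversion and the percentage deviations quoted above are simple arithmetic, sketched below. The helper names are hypothetical. Only the 1Cx value of 5,172 Mb and the factor of 2 come from the text, and the reference value in the deviation example is a toy number rather than a published assembly size.

```python
def one_c_from_one_cx(one_cx_mb: float, monoploid_sets_per_gamete: int) -> float:
    """Convert a monoploid (1Cx) size to the holoploid (1C) size."""
    return one_cx_mb * monoploid_sets_per_gamete


def percent_deviation(estimate: float, reference: float) -> float:
    """Absolute deviation of an estimate from a reference, in percent."""
    return abs(estimate - reference) / reference * 100.0


# A tetraploid (2n = 4x) gamete carries two monoploid sets, hence the factor 2.
one_c_mb = one_c_from_one_cx(5172.0, 2)
print(f"1C estimate: {one_c_mb:.0f} Mb")

# Toy illustration of the deviation metric (values are not from the study).
print(f"deviation: {percent_deviation(101.0, 100.0):.1f}%")
```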

Discussion

K-mer-based methods leverage high-throughput sequencing for more convenient estimation of GS [21]. However, the complex genomic characteristics of non-model organisms, such as a high proportion of repeat sequences, huge genome sizes and polyploidy, pose significant challenges to accurate and reliable GS estimation using K-mer-based approaches [22]. As interest in genus-level genome research has increased in recent years [14], there is a growing demand for a convenient K-mer-based tool specifically designed for GS estimation in these studies. Our strategy, employing two remarkable tools, FastK and GenomeScope 2.0, established a closed-loop GS prediction pipeline that incrementally increases K-mer length using HiFi reads. By comparing K-mer spectra at low K values between species that have and have not experienced WGD, we identified a connection between the frequency distribution of short K-mers and WGD events. Additionally, the differences in K-mer spectra between low and high K values highlight a trade-off in K-value selection: lower K values enhance sensitivity to genomic repeats, while higher K values improve detection of heterozygosity. These results demonstrated that K-mer spectra are influenced by both the K value and genomic features, such as repeats and heterozygosity, suggesting that reliance on a single prediction may lack robustness. Moreover, we found that a single prediction could underestimate the 1Cx value by approximately half in both diploid heterozygotes and homozygotes, resulting in obvious collapse segments in the otherwise stationary GS trend curve. Through loop prediction analyses using the generated K-mer spectra, predicted GS values and other auxiliary indexes, we deduced that the evaluation collapse was primarily caused by misfitted K values and the loss of the heterozygous K-mer peaks. However, defining an appropriate K range proved challenging owing to various factors, including heterozygosity, sequencing accuracy and sequencing depth.
For homozygous or low-heterozygosity genomes, we proposed the monoploid-faking method, which effectively facilitates the adoption of single prediction. However, this method is generally inapplicable in practice, as heterozygosity, a fundamental genomic feature, is typically unknown beforehand. Determining the effective low-heterozygosity threshold for the monoploid-faking method was also challenging, further limiting its practical application. Therefore, based on the robust performance demonstrated in diploid and polyploid organisms, particularly in the practical example of Allium species, we recommend employing loop estimations instead of single predictions for GS evaluation of genus-level species.

K-mer-based algorithms are widely applied in various genome research areas [20], including haplotype-resolved assembly, subgenome phasing and polyploid identification. K-mers amplify the signal of genomic heterozygosity. For example, a human genome with an estimated single-nucleotide heterozygosity of only ~0.1% can produce nearly 2% haplotype-specific 21-mers [65], which have proven effective as haplotype markers for constructing phased assembly graphs [65, 66]. In this study, our analysis of loop-plotted K-mer spectra at varying K values revealed that increased K-mer length enhances the algorithm’s sensitivity to genomic heterozygosity. Conversely, the sensitivity of K-mers to repetitive sequences has been exploited to phase subgenomes, with unique repetitive features identified as subgenome-defining markers through clustering and principal component analysis (PCA) of differential K-mers in SubPhaser [67]. Our study demonstrated that K-mer sensitivity to repetitive sequences increases with smaller K values. However, to avoid K-mer duplication due to insufficient length, K should generally be greater than or equal to the minimum value calculated using the formula [inline formula, not preserved], where LG represents GS [21, 28]. Moreover, K-mer sensitivity to both repetitive sequences and heterozygosity has been leveraged in Smudgeplot [25] to infer ploidy. WGD is one mechanism leading to polyploid formation [68–72], and our findings indicate that low-K spectra can sensitively capture genomic information related to WGD traces, which appear as repeat features in diploid genomes. Although further evidence is required to confirm this finding, we propose that a K-mer-based approach could be developed to detect WGD.
This would involve establishing the appropriate K range that reflects WGD characteristics, enabling differentiation between intra-set chromosome repeats (indicative of WGD) and inter-set chromosome repeats (associated with polyploidy). Therefore, our findings regarding the trade-off in K-mer genomic feature detection and the connection between low-K level K-mer spectra and WGD could provide valuable guidance for selecting K-mer lengths in various K-mer-based strategies, tailored to specific contexts and research objectives. Moreover, these findings may inspire the development of innovative K-mer based approaches.
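The amplification of heterozygosity by K-mers discussed above follows from a short calculation: a K-mer is haplotype-specific if at least one of its K positions is heterozygous. The sketch below assumes independent, uniformly distributed heterozygous sites, which is a simplification. The function name is hypothetical.

```python
def haplotype_specific_fraction(het_rate: float, k: int) -> float:
    """Fraction of K-mers overlapping at least one heterozygous site."""
    return 1.0 - (1.0 - het_rate) ** k


# With ~0.1% heterozygosity, roughly 2% of 21-mers are haplotype-specific,
# and the fraction grows as K increases.
for k in (21, 51, 101, 251):
    frac = haplotype_specific_fraction(0.001, k)
    print(f"K={k}: {frac * 100:.2f}% haplotype-specific K-mers")
```

Under this model the heterozygous signal grows almost linearly with K while K times the heterozygosity rate is small, which matches the observed trade-off between repeat sensitivity at low K and heterozygosity sensitivity at high K.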

Based on this robust strategy, we developed an open-source pipeline named “LVgs”, which integrates an optional read error corrector, a K-mer counter and a GS predictor. In this study, three specific high-performing programs were combined in a fixed configuration: hifiasm [43] as the read error corrector, FastK (https://github.com/thegenemyers/FASTK) as the K-mer counter and GenomeScope 2.0 [25] as the GS predictor. Through single-pass evaluation optimization via parameter adjustment, LVgs demonstrated inherent resilience against scenario complexity (Fig. 4A-F, Supplementary Fig. S9, S23A-B). However, beyond providing specific tools, we aim to inspire effective strategies that optimize the value search in iterative prediction to enhance genome size estimation. The field of genome research is rapidly evolving, with the continuous emergence of new tools and updates to existing ones to meet diverse demands [25, 43, 73–76]. Advances in the technology ecosystem of genome research have driven the development of tools such as the artificial intelligence-driven predictor ‘PDLLMs’ for genomic functional regions [77], the machine learning-based variant detector ‘DRAGEN’ [78], the sequencing technique-driven assembler ‘hifiasm’ [43] and the quantum hardware-driven assembler ‘VRP’ [79]. We can expect further breakthroughs in the development of research tools. In the ‘LVgs’ framework, various alternatives exist for each component, each with its own advantages and limitations. These include HiCanu [80], PECAT [81] and Lighter [82] as read error correctors for third-generation or next-generation sequencing, Jellyfish [83], KMC [84] and KFC [85] as K-mer counters, and Kmergenie [86], FindGSE [87] and RESPECT [88] as GS predictors. The selection of these tools should be based on factors such as the read data type, genomic characteristics, pipeline compatibility and available computational resources.
Thus, whether focusing on future tool development or evaluating existing algorithms, the LVgs framework should remain flexible and customizable. The specific fixed combination used in this study, “hifiasm-FastK-GenomeScope 2.0”, should be regarded as provisional. The study emphasizes the use of stable-value calculations based on multi-length K-mer GS loop predictions. For instance, during the writing of this article, GenomeScope 2.0 was updated to improve its compatibility with FastK. Upon substituting the old version in LVgs with the new one, we observed improved performance in low-K predictions and limiting values that were closer to the true values (Supplementary Fig. 22). Additionally, we implemented an ancillary pipeline, “KMC-GenomeScope 2.0”, to enhance the K-mer histogram by including higher-count K-mers in each single-pass evaluation, thus improving the accuracy of the final results (Supplementary Fig. 23).

Conclusions

The indeterminacy of GS evaluation, which arises from multiple factors and a sensitivity trade-off in K-mer length, causes inconsistency between the evaluation and the assembly size. Making the best use of HiFi reads, we proposed a wholescale-K loop estimation strategy for accurate GS evaluation with K-mers of increasing length, which effectively solves the value-indeterminacy problem of single predictions. We believe our research offers a robust tool for GS evaluation.

Methods

Data collection

This study collected fourteen datasets, including HiFi reads and Illumina reads of twelve samples sourced from published research. The heterozygous HG00733, with ~104 Gb HiFi reads [41], was used to evaluate the performance of the LVgs strategy in common diploid species. Additionally, three homozygous diploids, human CHM13 (~176 Gb HiFi reads) [47], Arabidopsis thaliana Col-0 (~23 Gb HiFi reads) [48] and Oryza sativa ssp. japonica cv. Nipponbare (~33 Gb HiFi reads) [49], were collected to investigate the limitations of GenomeScope 2.0 in homozygous genomes and to assess LVgs performance in these contexts. Furthermore, three Allium species [22], Allium cepa (GS: 16 Gb) with ~436 Gb HiFi reads, Allium fistulosum (GS: 11 Gb) with ~493 Gb HiFi reads and Allium sativum (GS: 16 Gb) with ~300 Gb HiFi reads, were collected to evaluate LVgs performance in large genomes. Two autopolyploids, triploid banana (Musa acuminata) with ~102 Gb HiFi reads [57] and tetraploid potato (Solanum tuberosum) with ~96 Gb HiFi reads [58], and one allotetraploid with asymmetrical chromosome numbers between subgenomes, Agapanthus africanus, with ~184 Gb HiFi reads [22], were collected to evaluate LVgs performance in polyploids. Two species that have not experienced WGD, Anthoceros agrestis with ~27 Gb Illumina reads [36] and Chloropicon primus with ~8 Gb Illumina reads [37], were included as controls to explore the relationship between K-mer spectra and WGD, and two Illumina datasets of human CHM13 [47] and Arabidopsis thaliana Col-0 [48] were used to further verify the results by eliminating potential biases from sequencing technology differences. Detailed information is summarized in Supplementary Table 1.
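Because sequencing depth influences the collapse-tolerance threshold K, it can be useful to translate the read volumes above into approximate depths. The sketch below does this for the three Allium datasets, whose GS values are stated in the text. It computes raw base depth (read bases divided by GS), not K-mer coverage, and the variable names are hypothetical.

```python
# (HiFi read volume in Gb, reported genome size in Gb), as listed above.
allium_datasets = {
    "Allium cepa": (436.0, 16.0),
    "Allium fistulosum": (493.0, 11.0),
    "Allium sativum": (300.0, 16.0),
}

# Approximate base depth = total read bases / genome size.
depth = {name: reads_gb / gs_gb for name, (reads_gb, gs_gb) in allium_datasets.items()}

for name, d in depth.items():
    print(f"{name}: ~{d:.1f}x")
```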

Hifiasm v.0.19.9 [43] was used to correct the HiFi reads and Fastp v.0.21.0 [89] was employed for quality control and preprocessing of Illumina reads.

LVgs strategy implementation

We developed a custom pipeline (Supplementary Fig. 24), LVgs, to achieve more precise and robust genome size (GS) evaluation at the genus level. This pipeline employs a strategy of loop prediction with increasing K-mer lengths and is implemented within a shell environment. In brief, each round of loop prediction proceeds as follows: (1) K-mers are counted and the K-mer profiles are computed using FastK v.1.0.0 (https://github.com/thegenemyers/FASTK). (2) The output files, “*.hist”, are processed by Histex (https://github.com/thegenemyers/FASTK) to produce the ASCII histograms required by GenomeScope 2.0. (3) The K-mer frequency histograms are analyzed using GenomeScope 2.0 [25] for GS prediction. (4) The results are further refined using a custom shell script. A ‘for’ loop executes steps (1–4) iteratively for each K-mer length during the growth phase. Subsequently, a custom R script plots the trend of GS estimates and seven auxiliary indicators: total K-mer counts, numbers of K-mer types, inferred read error rates, inferred average K-mer coverages, inferred unique lengths, inferred repeat lengths and model matching rates. The limiting value was defined as the maximum value within the convergence region of the GS predictions. The standard Dickey-Fuller test [90, 91] was performed to assess the stationarity of the target segment of GS predictions using the R package tseries v.0.10-58 [92]. The convergence of every sub-segment of the GS curve that included the limiting value and spanned no fewer than 5 consecutive K values was assessed by this method, as was that of the reversed sequence of GS predictions. The result with the minimum p-value was finally reported.
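Steps (1) to (3) of one prediction round can be sketched as a command builder. This is an illustration rather than the LVgs script itself, which is written in shell. The flags shown (FastK -k/-T/-N, Histex -G, genomescope2 -i/-o/-k/-p) are assumptions based on the tools' public documentation and should be verified against each tool's help output. The function name, file names and K schedule are hypothetical.

```python
import shlex
from typing import List


def lvgs_round_commands(
    reads: str, k: int, ploidy: int = 2, threads: int = 8, prefix: str = "lvgs"
) -> List[str]:
    """Build the shell commands for one K value (counting, histogram, GS fit)."""
    table = f"{prefix}_k{k}"       # FastK output root, producing <table>.hist
    hist = f"{table}.histo"        # ASCII histogram passed to GenomeScope 2.0
    outdir = f"{table}_gs"         # per-K GenomeScope 2.0 output directory
    return [
        # (1) count K-mers of length k
        f"FastK -k{k} -T{threads} -N{table} {shlex.quote(reads)}",
        # (2) convert the binary histogram to a GenomeScope-style text histogram
        f"Histex -G {table} > {hist}",
        # (3) fit the model and predict GS for this K
        f"genomescope2 -i {hist} -o {outdir} -k {k} -p {ploidy}",
    ]


# Hypothetical K schedule; the commands are printed here, not executed.
for k in range(21, 62, 20):
    for cmd in lvgs_round_commands("hifi_reads.fastq.gz", k):
        print(cmd)
```

Building the commands as strings keeps the sketch side-effect free. In practice each command would be run per K, and the per-K GS estimates collected for the trend analysis.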

Detection and location of collinearity blocks in WGD genomes

JCVI toolkit v.1.1.23 [93] was employed to identify conserved synteny blocks within the human CHM13 genome and A. thaliana Col-0 (version: Col-XJTU) through intra-genomic chromosome synteny analysis. In brief, the ‘compara.catalog’ module was used to identify intra-genomic orthologs by calling an all-against-all program [94]. Then, synteny blocks were extracted using the ‘compara.synteny’ module. Furthermore, intra-genomic macrosynteny was visualized using the ‘graphics.karyotype’ module.

Finally, the collinearity ‘simple’ files, which document the start and stop genes for each collinearity block, were used to acquire the genome sequences of these blocks through the custom script.
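The step from ‘simple’ files to block sequences can be sketched as follows. The sketch assumes that each ‘simple’ line lists the start and end genes of the two paired regions in its first four whitespace-separated columns, and that gene coordinates are available as 0-based, half-open intervals as in BED. The function names and toy data are hypothetical, and this is not the authors' custom script.

```python
from typing import Dict, Iterable, List, Tuple

Gene = Tuple[str, int, int]        # (chromosome, start, end)
Interval = Tuple[str, int, int]    # (chromosome, start, end)


def block_intervals(simple_lines: Iterable[str], genes: Dict[str, Gene]) -> List[Interval]:
    """Turn start/end gene pairs into genomic intervals for both block sides."""
    intervals: List[Interval] = []
    for line in simple_lines:
        fields = line.split()
        if line.startswith("#") or len(fields) < 4:
            continue  # skip comments and malformed lines
        for first, last in ((fields[0], fields[1]), (fields[2], fields[3])):
            chrom_a, start_a, end_a = genes[first]
            chrom_b, start_b, end_b = genes[last]
            if chrom_a != chrom_b:
                raise ValueError(f"block spans two chromosomes: {first}, {last}")
            # Cover both boundary genes regardless of block orientation.
            intervals.append((chrom_a, min(start_a, start_b), max(end_a, end_b)))
    return intervals


def extract(genome: Dict[str, str], interval: Interval) -> str:
    """Slice the block sequence out of an in-memory genome dictionary."""
    chrom, start, end = interval
    return genome[chrom][start:end]


toy_genes = {
    "g1": ("chr1", 100, 200), "g5": ("chr1", 900, 1000),
    "g20": ("chr2", 50, 150), "g24": ("chr2", 700, 800),
}
toy_blocks = block_intervals(["g1\tg5\tg20\tg24\t5\t+"], toy_genes)
print(toy_blocks)
```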

Frequency survey of K-mers from collinearity blocks

The FastK suite was used to identify all K-mers of the collinearity blocks and HiFi reads and to calculate the K-mer counts of the HiFi reads. To compare the histograms with those of all K-mers from HiFi reads, we retrieved the counts in HiFi reads of the K-mers shared with collinearity blocks.
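The retrieval of read-derived counts for K-mers shared with collinearity blocks can be illustrated with a small in-memory sketch. It uses plain Python counters on toy sequences rather than the FastK suite, treats a K-mer and its reverse complement as the same (canonical) K-mer, and uses hypothetical function names.

```python
from collections import Counter
from typing import Iterable

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a K-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def count_kmers(seqs: Iterable[str], k: int) -> Counter:
    """Count canonical K-mers, skipping any window with non-ACGT characters."""
    counts: Counter = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return counts


def shared_count_histogram(read_counts: Counter, block_kmers: Iterable[str]) -> Counter:
    """Histogram of read-derived counts restricted to K-mers found in the blocks."""
    hist: Counter = Counter()
    for kmer in block_kmers:
        c = read_counts.get(kmer, 0)
        if c:
            hist[c] += 1  # one more distinct shared K-mer seen c times in reads
    return hist


toy_reads = ["ACGTACGT", "ACGTTT"]
toy_block = ["ACGTA"]
read_counts = count_kmers(toy_reads, 4)
block_hist = shared_count_histogram(read_counts, count_kmers(toy_block, 4).keys())
print(sorted(block_hist.items()))
```

The resulting histogram can then be compared with the histogram of all read K-mers, which is the comparison described above.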

Supplementary Information

Supplementary Material 1 (827.5KB, xlsx)
Supplementary Material 2 (113.3MB, docx)

Acknowledgements

Not applicable.

Abbreviations

GS

Genome size

TE

Transposable elements

WGD

Whole genome duplication

NGS

Next generation sequencing

TGS

Third-generation sequencing

HiFi

High fidelity

CHM13

Complete androgenetic hydatidiform mole CHM13hTERT cell line

T2T

Telomere-to-telomere

Authors’ contributions

K.L. and S.X. conceived and designed this project. J.X. and S.X. conceived, designed, and implemented the LVgs pipeline. J.X. and J.H. collected and curated the published data. J.X. and J.H. executed the pipeline and analyzed the performance of LVgs. J.X. wrote the manuscript. C.T., K.L. and S.X. reviewed the manuscript. All authors have read and approved the final version of this paper.

Funding

This work was partially supported by grants from the National Key R&D Program of China (2021YFA0909600), National Natural Science Foundation of China (31825007;32160383), Innovational Fund for Scientific and Technological Personnel of Hainan Province (KJRC2023B20), the earmarked fund for Tropical High-efficiency Agricultural Industry Technology System of Hainan University (THAITS-3) and Major (Key) Science and Technology Projects in Jinhua City (2022-2-023).

Data availability

LVgs is deposited in GitHub at https://github.com/xingjianfeng100/LVgs. The GitHub repository is released under an open-source MIT License. LVgs is a lightweight, compile-free tool that combines shell scripting and R programming. The GitHub repository contains comprehensive examples for running LVgs. The details of the published sequencing data used in this study are reported in Supplementary Table 1. The code and parameter settings relevant to the analyses in this study are available in “Supplementary_code.md”, which is a plain-text document formatted using Markdown. All supporting documents generated directly from FastK and GenomeScope 2.0 are publicly available at 10.5281/zenodo.16032923.

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Shangqian Xie, Email: sqianxie@gmail.com.

Kaiye Liu, Email: kaiyeliu@hainanu.edu.cn.

References

  • 1. Blommaert J. Genome size evolution: towards new model systems for old questions. Proc Biol Sci. 2020;287(1933):20201441.
  • 2. Elliott TA, Gregory TR. What’s in a genome? The c-value enigma and the evolution of eukaryotic genome content. Philos Trans R Soc Lond B Biol Sci. 2015;370(1678):20140331.
  • 3. Lefebure T, Morvan C, Malard F, Francois C, Konecny-Dupre L, Gueguen L, Weiss-Gayet M, Seguin-Orlando A, Ermini L, Sarkissian C, et al. Less effective selection leads to larger genomes. Genome Res. 2017;27(6):1016–28.
  • 4. Gregory TR. Synergy between sequence and size in large-scale genomics. Nat Rev Genet. 2005;6(9):699–708.
  • 5. Pellicer J, Hidalgo O, Dodsworth S, Leitch IJ. Genome size diversity and its impact on the evolution of land plants. Genes. 2018;9(2):88. 10.3390/genes9020088
  • 6. Hidalgo O, Pellicer J, Christenhusz M, Schneider H, Leitch AR, Leitch IJ. Is there an upper limit to genome size? Trends Plant Sci. 2017;22(7):567–73.
  • 7. Yin D, Schwarz EM, Thomas CG, Felde RL, Korf IF, Cutter AD, Schartner CM, Ralston EJ, Meyer BJ, Haag ES. Rapid genome shrinkage in a self-fertile nematode reveals sperm competition proteins. Science. 2018;359(6371):55–61.
  • 8. Adams PE, Eggers VK, Millwood JD, Sutton JM, Pienaar J, Fierst JL. Genome size changes by duplication, divergence, and insertion in Caenorhabditis worms. Mol Biol Evol. 2023;40(3):msad039. 10.1093/molbev/msad039
  • 9. Vitales D, Álvarez I, Garcia S, Hidalgo O, Nieto Feliner G, Pellicer J, Vallès J, Garnatje T. Genome size variation at constant chromosome number is not correlated with repetitive DNA dynamism in Anacyclus (Asteraceae). Ann Bot. 2019;125(4):611–23.
  • 10. Agudo AB, Torices R, Loureiro J, Castro S, Castro M, Alvarez I. Genome size variation in a hybridizing diploid species complex in (Asteraceae: Anthemideae). Int J Plant Sci. 2019;180(5):374–85.
  • 11. Stein JC, Yu Y, Copetti D, Zwickl DJ, Zhang L, Zhang C, Chougule K, Gao D, Iwata A, Goicoechea JL, et al. Genomes of 13 domesticated and wild rice relatives highlight genetic conservation, turnover and innovation across the genus Oryza. Nat Genet. 2018;50(2):285–96.
  • 12. Bozan I, Achakkagari SR, Anglin NL, Ellis D, Tai HH, Stromvik MV. Pangenome analyses reveal impact of transposable elements and ploidy on the evolution of potato species. Proc Natl Acad Sci U S A. 2023;120(31):e2211117120.
  • 13. Kress WJ, Soltis DE, Kersey PJ, Wegrzyn JL, Leebens-Mack JH, Gostel MR, Liu X, Soltis PS. Green plant genomes: what we know in an era of rapidly expanding opportunities. Proc Natl Acad Sci U S A. 2022;119(4):e2115640118. 10.1073/pnas.2115640118
  • 14. He W, Li X, Qian Q, Shang L. The developments and prospects of plant super pangenomes: demands, approaches and applications. Plant Commun. 2024;6(2):101230.
  • 15. Gregory TR, Nicol JA, Tamm H, Kullman B, Kullman K, Leitch IJ, Murray BG, Kapraun DF, Greilhuber J, Bennett MD. Eukaryotic genome size databases. Nucleic Acids Res. 2007;35(Database issue):D332-338.
  • 16. Pflug JM, Holmes VR, Burrus C, Johnston JS, Maddison DR. Measuring genome sizes using read-depth, k-mers, and flow cytometry: methodological comparisons in beetles (Coleoptera). G3: Genes|Genomes|Genetics. 2020;10(9):3047–60.
  • 17. Pfenninger M, Schonnenbeck P, Schell T. ModEst: accurate estimation of genome size from next generation sequencing data. Mol Ecol Resour. 2022;22(4):1454–64.
  • 18. Guenzi-Tiberi P, Istace B, Alsos IG, Coissac E, Lavergne S, Aury JM, Denoeud F. LocoGSE, a sequence-based genome size estimator for plants. Front Plant Sci. 2024;15:1328966.
  • 19. Natarajan S, Gehrke J, Pucker B. Mapping-based genome size estimation. BMC Genomics. 2025;26(1):482.
  • 20. Moeckel C, Mareboina M, Konnaris MA, Chan CSY, Mouratidis I, Montgomery A, Chantzi N, Pavlopoulos GA, Georgakopoulos-Soares I. A survey of k-mer methods and applications in bioinformatics. Comput Struct Biotechnol J. 2024;23:2289–303.
  • 21. Hesse U. K-mer-based genome size estimation in theory and practice. Methods Mol Biol. 2023;2672:79–113.
  • 22. Hao F, Liu X, Zhou BT, Tian ZZ, Zhou LN, Zong H, Qi JY, He J, Zhang YT, Zeng P, et al. Chromosome-level genomes of three key Allium crops and their trait evolution. Nat Genet. 2023;55:1976-1986. 10.1038/s41588-023-01546-0
  • 23. Goodwin S, McPherson JD, McCombie WR. Coming of age: ten years of next-generation sequencing technologies. Nat Rev Genet. 2016;17(6):333–51.
  • 24. Wenger AM, Peluso P, Rowell WJ, Chang PC, Hall RJ, Concepcion GT, Ebler J, Fungtammasan A, Kolesnikov A, Olson ND, et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol. 2019;37(10):1155–62.
  • 25. Ranallo-Benavidez TR, Jaron KS, Schatz MC. GenomeScope 2.0 and smudgeplot for reference-free profiling of polyploid genomes. Nat Commun. 2020;11:1432. 10.1038/s41467-020-14998-3
  • 26. Scarano C, Veneruso I, De Simone RR, Di Bonito G, Secondino A, D’Argenio V. The third-generation sequencing challenge: novel insights for the omic sciences. Biomolecules. 2024;14(5):568. 10.3390/biom14050568
  • 27. Espinosa E, Bautista R, Larrosa R, Plata O. Advancements in long-read genome sequencing technologies and algorithms. Genomics. 2024;116(3):110842.
  • 28. Zhao Z, Ng YK, Fang X, Li S. Eliminating heterozygosity from reads through coverage normalization. 2016 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 2016:174–177. 10.1109/BIBM.2016.7822514
  • 29. Sun J, Zhang YF, Wang MH, Guan Q, Yang XJ, Ou JX, Yan MC, Wang CR, Zhang Y, Li ZH, et al. The biological significance of multi-copy regions and their impact on variant discovery. Genomics Proteomics Bioinformatics. 2020;18(5):516–24.
  • 30. Makino T, McLysaght A. Ohnologs in the human genome are dosage balanced and frequently associated with disease. Proc Natl Acad Sci U S A. 2010;107(20):9270–4.
  • 31. Nakatani Y, Takeda H, Kohara Y, Morishita S. Reconstruction of the vertebrate ancestral genome reveals dynamic genome reorganization in early vertebrates. Genome Res. 2007;17(9):1254–65.
  • 32. Dehal P, Boore JL. Two rounds of whole genome duplication in the ancestral vertebrate. PLoS Biol. 2005;3(10):e314.
  • 33. McLysaght A, Hokamp K, Wolfe KH. Extensive genomic duplication during early chordate evolution. Nat Genet. 2002;31(2):200–4.
  • 34. Bowers JE, Chapman BA, Rong J, Paterson AH. Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature. 2003;422(6930):433–8.
  • 35. Qiao X, Zhang SL, Paterson AH. Pervasive genome duplications across the plant tree of life and their links to major evolutionary innovations and transitions. Comput Struct Biotechnol J. 2022;20:3248–56.
  • 36. Li FW, Nishiyama T, Waller M, Frangedakis E, Keller J, Li Z, Fernandez-Pozo N, Barker MS, Bennett T, Blazquez MA, et al. Anthoceros genomes illuminate the origin of land plants and the unique biology of hornworts. Nat Plants. 2020;6(3):259–72.
  • 37. Lemieux C, Turmel M, Otis C, Pombert JF. A streamlined and predominantly diploid genome in the tiny marine green Alga. Nat Commun. 2019;10(1):4061. 10.1038/s41467-019-12014-x
  • 38. Sun P, Jiao B, Yang Y, Shan L, Li T, Li X, Xi Z, Wang X, Liu J. WGDI: a user-friendly toolkit for evolutionary analyses of whole-genome duplications and ancestral karyotypes. Mol Plant. 2022;15(12):1841–51.
  • 39. Rabier CE, Ta T, Ane C. Detecting and locating whole genome duplications on a phylogeny: a probabilistic approach. Mol Biol Evol. 2014;31(3):750–62.
  • 40. Kellis M, Birren BW, Lander ES. Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature. 2004;428(6983):617–24.
  • 41. Vollger MR, Guitart X, Dishuck PC, Mercuri L, Harvey WT, Gershman A, Diekhans M, Sulovari A, Munson KM, Lewis AP, et al. Segmental duplications and their variation in a complete human genome. Science. 2022;376(6588):55–.
  • 42. Cheng H, Jarvis ED, Fedrigo O, Koepfli KP, Urban L, Gemmell NJ, Li H. Haplotype-resolved assembly of diploid genomes without parental data. Nat Biotechnol. 2022;40(9):1332–5.
  • 43. Cheng H, Asri M, Lucas J, Koren S, Li H. Scalable telomere-to-telomere assembly for diploid and polyploid genomes with double graph. Nat Methods. 2024;21(6):967–70.
  • 44. Chor B, Horn D, Goldman N, Levy Y, Massingham T. Genomic DNA k-mer spectra: models and modalities. Genome Biol. 2009;10(10):R108.
  • 45. Cheng L, Wang N, Bao Z, Zhou Q, Guarracino A, Yang Y, Wang P, Zhang Z, Tang D, Zhang P, et al. Leveraging a phased pangenome for haplotype design of hybrid potato. Nature. 2025;640:408-417. 10.1038/s41586-024-08476-9
  • 46. Hardigan MA, Laimbeer FPE, Newton L, Crisovan E, Hamilton JP, Vaillancourt B, Wiegert-Rininger K, Wood JC, Douches DS, Farre EM, et al. Genome diversity of tuber-bearing Solanum uncovers complex evolutionary history and targets of domestication in the cultivated potato. Proc Natl Acad Sci U S A. 2017;114(46):E9999-10008.
  • 47.Nurk S, Koren S, Rhie A, Rautiainen M, Bzikadze AV, Mikheenko A, Vollger MR, Altemose N, Uralsky L, Gershman A, et al. The complete sequence of a human genome. Science. 2022;376(6588):44-53. 10.1126/science.abj6987 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Wang B, Yang X, Jia Y, Xu Y, Jia P, Dang N, Wang S, Xu T, Zhao X, Gao S, et al. High-quality Arabidopsis thaliana genome assembly with nanopore and HiFi long reads. Genomics Proteomics Bioinformatics. 2022;20(1):4–13. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Shang LG, He WC, Wang TY, Yang YX, Xu Q, Zhao XJ, Yang LB, Zhang H, Li XX, Lv Y, et al. A complete assembly of the rice Nipponbare reference genome. Mol Plant. 2023;16(8):1232–6. [DOI] [PubMed] [Google Scholar]
  • 50.Xu S, Chen R, Zhang X, Wu Y, Yang L, Sun Z, Zhu Z, Song A, Wu Z, Li T, et al. The evolutionary tale of lilies: giant genomes derived from transposon insertions and polyploidization. Innovation (Camb). 2024;5(6): 100726. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Healey AL, Garsmeur O, Lovell JT, Shengquiang S, Sreedasyam A, Jenkins J, Plott CB, Piperidis N, Pompidor N, Llaca V, et al. The complex polyploid genome architecture of sugarcane. Nature. 2024;628(8009):804–10. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Schartl M, Woltering JM, Irisarri I, Du K, Kneitz S, Pippel M, Brown T, Franchini P, Li J, Li M, et al. The genomes of all lungfish inform on genome expansion and tetrapod evolution. Nature. 2024;624(8032):96-103. 10.1038/s41586-024-07830-1 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Shao C, Sun S, Liu K, Wang J, Li S, Liu Q, Deagle BE, Seim I, Biscontin A, Wang Q, et al. The enormous repetitive Antarctic Krill genome reveals environmental adaptations and population insights. Cell. 2023;186(6):1279–94. e1219. 10.1016/j.cell.2023.02.005 [DOI] [PubMed]
  • 54.Peng Y, Yan H, Guo L, Deng C, Wang C, Wang Y, Kang L, Zhou P, Yu K, Dong X, et al. Reference genome assemblies reveal the origin and evolution of allohexaploid oat. Nat Genet. 2022;54(8):1248–58. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55.Chen W, Yan M, Chen S, Sun J, Wang J, Meng D, Li J, Zhang L, Guo L. The complete genome assembly of Nicotiana benthamiana reveals the genetic and epigenetic landscape of centromeres. Nat Plants. 2024;10(12):1928–43. [DOI] [PubMed] [Google Scholar]
  • 56.Zhang J, Qi Y, Hua X, Wang Y, Wang B, Qi Y, Huang Y, Yu Z, Gao R, Zhang Y, et al. The highly allo-autopolyploid modern sugarcane genome and very recent allopolyploidization in saccharum. Nat Genet. 2025;57:242-253. 10.1038/s41588-024-02033-w [DOI] [PubMed] [Google Scholar]
  • 57.Huang HR, Liu X, Arshad R, Wang X, Li WM, Zhou Y, Ge XJ. Telomere-to-telomere haplotype-resolved reference genome reveals subgenome divergence and disease resistance in triploid Cavendish banana. Hortic Res. 2023;10(9): uhad153. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Bao Z, Li C, Li G, Wang P, Peng Z, Cheng L, Li H, Zhang Z, Li Y, Huang W, et al. Genome architecture and tetrasomic inheritance of autotetraploid potato. Mol Plant. 2022;15(7):1211–26. [DOI] [PubMed] [Google Scholar]
  • 59.Fernandez P, Amice R, Bruy D, Christenhusz MJM, Leitch IJ, Leitch AL, Pokorny L, Hidalgo O, Pellicer J. A 160 Gbp fork fern genome shatters size record for eukaryotes. iScience. 2024;27(6): 109889. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60.Meyers LA, Levin DA. On the abundance of polyploids in flowering plants. Evolution. 2006;60(6):1198–206. [PubMed] [Google Scholar]
  • 61.Reis AC, Franco AL, Campos VR, Souza FR, Zorzatto C, Viccini LF, Sousa SM. rDNA mapping, heterochromatin characterization and AT/GC content of Agapanthus africanus (L.) Hoffmanns (Agapanthaceae). An Acad Bras Cienc. 2016;88(3 Suppl):1727–34. [DOI] [PubMed] [Google Scholar]
  • 62.Ohri D, Fritsch RM, Hanelt P. Evolution of genome size in allium (Alliaceae). Plant Syst Evol. 1998;210(1):57–86. [Google Scholar]
  • 63.Ricroch A, Yockteng R, Brown SC, Nadot S. Evolution of genome size across some cultivated allium species. Genome. 2005;48(3):511–20. [DOI] [PubMed] [Google Scholar]
  • 64.Greilhuber J, Dolezel J, Lysak MA, Bennett MD. The origin, evolution and proposed stabilization of the terms ‘genome size’ and ‘C-value’ to describe nuclear DNA contents. Ann Bot. 2005;95(1):255–260. 10.1093/aob/mci019 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65.Koren S, Rhie A, Walenz BP, Dilthey AT, Bickhart DM, Kingan SB, Hiendleder S, Williams JL, Smith TPL, Phillippy AM. De novo assembly of haplotype-resolved genomes with trio binning. Nat Biotechnol. 2018;36:1174-1182. 10.1038/nbt.4277 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 66.Cheng H, Concepcion GT, Feng X, Zhang H, Li H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods. 2021;18(2):170–5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67.Jia KH, Wang ZX, Wang LX, Li GY, Zhang W, Wang XL, Xu FJ, Jiao SQ, Zhou SS, Liu H, et al. Subphaser: a robust allopolyploid subgenome phasing method based on subgenome-specific k-mers. New Phytol. 2022;235(2):801–9. [DOI] [PubMed] [Google Scholar]
  • 68.Wendel JF. Genome evolution in polyploids. Plant Mol Biol. 2000;42(1):225–49. [PubMed] [Google Scholar]
  • 69.Otto SP, Whitton J. Polyploid incidence and evolution. Annu Rev Genet. 2000;34(1):401–37. [DOI] [PubMed] [Google Scholar]
  • 70.Soltis PS, Soltis DE. The role of hybridization in plant speciation. Annu Rev Plant Biol. 2009;60:561–88. [DOI] [PubMed] [Google Scholar]
  • 71.del Pozo JC, Ramirez-Parra E. Whole genome duplications in plants: an overview from Arabidopsis. J Exp Bot. 2015;66(22):6991–7003. [DOI] [PubMed] [Google Scholar]
  • 72.Eckardt NA. Two genomes are better than one: widespread paleopolyploidy in plants and evolutionary effects. Plant Cell. 2004;16(7):1647– 1649. 10.1105/tpc.160710 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 73.Li H, Durbin R. Genome assembly in the telomere-to-telomere era. Nat Rev Genet. 2024;25(9):658-670. 10.1038/s41576-024-00718-w [DOI] [PubMed] [Google Scholar]
  • 74.Alser M, Rotman J, Deshpande D, Taraszka K, Shi H, Baykal PI, Yang HT, Xue V, Knyazev S, Singer BD, et al. Technology dictates algorithms: recent developments in read alignment. Genome Biol. 2021. 10.1186/s13059-021-02443-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 75.Bates S, Dessimoz C, Nevers Y. OMAnnotator: a novel approach to Building an annotated consensus genome sequence. BioRxiv. 2024;626846. [Google Scholar]
  • 76.Zeng XF, Yi ZL, Zhang XT, Du YH, Li Y, Zhou ZQ, Chen SJ, Zhao HJ, Yang S, Wang YB, et al. Chromosome-level scaffolding of haplotype-resolved assemblies using Hi-C data without reference genomes. Nat Plants. 2024;10: 1184-1200. 10.1038/s41477-024-01755-3 [DOI] [PubMed] [Google Scholar]
  • 77.Liu G, Chen L, Wu Y, Han Y, Bao Y, Zhang T. PDLLMs: A group of tailored DNA large Language models for analyzing plant genomes. Mol Plant. 2024;18(2):175-178 . 10.1016/j.molp.2024.12.006 [DOI] [PubMed]
  • 78.Behera S, Catreux S, Rossi M, Truong S, Huang ZY, Ruehle M, Visvanath A, Parnaby G, Roddey C, Onuchic V, et al. Comprehensive genome analysis and variant detection at scale using DRAGEN. Nat Biotechnol. 2024. 10.1038/s41587-024-02382-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 79.Chen Y, Huang JH, Sun Y, Zhang Y, Li Y, Xu X. Haplotype-resolved assembly of diploid and polyploid genomes using quantum computing. Cell Rep Methods. 2024;4(5): 100754. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 80.Nurk S, Walenz BP, Rhie A, Vollger MR, Logsdon GA, Grothe R, Miga KH, Eichler EE, Phillippy AM, Koren S. HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads. Genome Res. 2020;30(9):1291–305. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 81.Nie F, Ni P, Huang N, Zhang J, Wang Z, Xiao C, Luo F, Wang J. De novo diploid genome assembly using long noisy reads. Nat Commun. 2024;15(1):2964. 10.1038/s41467-024-47349-7 [DOI] [PMC free article] [PubMed]
  • 82.Song L, Florea L, Langmead B. Lighter: fast and memory-efficient sequencing error correction without counting. Genome Biol. 2014;15:509. 10.1186/s13059-014-0509-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 83.Marcais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011;27(6):764–70. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 84.Kokot M, Dlugosz M, Deorowicz S. KMC 3: counting and manipulating k-mer statistics. Bioinformatics. 2017;33(17):2759–2761. 10.1093/bioinformatics/btx304 [DOI] [PubMed] [Google Scholar]
  • 85.Martayan I, Robidou L, Shibuya Y, Limasset A. Hyper-k-mers: efficient streaming k-mers representation. bioRxiv. 2024:2024.2011.2006.620789 . 10.1101/2024.11.06.620789
  • 86.Chikhi R, Medvedev P. Informed and automated k-mer size selection for genome assembly. Bioinformatics. 2014;30(1):31–7. [DOI] [PubMed] [Google Scholar]
  • 87.Sun H, Ding J, Piednoel M, Schneeberger K. FindGSE: estimating genome size variation within human and Arabidopsis using k-mer frequencies. Bioinformatics. 2018;34(4):550–7. [DOI] [PubMed] [Google Scholar]
  • 88.Sarmashghi S, Balaban M, Rachtman E, Touri B, Mirarab S, Bafna V. Estimating repeat spectra and genome length from low-coverage genome skims with RESPECT. PLoS Comput Biol. 2021;17(11): e1009449. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 89.Chen S, Zhou Y, Chen Y, Gu J. Fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018;34(17):i884-90. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 90.Said SE, Dickey DA. Testing for unit roots in autoregressive-moving average models of unknown order. Biometrika. 1984;71(3):599–607. 10.1093/biomet/71.3.599
  • 91.Banerjee A, Dolado JJ, Galbraith JW, Hendry D. Co-integration, error correction, and the econometric analysis of Non-Stationary data. Oxford University Press; 1993.
  • 92.Trapletti A, Hornik K. Tseries: time series analysis and computational finance: R package version 0.10–58: https://CRAN.R-project.org/package=tseries; 2024.
  • 93.Tang H, Krishnakumar V, Zeng X, Xu Z, Taranto A, Lomas JS, Zhang Y, Huang Y, Wang Y, Yim WC, et al. JCVI: a versatile toolkit for comparative genomics analysis. Imeta. 2024;3(4): e211. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 94.Kielbasa SM, Wan R, Sato K, Horton P, Frith MC. Adaptive seeds tame genomic sequence comparison. Genome Res. 2011;21(3):487–93. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data


Supplementary Materials

Supplementary Material 1 (827.5KB, xlsx)
Supplementary Material 2 (113.3MB, docx)

Data Availability Statement

LVgs is deposited on GitHub at https://github.com/xingjianfeng100/LVgs, and the repository is released under the open-source MIT License. LVgs is a lightweight, compile-free tool that combines shell scripting and R programming; the repository contains comprehensive examples for running it. Details of the published sequencing data used in this study are reported in Supplementary Table 1. The code and parameter settings for the analyses in this study are available in “Supplementary_code.md”, a plain-text document formatted in Markdown. All supporting documents generated directly by FastK and GenomeScope 2.0 are publicly available at 10.5281/zenodo.16032923.


Articles from BMC Genomics are provided here courtesy of BMC
