Abstract
Plant cells have two major organelles with their own genomes: chloroplasts and mitochondria. While chloroplast genomes tend to be structurally conserved, the mitochondrial genomes of plants, which are much larger than those of animals, are characterized by complex structural variation. We introduce TIPPo, a user-friendly, reference-free assembly tool that uses PacBio high-fidelity long-read data and that does not rely on genomes from related species or nuclear genome information for the assembly of organellar genomes. TIPPo employs a deep learning model for initial read classification and leverages k-mer counting for further refinement, significantly reducing the impact of nuclear insertions of organellar DNA on the assembly process. We used TIPPo to completely assemble a set of 54 complete chloroplast genomes. No other tool was able to completely assemble this set. TIPPo is comparable with PMAT in assembling mitochondrial genomes from most species but does achieve even higher completeness for several species. We also used the assembled organelle genomes to identify instances of nuclear plastid DNA (NUPTs) and nuclear mitochondrial DNA (NUMTs) insertions. The cumulative length of NUPTs/NUMTs positively correlates with the size of the nuclear genome, suggesting that insertions occur stochastically. NUPTs/NUMTs show predominantly C:G to T:A changes, with the mutated cytosines typically found in CG and CHG contexts, suggesting that degradation of NUPT and NUMT sequences is driven by the known elevated mutation rate of methylated cytosines. Small interfering RNA loci are enriched in NUPTs and NUMTs, consistent with the RdDM pathway mediating DNA methylation in these sequences.
Keywords: chloroplast genome, mitochondrial genome, genome assembly, nuclear insertions of organellar genomes, PacBio HiFi reads
Introduction
In the cells of green plants, DNA is found in three main locations: chloroplasts or chloroplast-related plastids, mitochondria, and the nucleus. The chloroplast is the primary site of photosynthesis, converting solar energy into chemical energy, while mitochondria are crucial for cellular energy metabolism. Chloroplasts and mitochondria are thought to have originated from ancient endosymbiosis events (Zimorski et al. 2014). Due to secondary and tertiary endosymbiosis, chloroplasts or plastids are present across various kingdoms, collectively referred to as photosynthetic eukaryotes (Yoon et al. 2004).
Chloroplast genomes are structurally conserved across species, and they typically comprise four distinct fragments: one large single copy (LSC), one small single copy (SSC), and two inverted repeats (IRs). In contrast, the genomes of mitochondria, present in all eukaryotic organisms except for the microorganism Monocercomonoides sp. (Karnkowska et al. 2016), vary significantly across kingdoms. The structure of bilaterian mitochondrial genomes is conserved, presenting as a single small circular DNA with sizes around 17 kb (Ladoukakis and Zouros 2017). The situation is very different in plants, which have structurally complex mitochondrial genomes with large variation in size, with the largest known mitochondrial genomes reaching up to 11 Mb (Sloan et al. 2012; Putintseva et al. 2020).
Compared with nuclear genomes, much less attention has been paid to the high-quality assembly of organellar genomes. Short-read data are useful, with some caveats, for the assembly of the relatively small and conserved mitochondrial genomes of animals and chloroplast genomes of plants (Dierckxsens et al. 2017; Jin et al. 2020), but their utility is limited for the larger and more complex mitochondrial genomes of plants (Štorchová and Krüger 2024). Long and highly accurate read data have substantially enhanced our ability to assemble nuclear genomes (Wenger et al. 2019; Sereika et al. 2022). With the help of long reads, even highly repetitive regions such as centromeres and telomeres can be assembled (Naish et al. 2021; Nurk et al. 2022; Wlodzimierz et al. 2023), although challenges persist with the assembly of rDNA clusters. Moreover, the typically very high coverage of organellar genomes in data sets of genomic DNA interferes with productive assembly using standard tools, which are optimized for the nuclear genomes (Cheng et al. 2021). In addition, chloroplast and mitochondrial DNA fragments are often transferred to the nucleus, which also interferes with assembly of the true organellar genomes (Uliano-Silva et al. 2023).
Several tools have been developed to enable the specific use of long-read data for organellar genome assembly, primarily focusing on chloroplast genomes, such as Organelle_PBA (Soorni et al. 2017), ptGAUL (Zhou et al. 2023), and CLAW (Phillips et al. 2024). The general approach begins with extracting chloroplast reads from the data set by aligning long reads to the chloroplast genomes of closely related species. This is straightforward and effective for the chloroplast genome, as there are now over 12,000 published chloroplast genomes available, making it almost always possible to find a sufficiently closely related species for successful extraction of chloroplast reads. However, this approach has limitations for mitochondrial genomes, given the much smaller number of available plant mitochondrial genomes (∼500 as of July 2023) and the much lower conservation of mitochondrial genomes, even between closely related species. There have been ongoing efforts to assemble complex mitochondrial genomes. GSAT (He et al. 2022) begins by using short reads to construct the assembly graph and then simplifies it using long reads. However, the assembly graph created from short reads struggles to handle highly repetitive regions, making it challenging to assemble complete genomes. Recently, an alternative approach has been proposed—PMAT (Bi et al. 2024). It begins with downsampling the initial read data set to an estimated coverage of the organellar genomes that is suitable for standard assembly tools. Next, a normal assembly is performed, and then the contigs that appear to belong to organellar genomes are identified based on the presence of conserved protein-coding genes. While useful, this approach may result in incomplete assemblies, especially for species with multichromosomal mitochondrial genomes where some chromosomes lack coding genes (Sanchez-Puerta et al. 2017). 
Clearly, the preferred approach would be a (largely) reference-free tool for organellar genome assembly that has similar power for both chloroplast and mitochondrial genomes.
As stated above, organellar DNA can be transferred to the nucleus, and it is common to find organellar sequences in the nuclear genome (Richly and Leister 2004a,b; Hazkani-Covo et al. 2010; Michalovova et al. 2013; Zhang et al. 2020). These sequences are known as nuclear mitochondrial DNA (NUMTs) and nuclear plastid DNA (NUPTs). The nuclear genome evolves much faster than the mitochondrial genome, typically by an order of magnitude (Wolfe et al. 1987; Drouin et al. 2008). Accordingly, NUPTs and NUMTs tend to diverge from the ancestral organellar genomes quite rapidly. By aligning NUPTs and NUMTs, which should not carry any function, to the corresponding organellar genomes, one can explore presumably neutral processes of sequence change in the integrated organellar DNA (Huang et al. 2005; Rousseau-Gueutin et al. 2011; Yoshida et al. 2014; Fields et al. 2022). Questions of interest are whether NUPTs and NUMTs behave in a similar manner, and how their evolutionary fate compares with that of other large insertions, such as transposons (Wang et al. 2013; Maumus and Quesneville 2014).
We have developed TIPPo, a user-friendly, reference-free assembly tool for plant organellar genomes that integrates TIARA, a deep learning–based approach for organellar DNA classification (Karlicki et al. 2022). This eliminates the need for organellar genomes from closely related species or for nuclear genome information of the target species. We use k-mer information to refine TIARA’s output, distinguishing true organellar reads from NUPTs, NUMTs, and misclassifications caused by repetitive sequences. Using TIPPo, we not only successfully assembled 54 complete chloroplast genomes but also demonstrated superior performance in mitochondrial assembly compared with PMAT, revealing the complex structure of mitochondrial genomes. Additionally, we detailed the insertion patterns of NUPTs and NUMTs and analyzed nucleotide substitutions in NUPTs and NUMTs.
Approach
We designed and implemented TIPPo, a reference-free, user-friendly tool for the assembly of plant organellar genomes from highly accurate PacBio high-fidelity (HiFi) long reads. It begins with a deep learning model to identify candidate organelle reads, then uses a k-mer count approach to filter out the remaining nuclear reads, and finishes with the assembly of the organellar genomes. Figure 1 illustrates the entire workflow.
Fig. 1.
Workflow of TIPPo.
TIPPo classifies reads with TIARA (Karlicki et al. 2022), a deep learning–based approach that follows a two-step process: first, it classifies reads as nuclear or organellar, and then it further categorizes the organellar reads as plastid or mitochondrial. We evaluated the accuracy of TIARA using Arabidopsis thaliana and Oryza sativa (supplementary figs. S1 and S20, Supplementary Material online). As described in the original paper, TIARA classifies NUMTs and NUPTs as organellar reads, and misclassification is also more frequent in highly repeated regions, such as centromeres and rDNA clusters. Hence, further filtering is necessary.
The assumption for subsequent filtering is that true organellar reads are the largest class in the TIARA output, and that misclassifications are relatively rare. We use KMC3 (Kokot et al. 2017) to generate a k-mer (k = 31) count database from the reads identified by TIARA. Next, we perform filtering based on k-mer counts separately for chloroplast and mitochondrial reads. We use readskmercount to obtain the read median k-mer count rmkc, which is used as a representative for each read. Reads labeled as plastid are processed first because chloroplast genomes are more conserved than mitochondrial genomes.
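The per-read median k-mer count can be illustrated with a minimal Python sketch. It is not the readskmercount or KMC3 implementation: the function names are hypothetical, k-mers are counted on the forward strand only with an in-memory counter, and a toy k of 4 replaces k = 31 so that the example stays readable.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Yield every k-mer of seq (forward strand only, a simplification)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_kmer_counts(reads, k):
    """Count k-mers over all reads; a stand-in for a real k-mer count database."""
    counts = Counter()
    for seq in reads:
        counts.update(kmers(seq, k))
    return counts


def read_median_kmer_count(seq, counts, k):
    """Return the median count of the k-mers in one read (rmkc in the text)."""
    values = [counts[km] for km in kmers(seq, k)]
    return median(values) if values else 0


# Toy data: three copies of an abundant read and one rare read.
reads = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAC", "TTTTGGGGCC"]
K = 4  # illustrative only; the text uses k = 31
counts = build_kmer_counts(reads, K)
rmkc = [read_median_kmer_count(r, counts, K) for r in reads]
print(rmkc)
```

In this toy set the abundant read receives a high median count and the rare read a low one, which is the signal that the subsequent filtering relies on.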
After calculating rmkc for all input reads, the median k-mer count mkc of all chloroplast reads (and, after chloroplast assembly, of all mitochondrial reads) is used for filtering. To this end, we set the low k-mer count threshold lkc to 0.3 × mkc, and the high k-mer count threshold hkc to 5 × mkc. A read is removed if more than one-fifth of its k-mer counts are either lower than lkc or higher than hkc. Reads with many k-mer counts below the lkc threshold likely originate from the nucleus, and possibly correspond to NUPTs or NUMTs. Reads with many k-mer counts above the hkc threshold are likely from highly repetitive nuclear regions such as centromeres and rDNA clusters. After filtering, flye (Kolmogorov et al. 2019) is used to assemble the chloroplast genome in the first assembly step. The assembly is performed iteratively with a random selection of reads, until the assembly graph matches the typical chloroplast structure. In each assembly round, only 800 reads are used, which is around 100× coverage, since excessive coverage might negatively affect flye results. Following the assembly with flye, the assembly graph is checked for a typical chloroplast structure, or for a single circular DNA when IRs are set as lost. The structural check aims to match two isomeric chloroplast genomes that coexist equimolarly, differing only in the orientation of the LSC and SSC, as is the case in most land plants and algae (Palmer 1983; Aldrich et al. 1985; Wang and Lanfear 2019). Once this is achieved, the cycle ends with output of two typical heteroplasmic fasta sequences or one circular sequence.
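The threshold rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the TIPPo code: each read is represented only by a list of its k-mer counts, mkc is taken here as the median of the per-read medians, and all names and toy values are hypothetical.

```python
from statistics import median

LOW_FACTOR = 0.3            # lkc = 0.3 x mkc
HIGH_FACTOR = 5.0           # hkc = 5 x mkc
MAX_OUTLIER_FRACTION = 0.2  # remove a read if more than one-fifth of its k-mers are outliers


def keep_read(kmer_counts, mkc):
    """Return True if at most one-fifth of the k-mer counts fall outside [lkc, hkc]."""
    if not kmer_counts:
        return False
    lkc = LOW_FACTOR * mkc
    hkc = HIGH_FACTOR * mkc
    outliers = sum(1 for c in kmer_counts if c < lkc or c > hkc)
    return outliers / len(kmer_counts) <= MAX_OUTLIER_FRACTION


# Toy k-mer count profiles (made-up numbers, for illustration only).
reads = {
    "organelle": [100] * 10,             # uniform organellar depth
    "nuclear_insert": [100] * 5 + [3] * 5,   # half of the k-mers at nuclear depth
    "repeat": [100] * 2 + [5000] * 8,    # dominated by highly repeated k-mers
    "borderline": [100] * 8 + [3] * 2,   # exactly one-fifth outliers: kept
}

# Assumption: mkc is the median of the per-read median k-mer counts.
mkc = median(median(c) for c in reads.values())
kept = sorted(name for name, c in reads.items() if keep_read(c, mkc))
print(mkc, kept)
```

With mkc = 100, the thresholds are lkc = 30 and hkc = 500, so the read with half of its k-mers at nuclear depth and the repeat-dominated read are removed, while the read with exactly one-fifth outliers is retained because the rule requires more than one-fifth.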
The next step is the assembly of the mitochondrial genome. Considering that some chloroplast reads might be misclassified as mitochondrion by TIARA, GraphAligner (Rautiainen and Marschall 2020) is used to align all reads labeled as mitochondrion to the chloroplast assembly graph as a further refinement step. If the read alignment is almost end-to-end (left clip length ≤ 100 bp, right clip length ≤ 100 bp, and identity > 95%), reads are considered as likely originating from the chloroplast and are removed. It is worth noting that mapping reads directly to an organelle assembly graph is the optimal solution for the organellar genome alignment, since linearized circular DNA combined with heteroplasmy will lead to clipped alignments. CLAW (Phillips et al. 2024) also addresses the alignment issues caused by a linearized circular DNA target by joining the two linear DNA sequences. Although this approach avoids clipping alignment, it introduces the issue of mapping quality of zero.
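The near end-to-end criterion above can be expressed as a small predicate. The sketch below assumes a hypothetical `Alignment` record holding the read length, the aligned interval on the read, and the identity as a fraction; it does not parse actual GraphAligner output, and the field names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical summary of one read-to-graph alignment."""
    read_length: int
    read_start: int   # 0-based start of the aligned part on the read
    read_end: int     # end of the aligned part on the read (exclusive)
    identity: float   # fraction of matching bases, 0..1


def is_chloroplast_like(aln, max_clip=100, min_identity=0.95):
    """True if the read aligns almost end-to-end to the chloroplast graph."""
    left_clip = aln.read_start
    right_clip = aln.read_length - aln.read_end
    return (
        left_clip <= max_clip
        and right_clip <= max_clip
        and aln.identity > min_identity
    )


# A read covered almost completely at high identity is treated as chloroplast.
full = Alignment(read_length=15000, read_start=20, read_end=14950, identity=0.999)
# A read with a long unaligned tail is kept as a mitochondrial candidate.
partial = Alignment(read_length=15000, read_start=0, read_end=9000, identity=0.999)
print(is_chloroplast_like(full), is_chloroplast_like(partial))
```

Reads for which the predicate is true would be removed from the mitochondrial read set; all others proceed to the k-mer filter.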
As a final step in TIPPo, the reads remaining after alignment to the chloroplast assembly graph will be processed by readskmercount to exclude reads originating from the nucleus, as described above. Given that the coverage of mitochondrion is generally lower than that of chloroplast, and the genome sizes are usually larger, all finally remaining reads serve directly as input to flye for generating the assembly graph.
Results and Discussion
Chloroplast Genome Assembly
Given the conserved structure of chloroplast genomes, we categorized the assemblies based on the structure on the assembly graph into three classes: (ⅰ) containing only the typical chloroplast genome or one circular DNA (complete genome); (ⅱ) consisting of the complete genome and other sequences; and (ⅲ) incomplete assembly (Fig. 2).
Fig. 2.
Benchmarking of four chloroplast genome assembly tools and genome statistics. See Materials and Methods for phylogenetic tree. The assemblies for Adenosma buchneroides and Helichrysum umbraculigerum are presented here for the first time. Zygnema circumcarinatum, Taxus chinensis, Glycyrrhiza uralensis, and Trifolium repens have lost IRs, and the three topologically defined regions are therefore not measured.
To test the performance of TIPPo, we selected 54 phylogenetically diverse plants and compared the performance with that of ptGAUL and CLAW. Using TIPPo, we successfully assembled all 54 complete chloroplast genomes without any extraneous sequences (supplementary fig. S2, Supplementary Material online) and assembled 48 complete chloroplast genomes at 0.5× nuclear genome coverage (supplementary fig. S23, Supplementary Material online). We obtained two chloroplast genomes (supplementary fig. S2, Supplementary Material online) for Acorus gramineus, suggesting that the sample might contain reads from two genotypes. It was therefore excluded from downstream analysis. ptGAUL assembled 46 complete genomes, produced 6 assemblies containing complete chloroplast genomes along with other sequences, and was unable to assemble 1 species (supplementary fig. S3, Supplementary Material online). CLAW successfully assembled 35 complete chloroplast genomes, produced 14 assemblies that included complete chloroplast genomes as well as other sequences, and did not assemble 4 species (supplementary fig. S4, Supplementary Material online).
Out of 54 species, 4 species have lost IRs, resulting in a single circular structure in the assembly graph (supplementary fig. S2, Supplementary Material online). For the remaining 50 species with IRs, the assembly graph typically consists of 3 nodes: 1 representing the LSC, 1 representing the SSC, and 1 representing the IR, as shown in supplementary fig. S2, Supplementary Material online. The heteroplasmy in chloroplasts is mainly mediated by an IR, so for outputs with typical chloroplast structures in TIPPo, we provide two separate configurations of the chloroplast genome.
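How two configurations arise from a three-node graph can be illustrated with a short Python sketch. It assumes the node sequences for the LSC, IR, and SSC are already available as strings and that the linear representation starts at the LSC; the function names are hypothetical and this is not TIPPo's output routine.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def chloroplast_isomers(lsc, ir, ssc):
    """Build the two isomers that differ only in the relative orientation of the SSC.

    Each isomer traverses LSC -> IR -> SSC -> reverse-complemented IR,
    with the SSC in forward orientation in one and flipped in the other.
    """
    ir_rc = reverse_complement(ir)
    isomer_a = lsc + ir + ssc + ir_rc
    isomer_b = lsc + ir + reverse_complement(ssc) + ir_rc
    return isomer_a, isomer_b


# Toy node sequences standing in for real graph segments.
a, b = chloroplast_isomers(lsc="AAAC", ir="GGT", ssc="CAT")
print(a)
print(b)
```

Both sequences have the same length (LSC + SSC + twice the IR) and the same content outside the SSC, which is why a single IR node in the graph is sufficient to represent both.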
Whole-genome alignments against published chloroplast genomes indicated high consistency between the published and TIPPo assemblies (supplementary fig. S6, Supplementary Material online). Typical chloroplast genomes have three distinct regions: SSC, LSC, and IR. Public chloroplast genomes are typically presented as linearized circular DNA sequences. Thus, in whole-genome alignments, the single IR from the assembly graph aligned to two regions of the linear representations, with one forward and one reverse orientation (supplementary fig. S6, Supplementary Material online). We also assembled two previously unpublished chloroplast genomes, Adenosma buchneroides (153,640 bp; supplementary fig. S7, Supplementary Material online) and Helichrysum umbraculigerum (154,011 bp; supplementary fig. S8, Supplementary Material online). Comparing the chloroplast genome lengths across 53 species, we observed that those from green algae are larger than those from terrestrial plants, with terrestrial plant chloroplast genomes generally around 150 kb (Fig. 2; supplementary table S4, Supplementary Material online). The base consensus approach was also applied to A. thaliana and Silene conica, resulting in only a 1 and a 2 bp insertion, respectively, compared with the published reference genomes, both occurring in homopolymer regions (supplementary fig. S21, Supplementary Material online). These could be true minor differences between the exact germplasm used, or due to assembly errors in either the published genomes or in our assemblies.
Mitochondrial Genome Assembly
Of the other tools, only PMAT also assembles mitochondrial genomes, and we therefore compared the ability of TIPPo to assemble mitochondrial genomes with that of PMAT. Given that PMAT-assembled genomes often contain sequences from both organelles, we aligned distinct parts of the mitochondrial assembly graph from both PMAT and TIPPo against the chloroplast genomes assembled by TIPPo. For PMAT, mitochondrial genome assemblies from 33 out of 53 species also contained chloroplast sequences (supplementary fig. S9, Supplementary Material online). For Musa acuminata, Ad. buchneroides, Trapa bicornis (master), and Fragaria vesca (master), all parts aligned fully to the chloroplast genome graph, indicating that the assembly of mitochondrial genomes had failed. Since TIPPo removes chloroplast reads first, none of its assemblies contain chloroplast sequences (supplementary fig. S9, Supplementary Material online). Thus, in subsequent analyses, we removed the chloroplast sequences from PMAT mitochondrial genome assemblies.
Given the structural diversity of plant mitochondrial genomes, it is challenging to assess the completeness of results from the assembly graph structure as we did with chloroplasts (Wang 2024). Inspired by BUSCO (Seppey et al. 2019) for assessing the completeness for nuclear genomes, we use 41 protein-coding genes collected by mitopy (Alverson et al. 2010) to evaluate the completeness of mitochondrial assemblies. Out of the 53 species, 35 mitochondrial genomes had previously been published, which we also included in our evaluation (supplementary tables S5 to S8, Supplementary Material online). Considering that the output of mitochondrial genomes from PMAT and TIPPo is in the form of assembly graphs, where large repetitive fragments are represented only once, we focused on the presence or absence of genes, and did not consider orientation or copy number.
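A presence/absence score of this kind reduces to a set comparison. The sketch below is an illustration only: the short gene list is a hypothetical stand-in for the 41-gene set, and the annotation step that produces the list of genes found in an assembly is assumed to have been run separately.

```python
def completeness(expected_genes, found_genes):
    """Return the number of expected genes present and the sorted list of missing ones.

    Copy number and orientation are ignored: a gene counts once,
    however often it was annotated in the assembly graph.
    """
    expected = set(expected_genes)
    present = expected & set(found_genes)
    missing = sorted(expected - present)
    return len(present), missing


# Hypothetical five-gene panel; the study uses 41 protein-coding genes.
expected = ["nad3", "atp6", "rpl5", "rps14", "cox1"]
# Genes annotated in a toy assembly, including a duplicated hit for nad3.
found = ["cox1", "nad3", "nad3", "atp6"]

n_present, missing = completeness(expected, found)
print(f"{n_present}/{len(expected)} genes present; missing: {missing}")
```

Two assemblies of the same species can then be compared by their counts of present genes, and the lists of missing genes point to the parts of the genome that one tool failed to recover.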
As shown in Fig. 3a, TIPPo and PMAT are in agreement regarding the completeness of protein-coding genes in the mitochondrial assemblies of 43 species. The results based on protein-coding genes are consistent with alignments to the published mitochondrial genomes (supplementary table S16, Supplementary Material online). In eight species, the mitochondrial genomes assembled by TIPPo had higher protein-coding gene completeness, while for two species, PMAT outperformed TIPPo. For M. acuminata, both TIPPo and PMAT failed to assemble the mitochondrial genome.
Fig. 3.
Benchmarking of mitochondrial genome assembly. a) See Materials and Methods for phylogenetic tree. The assemblies for Persea americana, Kobresia myosuroides, Triticum monococcum, Panicum miliaceum, Helichrysum umbraculigerum, Ipomoea cairica, Solanum rostratum, Adenosma buchneroides, Sesamum indicum, Perilla frutescens, Thymus quinquecostatus, Citrus australis, Ochroma pyramidale, Linum usitatissimum, Euphorbia peplus, Carya illinoinensis, and Coriaria nepalensis are presented here for the first time. The numbers inside the circles indicate the number of nonredundant protein-coding genes in the assembly. Light shading indicates superior results with TIPPo or PMAT. b) Whole-genome alignment, including the published, TIPPo and PMAT assemblies (both raw and master), of the S. conica mitochondrial genome, visualized with Alitv (v1.0.6). c) TIPPo assembly graph of S. conica visualized with Bandage (v0.9.0). d) PMAT assembly graph of S. conica visualized with Bandage (v0.9.0).
The seven species in which TIPPo was superior include one red alga and two Chlorophyta for which PMAT failed to output mitochondrial genomes, with the TIPPo assemblies matching the published assemblies for these three species. Although Haematococcus lacustris and Haematococcus pluvialis belong to the same genus, their mitochondrial genomes exhibit poor synteny (supplementary fig. S10, Supplementary Material online).
In S. conica, which has one of the largest mitochondrial genomes (11 Mb), TIPPo assembled a mitochondrial genome that was highly consistent with the published genome (Fig. 3b, supplementary fig. S17, Supplementary Material online). PMAT, in contrast, only assembled parts of the mitochondrial genome. The mitochondrial assembly graph from TIPPo had numerous small circular DNAs (Fig. 3c), which PMAT failed to identify (Fig. 3d). A similar issue with missing small circular DNAs in PMAT occurred in Actinidia chinensis and Linum usitatissimum. The TIPPo assembly of Ac. chinensis matched the published genome, which includes a large circular DNA of 724 kb and a smaller circular DNA of 200 kb, whereas PMAT only generated a linearized sequence of the large circle (supplementary fig. S12, Supplementary Material online). In L. usitatissimum, the PMAT assembly had lost two protein-coding genes, rpl5 and rps14, which are present in a circular DNA sequence assembled by TIPPo. Whole-genome alignment again indicated that the PMAT assembly had lost the circular DNA with these two genes (supplementary fig. S13, Supplementary Material online). In Ad. buchneroides, PMAT failed to assemble the mitochondrial genome, whereas TIPPo assembled a 346 kb linear DNA sequence containing 38 protein-coding genes (supplementary fig. S14, Supplementary Material online). Given the number of protein-coding genes in related species (39 in Sesamum, 35 in Perilla, 37 in Salvia, and 36 in Thymus), the linear DNA sequence from TIPPo appears to be largely complete. When using low sequence depth data as inputs for the assemblies (1× and 0.5× nuclear genome coverage), both PMAT and TIPPo showed varying degrees of incomplete assembly. Therefore, for assembling mitochondrial genomes, we do not recommend using ultra-low coverage (supplementary fig. S22, Supplementary Material online).
As mentioned, PMAT outperformed TIPPo for two species. For T. bicornis, the TIPPo assembly graph comprised only linear DNA fragments, indicating the erroneous identification of a large number of nonmitochondrial reads. Using verkko to construct a whole-genome assembly graph revealed that T. bicornis possesses a large rDNA cluster that is misidentified by TIARA (supplementary fig. S15, Supplementary Material online). For Herpetospermum pedunculosum, the TIPPo assembly lacked two genes, nad3 and atp6, due to overfiltering by the k-mer approach (supplementary fig. S19, Supplementary Material online). However, the PMAT raw assembly included nonmitochondrial fragments (supplementary fig. S16, Supplementary Material online).
Computational Cost
Using data from 53 species, we performed chloroplast genome assembly with 3 different tools: TIPPo (chloroplast mode), ptGAUL, and CLAW, all with default parameters. Our results show that both TIPPo and CLAW are approximately five times slower than ptGAUL (Fig. 4a). Regarding peak memory usage, TIPPo required the most memory, consuming three times more than CLAW and five times more than ptGAUL (Fig. 4b). For mitochondrial genome assembly, we utilized PMAT in mt mode and TIPPo in organelle mode. PMAT was approximately eight times slower than TIPPo and consumed four times more memory (Fig. 4a and b). For detailed time and memory usage at the different coverages used, please refer to supplementary tables S9 and S10, Supplementary Material online.
Fig. 4.
Computational cost. a) Ratio of elapsed times between each pair of the four tools. b) Ratio of peak memory usage between each pair of the four tools. Gray dots indicate different species. The means are shown as horizontal lines, with the upper and lower box indicating the interquartile range (IQR), and the whiskers extending to the most extreme values within 1.5 times the IQR from the first and third quartiles.
Identification of NUPTs/NUMTs
Next, we wanted to know whether we could improve on the accurate identification of NUPTs and NUMTs and the elimination of potential contamination of nuclear assemblies with pieces of organellar genomes. High-quality nuclear genomes assembled from PacBio HiFi data are available for all of the species used in this study except Lycopodium japonicum, Ochroma pyramidale, and Perilla frutescens, with the assemblies of the latter two being highly fragmented. Because algal genomes are small and have very few NUPTs and NUMTs (Zhang et al. 2020), we excluded them from further analysis. Musa acuminata was not included either, because we had not been able to assemble the mitochondrial genome. For all other 45 nuclear genome assemblies, we retrieved all contigs/scaffolds over 500 kb.
The species with the longest cumulative lengths of NUMTs were S. conica, Amborella trichopoda, Triticum monococcum, Capsicum pubescens, and Taxus chinensis. This might be attributed to S. conica and Am. trichopoda having large mitochondrial genomes (11 and 3.9 Mb) and Tri. monococcum, C. pubescens, and Ta. chinensis having large nuclear genomes (5, 3.9, and 10 Gb). The latter three species also had the highest cumulative lengths of NUPTs (supplementary table S11, Supplementary Material online). As observed before (Zhang et al. 2020), both NUPT and NUMT lengths are positively correlated with nuclear genome size in plants (Pearson’s correlation coefficients of 0.63 and 0.56; Fig. 5a and b). However, no significant association was found when comparing species across different kingdoms (Richly and Leister 2004a,b). Since NUPTs and NUMTs are part of the nuclear genome, their lengths are also positively correlated (Fig. 5c).
Fig. 5.
Comparison of NUPT and NUMT sequences and the corresponding organellar genomes. a) Comparison of cumulative lengths of NUPTs and of nuclear genome size. b) Comparison of cumulative lengths of NUMTs and of nuclear genome size. c) Comparison of cumulative lengths of NUPTs and of NUMTs. d) Cumulative length distribution of NUPTs across different identities. e) Cumulative length distribution of NUMTs as a function of sequence identity with the corresponding mitochondrial genome. f) Correlation between NUPT/chloroplast genome identity and NUMT/mitochondrial genome identity. Bars indicate standard errors.
NUPTs and NUMTs appear to evolve mostly neutrally, as evidenced by the gradual accumulation of mutations (Huang et al. 2005; Noutsos et al. 2005). Because the substitution rate of plant organellar genomes is typically an order of magnitude lower than that of nuclear genomes (Wolfe et al. 1987; Drouin et al. 2008), the number of differences between NUPT and NUMT sequences and the corresponding organellar genomes reflects the age of nuclear insertions (Richly and Leister 2004a,b; Michalovova et al. 2013; Yoshida et al. 2019). We found that recent insertion events, with sequence identities of 98% to 100%, are most frequent (Fig. 5d and e, supplementary table S12, Supplementary Material online), which is also reflected by the average sequence identities of NUPTs and of NUMTs with their organellar genomes being correlated (Pearson’s correlation coefficient = 0.52; Fig. 5f). We conclude that NUPTs and NUMTs tend to degrade rapidly, which is consistent with individual NUPTs and NUMTs in A. thaliana genomes having low allele frequencies (Igolkina et al. 2024).
Substitution Spectra of NUPTs/NUMTs
C:G > T:A substitutions dominate the substitution spectrum in A. thaliana mutation accumulation lines, both in the greenhouse and in the wild, although not in older natural populations (Ossowski et al. 2010; Cao et al. 2011; Exposito-Alonso et al. 2018; Weng et al. 2019). The excess of C:G > T:A substitutions has been attributed to spontaneous deamination of methylated cytosines (Ossowski et al. 2010). Cytosine methylation is found in plants in three contexts, CG, CHG, and CHH, with most of it in the CG context (Law and Jacobsen 2010). Previous studies have found C:G > T:A substitutions to be the most common substitutions in NUPTs and NUMTs (Huang et al. 2005; Rousseau-Gueutin et al. 2011; Fields et al. 2022). We confirm this phenomenon in our set of 45 species, with the highest substitution rates at CG sites (Fig. 6, supplementary tables S13 and S14, Supplementary Material online).
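The assignment of a substituted cytosine to the CG, CHG, or CHH context can be sketched as follows. The example assumes an ungapped pairwise alignment given as two equal-length strings and considers only cytosines on the forward strand of the organellar sequence; G > A changes, which correspond to cytosines on the reverse strand, would need the same logic applied to the reverse complement. Function names and toy sequences are hypothetical.

```python
from collections import Counter


def cytosine_context(seq, pos):
    """Classify the cytosine at seq[pos] as 'CG', 'CHG', or 'CHH' (H = A, C, or T).

    Returns None when the downstream bases needed for the call are missing or ambiguous.
    """
    seq = seq.upper()
    if seq[pos] != "C":
        raise ValueError("position does not hold a cytosine")
    downstream = seq[pos + 1:pos + 3]
    if any(base not in "ACGT" for base in downstream):
        return None
    if len(downstream) >= 1 and downstream[0] == "G":
        return "CG"
    if len(downstream) < 2:
        return None
    return "CHG" if downstream[1] == "G" else "CHH"


def c_to_t_by_context(organelle, insert):
    """Count C > T differences per context; context is read from the organellar sequence."""
    counts = Counter()
    for i, (ref, alt) in enumerate(zip(organelle.upper(), insert.upper())):
        if ref == "C" and alt == "T":
            context = cytosine_context(organelle, i)
            if context is not None:
                counts[context] += 1
    return counts


# Toy aligned pair: organellar sequence and the corresponding nuclear insertion.
organelle = "ACGTACAGTACATT"
insertion = "ATGTATAGTACATT"
print(dict(c_to_t_by_context(organelle, insertion)))
```

Dividing such counts by the number of cytosines available in each context would give per-context substitution rates, which is the quantity that allows CG sites to be compared with CHG and CHH sites.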
Fig. 6.
The landscape of substitutions in NUPTs and NUMTs. a) Distribution of nucleotide substitutions in NUPTs, inferred from sequence comparison with the corresponding chloroplast genome. b) Distribution of nucleotide substitutions in NUMTs, inferred from sequence comparison with the corresponding mitochondrial genome. c) Enrichment of cytosine substitutions in NUPTs and NUMTs at CG sites.
Small interfering RNA Targeting NUPTs and NUMTs
The increased substitution rate at CG sites in NUPTs and NUMTs suggested that these are often methylated, which has been directly confirmed in several instances (Yoshida et al. 2014; Fields et al. 2022). The most common type of DNA methylation in plants, RNA-directed DNA methylation, is associated with small interfering RNAs (siRNAs) (Sigman and Slotkin 2016), and we therefore tested the hypothesis that NUPTs and NUMTs are enriched for siRNAs. In a previous study, siRNA data were generated for 11 of the 45 species that we investigated (Lunardon et al. 2020), and we annotated siRNA loci by mapping siRNA reads (Axtell 2013).
For all 11 species, the overlap of siRNA loci with NUPT/NUMTs was significantly higher than expected by chance (Fig. 7, supplementary table S15, Supplementary Material online), demonstrating that siRNAs are indeed enriched in NUPTs and NUMTs.
Fig. 7.
Enrichment of siRNAs in NUPTs and NUMTs. a) Overlap of siRNA loci with NUPTs. b) Overlaps of siRNA loci with NUMTs. Species in (a) and (b) annotated at the bottom. The numbers on top of each bar represent the enrichment, and the error bars represent the 95% CI from random sampling of the genome.
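One way to obtain an enrichment value and a 95% interval from random sampling is a simple interval-shuffling test, sketched below. The exact procedure used in the study is not specified here, so the details are assumptions: a single linear sequence, half-open (start, end) intervals, uniform random relocation of each insertion, and overlap measured in base pairs. Randomly relocated intervals may overlap one another in this simplified version.

```python
import random


def overlap_bp(intervals_a, intervals_b):
    """Total number of bases shared between two lists of half-open (start, end) intervals."""
    total = 0
    for a_start, a_end in intervals_a:
        for b_start, b_end in intervals_b:
            total += max(0, min(a_end, b_end) - max(a_start, b_start))
    return total


def enrichment(inserts, sirna_loci, genome_length, n_iter=1000, seed=0):
    """Compare observed overlap with overlaps after random relocation of the insertions."""
    rng = random.Random(seed)
    observed = overlap_bp(inserts, sirna_loci)
    null = []
    for _ in range(n_iter):
        shuffled = []
        for start, end in inserts:
            length = end - start
            new_start = rng.randrange(0, genome_length - length + 1)
            shuffled.append((new_start, new_start + length))
        null.append(overlap_bp(shuffled, sirna_loci))
    null.sort()
    mean_null = sum(null) / n_iter
    interval = (null[int(0.025 * n_iter)], null[int(0.975 * n_iter) - 1])
    ratio = observed / mean_null if mean_null > 0 else float("inf")
    return observed, ratio, interval


# Toy coordinates on a 100 kb sequence (made-up values for illustration).
sirna_loci = [(1000, 2000), (50000, 51000)]
inserts = [(1200, 1800), (50200, 50900)]
observed, ratio, interval = enrichment(inserts, sirna_loci, genome_length=100_000)
print(observed, round(ratio, 1), interval)
```

An observed overlap that lies above the upper bound of the random interval indicates enrichment; the ratio of observed to mean random overlap corresponds to the kind of fold enrichment printed above the bars.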
Conclusions
We introduce TIPPo, a user-friendly, reference-free approach for assembling plant organellar genomes. TIPPo provides a streamlined and universal assembly process without the need for external reference genomes. For both chloroplast and mitochondrial genomes, we provide assembly graphs. For chloroplast genomes, we provide in addition information on heteroplasmy. A limitation of our approach is that it can only use high-quality long reads, but we feel this is justified given that this technology underpins many of the ongoing large-scale genome sequencing and assembly projects (Rhie et al. 2021; Darwin Tree of Life Project Consortium 2022; Lewin et al. 2022). We also note that another newly released assembler for plant organellar genomes that comes from some of the colleagues leading these large-scale efforts is also restricted to the use of high-quality long reads (Zhou et al. 2024).
TIPPo outperforms all other tested assemblers for chloroplast genomes. Compared with chloroplast genomes, assessing the performance for mitochondrial genomes is more difficult due to the diversity of plant mitochondrial genomes. Based on the completeness of protein-coding genes, TIPPo outperforms the second-best tool PMAT (Bi et al. 2024) in eight species, while PMAT was superior for two species, T. bicornis and H. pedunculosum. A significant factor appears to be the presence of a large rDNA cluster in the nuclear genome of T. bicornis, which results in poor classification by Tiara (Karlicki et al. 2022), the initial tool used by TIPPo for selecting input reads for the assembly. The incomplete mitochondrial assembly of H. pedunculosum is likely the result of excessive filtering using a k-mer-based approach.
Materials and Methods
Data Sources
HiFi datasets were downloaded from publicly available databases; details are given in supplementary table S1, Supplementary Material online. The accession numbers for chloroplast and mitochondrial genomes are provided in supplementary tables S2 and S3, Supplementary Material online. A phylogenetic tree of the 53 species was constructed with rtrees (https://github.com/daijiang/rtrees; Li 2023).
Evaluation of Tiara for Read Classification
First, minimap2 (2.24-r1122) with the parameter map-hifi was used to align all HiFi reads to the A. thaliana (Rabanal et al. 2022) and O. sativa (Shang et al. 2023) reference genomes, retaining only the primary alignments. Next, Tiara (1.0.3; Karlicki et al. 2022) was used to classify HiFi reads as organellar. A 100 kb sliding window was applied to calculate the proportion of reads classified as organellar by Tiara compared with minimap2 in each window. The results were visualized using ggplot2 (3.5.1).
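The per-window comparison described above can be sketched as follows. This is a minimal illustration, not the code used in the study: it assumes non-overlapping 100 kb bins, assigns each read to a bin by the start of its primary alignment, and uses hypothetical input structures (`alignments`, `organellar_reads`) that would have to be derived from the minimap2 and Tiara outputs.

```python
from collections import defaultdict

WINDOW = 100_000  # 100 kb, as stated in the text


def organellar_fraction_per_window(alignments, organellar_reads, window=WINDOW):
    """Fraction of aligned reads per window that the classifier called organellar.

    alignments: iterable of (read_id, chrom, start) for primary alignments
                (hypothetical structure, e.g. parsed from a PAF file).
    organellar_reads: set of read ids labelled organellar by the classifier.
    Returns {(chrom, window_start): fraction}.
    """
    total = defaultdict(int)
    flagged = defaultdict(int)
    for read_id, chrom, start in alignments:
        # Assumption: bin each read by the start of its primary alignment.
        key = (chrom, (start // window) * window)
        total[key] += 1
        if read_id in organellar_reads:
            flagged[key] += 1
    return {key: flagged[key] / total[key] for key in total}
```

Windows on the nuclear chromosomes with a high fraction would point to regions, such as nuclear insertions of organellar DNA, where reads are misclassified as organellar.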
Parameter Selection for Tiara and Flye
For evaluating the impact of parameters on Flye, we tested: (i) default parameters; (ii) default parameters with --meta; (iii) default parameters with --keep-haplotypes; and (iv) default parameters with both --meta and --keep-haplotypes (supplementary fig. S18, Supplementary Material online). For selecting the best parameters for Tiara, we used different parameter combinations: k1 with 3 values (4 to 6), k2 with 4 values (4 to 7), and p with 15 values (0.3 to 1), resulting in a total of 180 combinations for read classification.
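The size of the search space can be checked with a short enumeration. This is only a sketch: the text gives the range and number of p values but not the individual values, so an even spacing of 0.05 is assumed here, and the variable names are illustrative.

```python
from itertools import product

# Four Flye configurations tested (extra flags added to the default run).
flye_flag_sets = [[], ["--meta"], ["--keep-haplotypes"], ["--meta", "--keep-haplotypes"]]

# Tiara parameter grid.
k1_values = [4, 5, 6]            # 3 values (4 to 6)
k2_values = [4, 5, 6, 7]         # 4 values (4 to 7)
# Assumption: 15 evenly spaced thresholds from 0.30 to 1.00 (step 0.05).
p_values = [round(0.30 + 0.05 * i, 2) for i in range(15)]

grid = list(product(k1_values, k2_values, p_values))
print(len(grid))  # 3 * 4 * 15 combinations
```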
Assembly of Organellar Genomes
We used fxTools (v0.1.0; https://github.com/moold/fxTools) for subsampling PacBio HiFi reads to approximately 4× nuclear genome coverage for each species, except for Ta. chinensis, which has a particularly large nuclear genome (Xiong et al. 2021) and for which we used 2×. For S. conica, with its large mitochondrial genome (Sloan et al. 2012), we used 10× nuclear genome coverage. For Ly. japonicum, the available sequencing data amount to only 0.59× coverage (Bi et al. 2024). We used identical datasets for assembly with the different tools. TIPPo (v2.1) with default parameters was used to assemble chloroplast and mitochondrial genomes simultaneously. PMAT (v1.5.3; Bi et al. 2024) is optimized for the assembly of plant mitochondrial genomes and has not been optimized for chloroplast assembly (supplementary fig. S5, Supplementary Material online). For PMAT, the auto mode was first used with the parameters -tp mt and -tp all, applied separately. Subsequently, the buildgraph mode was applied using the output from the auto mode. For ptGAUL (v1.0.5; Zhou et al. 2023) and CLAW (https://github.com/aaronphillips7493/CLAW; Phillips et al. 2024), which only assemble chloroplast genomes, we provided the chloroplast genome sequences of closely related species as references and ran both tools with default parameters.
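The arithmetic behind subsampling to a target coverage is simple and can be sketched as below. The function names and the genome size in the usage example are hypothetical, and the snippet does not reproduce how fxTools itself selects reads.

```python
def target_bases(genome_size_bp, coverage):
    """Number of read bases needed for the requested nuclear genome coverage."""
    return int(genome_size_bp * coverage)


def subsample_fraction(total_read_bases, genome_size_bp, coverage):
    """Fraction of the dataset to keep; capped at 1 when the data are too sparse."""
    if total_read_bases <= 0:
        raise ValueError("total_read_bases must be positive")
    return min(1.0, target_bases(genome_size_bp, coverage) / total_read_bases)


# Hypothetical example: a 135 Mb genome, 4 Gb of HiFi reads, 4x target coverage.
print(subsample_fraction(4_000_000_000, 135_000_000, 4))
```

When a dataset holds fewer bases than the target, as for the 0.59× dataset mentioned above, the fraction is 1 and all reads are used.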
Whole-Genome Alignment and Visualization
To compare genomes assembled from different sources, whole-genome alignments were performed with MiniTV (https://github.com/weigelworld/minitv), which uses minimap2 (v2.24-r1122; Li 2018) for alignment, followed by visualization with AliTV (v1.0.6) (https://alitvteam.github.io/AliTV/d3/AliTV.html; Ankenbrand et al. 2017).
Removal of Chloroplast Sequences From Mitochondrial Assemblies
First, we converted the mitochondrial assembly graphs into fragments. Given that the TIPPo chloroplast assembly results are the cleanest and the most complete, we aligned the mitochondrial contigs from PMAT (v1.5.3; Bi et al. 2024) to the TIPPo chloroplast genome using minimap2 (2.24-r1122; Li 2018). Contigs for which >90% of the length was covered by the chloroplast genome, with >95% similarity, were labeled as “chloroplast.” Using Bandage (v0.9.0; Wick et al. 2015), we colored the nodes identified as chloroplast sequences in green and confirmed their identity by visual inspection. We then removed the chloroplast sequences from the mitochondrial assemblies.
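The coverage and similarity filter can be illustrated with a small PAF parser. This is a sketch under stated assumptions: the alignments are in minimap2 PAF format with contigs as queries, similarity is approximated as matching bases divided by alignment block length (PAF columns 10 and 11) summed over all hits of a contig, and the function names are illustrative.

```python
from collections import defaultdict


def merged_length(intervals):
    """Total length covered by a set of possibly overlapping [start, end) intervals."""
    total, reach = 0, 0
    for start, end in sorted(intervals):
        start = max(start, reach)  # skip the part already counted
        if end > start:
            total += end - start
            reach = end
    return total


def chloroplast_like_contigs(paf_lines, min_cov=0.90, min_identity=0.95):
    """Names of query contigs covered >min_cov by the target with >min_identity."""
    qlen = {}
    spans = defaultdict(list)
    matches = defaultdict(int)
    aligned = defaultdict(int)
    for line in paf_lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 12:
            continue  # not a complete PAF record
        name = f[0]
        qlen[name] = int(f[1])
        spans[name].append((int(f[2]), int(f[3])))
        matches[name] += int(f[9])    # number of matching bases
        aligned[name] += int(f[10])   # alignment block length
    labelled = set()
    for name, length in qlen.items():
        if aligned[name] == 0:
            continue
        coverage = merged_length(spans[name]) / length
        identity = matches[name] / aligned[name]
        if coverage > min_cov and identity > min_identity:
            labelled.add(name)
    return labelled
```

Contigs returned by such a filter would still be candidates only; in the workflow above they were confirmed by visual inspection in Bandage before removal.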
Assessing Assembly Completeness
We obtained amino acid sequence files for 41 conserved mitochondrial genes from Mitofy (https://github.com/dsenalik/mitofy; Alverson et al. 2010). We used BLASTX (2.9.0+; McGinnis and Madden 2004) to align mitochondrial genome assemblies to each of the 41 genes, using an E-value threshold of 1e−3. Because the mitochondrial assemblies are presented as assembly graphs, in which long repeats are collapsed into a single node, we evaluated gene completeness based on the presence or absence of genes, without accounting for their copy number.
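The presence/absence scoring can be sketched as follows. The snippet assumes BLASTX results in the standard 12-column tabular layout (subject id in column 2, E-value in column 11) and that the subject id is the gene name; neither detail is stated in the text, and the function names are illustrative.

```python
def genes_present(blast_lines, max_evalue=1e-3):
    """Set of subject (gene) ids with at least one hit at or below the E-value cutoff."""
    found = set()
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        if float(f[10]) <= max_evalue:
            found.add(f[1])  # copy number is ignored: a set keeps each gene once
    return found


def completeness(found, reference_genes):
    """Fraction of reference genes detected at least once."""
    reference = set(reference_genes)
    return len(found & reference) / len(reference)
```

Because hits are collected in a set, a gene located on a collapsed repeat node and a gene present in several copies are both counted once, which matches the copy-number-agnostic evaluation described above.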
Performance Benchmarking
All organellar genomes were assembled on an AMD EPYC 7742 processor with 64 cores and 1 TB of RAM. Runtime and peak memory usage were measured using the /usr/bin/time -v command. All the assembly tools were set to run with 40 threads.
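For collecting these measurements across many runs, the report written by GNU time can be parsed programmatically. The sketch below assumes the report of /usr/bin/time -v has been saved as text and reads the two fields relevant here; the function name is illustrative.

```python
def parse_time_v(report):
    """Return (wall_clock_seconds, peak_rss_kbytes) from a GNU `time -v` report."""
    wall_seconds = None
    peak_rss_kb = None
    for line in report.splitlines():
        line = line.strip()
        if line.startswith("Elapsed (wall clock) time"):
            # The value is formatted as h:mm:ss or m:ss(.ss).
            value = line.rsplit(": ", 1)[1]
            seconds = 0.0
            for part in value.split(":"):
                seconds = seconds * 60 + float(part)
            wall_seconds = seconds
        elif line.startswith("Maximum resident set size (kbytes)"):
            peak_rss_kb = int(line.rsplit(": ", 1)[1])
    return wall_seconds, peak_rss_kb
```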
NUMT and NUPT Analysis
To identify NUPTs and NUMTs in the nuclear genome, we used BLASTN (2.9.0+; McGinnis and Madden 2004) with the parameters -evalue 1e-5, -dust no, -penalty -2, -word_size 9, and -outfmt 6. We aligned the chloroplast and mitochondrial genomes to their respective nuclear genomes and retained hits with an identity of >80% and a length >100 bp. Considering the redundancy in the BLASTN output, we removed all high-scoring segment pairs (HSPs) completely embedded in longer HSPs. We merged overlapping HSPs with bedtools (v2.31.1; Quinlan and Hall 2010). The identity of the merged interval in the nuclear genome to the organellar genome was calculated as the average of the identities before merging.
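The filtering and merging steps can be sketched in plain Python. This is an illustration rather than the pipeline used in the study: it assumes the organellar genome is the BLASTN query and the nuclear genome the subject, evaluates embedding on nuclear coordinates, and re-implements the interval merge instead of calling bedtools. Book-ended intervals are merged as well, which mirrors the default behavior of bedtools merge.

```python
def parse_hsps(lines, min_identity=80.0, min_length=100):
    """Keep HSPs with identity > min_identity and length > min_length.

    Returns tuples (chrom, start, end, identity) in 0-based, half-open
    nuclear coordinates; minus-strand hits (sstart > send) are normalised.
    """
    hsps = []
    for line in lines:
        f = line.rstrip("\n").split("\t")
        pident, length = float(f[2]), int(f[3])
        if pident > min_identity and length > min_length:
            start, end = sorted((int(f[8]), int(f[9])))
            hsps.append((f[1], start - 1, end, pident))
    return hsps


def drop_embedded(hsps):
    """Remove HSPs completely contained in a longer HSP on the same chromosome."""
    kept = []
    # Sorting by start (ascending) and end (descending) puts containers first.
    for hsp in sorted(hsps, key=lambda h: (h[0], h[1], -h[2])):
        chrom, start, end, _ = hsp
        embedded = any(
            k[0] == chrom and k[1] <= start and end <= k[2] and (k[2] - k[1]) > (end - start)
            for k in kept
        )
        if not embedded:
            kept.append(hsp)
    return kept


def merge_overlapping(hsps):
    """Merge overlapping or book-ended HSPs and average their identities."""
    merged = []
    for chrom, start, end, pident in sorted(hsps):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            merged[-1][2] = max(merged[-1][2], end)
            merged[-1][3].append(pident)
        else:
            merged.append([chrom, start, end, [pident]])
    return [(c, s, e, sum(p) / len(p)) for c, s, e, p in merged]
```

Summing `end - start` over the merged intervals would then give the cumulative NUPT or NUMT length per nuclear genome.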
To identify substitutions in NUPTs and NUMTs relative to the chloroplast and mitochondrial genomes, we used minimap2 (version 2.24-r1122; Li 2018) with the parameters --paf-no-hit -ax asm5 --cs -r2k to generate alignment files. Finally, we used htsbox (version r345) (https://github.com/lh3/htsbox) with the parameters pileup -q5 -evcf to call variants.
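Once substitutions have been called, they can be grouped into the six strand-symmetric classes used when reporting changes such as C:G to T:A. The sketch below is an assumed post-processing step, not part of the commands above; it treats the organellar base as ancestral and the nuclear copy as derived, and the function name is illustrative.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def substitution_class(ancestral, derived):
    """Strand-collapsed class of a single-base substitution, e.g. 'C:G>T:A'."""
    a, d = ancestral.upper(), derived.upper()
    if a not in COMPLEMENT or d not in COMPLEMENT or a == d:
        raise ValueError(f"not a single-base substitution: {ancestral}>{derived}")
    if a in "GT":
        # Report the change on the strand where the ancestral base is C or A.
        a, d = COMPLEMENT[a], COMPLEMENT[d]
    return f"{a}:{COMPLEMENT[a]}>{d}:{COMPLEMENT[d]}"


# Hypothetical calls: C>T and G>A are the same class on opposite strands.
spectrum = Counter(substitution_class(a, d) for a, d in [("C", "T"), ("G", "A"), ("A", "G")])
print(spectrum)
```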
Annotation of siRNA Loci and Overlap With NUPTs/NUMTs
For each of the selected 11 species, we downloaded data from 2 libraries. We used ShortStack (v4.0.4; Axtell 2013) with default parameters to annotate siRNA loci. In short, reads with one or no mismatch were retained, and multimapping reads were assigned to a single location with the U model. GAT (v1.3.5; Heger et al. 2013) was used with the parameter --num-samples=1000 to test whether the overlap of siRNA loci with NUPTs/NUMTs was greater than expected by chance.
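The logic of a sampling-based enrichment test can be illustrated with a deliberately simplified stand-in. This is not GAT's algorithm: it assumes a single linear sequence, places each siRNA locus uniformly at random while preserving its length, and ignores workspace restrictions and isochores. All names are illustrative.

```python
import random


def overlap_bp(segments, annotations):
    """Bases of the segments that fall inside the annotations ([start, end) intervals)."""
    total = 0
    for s_start, s_end in segments:
        for a_start, a_end in annotations:
            total += max(0, min(s_end, a_end) - max(s_start, a_start))
    return total


def enrichment(segments, annotations, genome_length, n_samples=1000, seed=0):
    """Observed vs. expected overlap, with a 95% interval from random placement."""
    rng = random.Random(seed)
    observed = overlap_bp(segments, annotations)
    simulated = []
    for _ in range(n_samples):
        shuffled = []
        for start, end in segments:
            length = end - start
            new_start = rng.randint(0, genome_length - length)
            shuffled.append((new_start, new_start + length))
        simulated.append(overlap_bp(shuffled, annotations))
    simulated.sort()
    expected = sum(simulated) / n_samples
    return {
        "observed": observed,
        "expected": expected,
        "fold": observed / expected if expected > 0 else float("inf"),
        "ci95": (simulated[int(0.025 * n_samples)], simulated[int(0.975 * n_samples) - 1]),
        # Empirical one-sided P value with the usual +1 correction.
        "p_value": (sum(1 for s in simulated if s >= observed) + 1) / (n_samples + 1),
    }
```

Here `segments` would be the siRNA loci and `annotations` the merged NUPT or NUMT intervals; a fold value well above 1 with a small P value corresponds to the enrichment shown in the figure.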
Acknowledgments
The authors thank Andrea Movilli, Adrian Contreras, Yueqi Tao, Svitlana Sushko, Li He, and Haim Ashkenazy for the discussions. A.G. acknowledges support by the High Performance and Cloud Computing Group at the Zentrum für Datenverarbeitung of the University of Tübingen, the state of Baden-Württemberg through bwHPC and the German Research Foundation (DFG) through grant no. INST 37/935-1 FUGG, and the de.NBI Cloud within the German Network for Bioinformatics Infrastructure and ELIXIR-DE (Forschungszentrum Jülich and W-de.NBI-001, W-de.NBI-004, W-de.NBI-008, W-de.NBI-010, W-de.NBI-013, W-de.NBI-014, W-de.NBI-016, and W-de.NBI-022) for providing computational resources to carry out software testing in this work. This work was supported by the Max Planck Society and the Novozymes Prize of the Novo Nordisk Foundation (D.W.).
Contributor Information
Wenfei Xian, Department of Molecular Biology, Max Planck Institute for Biology Tübingen, 72076 Tübingen, Germany.
Ilja Bezrukov, Department of Molecular Biology, Max Planck Institute for Biology Tübingen, 72076 Tübingen, Germany.
Zhigui Bao, Department of Molecular Biology, Max Planck Institute for Biology Tübingen, 72076 Tübingen, Germany.
Sebastian Vorbrugg, Department of Molecular Biology, Max Planck Institute for Biology Tübingen, 72076 Tübingen, Germany.
Anupam Gautam, Algorithms in Bioinformatics, Institute for Bioinformatics and Medical Informatics, University of Tübingen, 72076 Tübingen, Germany; International Max Planck Research School “From Molecules to Organisms”, Max Planck Institute for Biology Tübingen, 72076 Tübingen, Germany.
Detlef Weigel, Department of Molecular Biology, Max Planck Institute for Biology Tübingen, 72076 Tübingen, Germany; Institute for Bioinformatics and Medical Informatics, University of Tübingen, 72076 Tübingen, Germany.
Supplementary Material
Supplementary material is available at Molecular Biology and Evolution online.
Author Contributions
W.X. designed the project, conducted the analyses, and wrote the first draft of the manuscript. I.B. set up the computational environment. I.B., Z.B., S.V., and A.G. tested the tool. D.W. supervised the project. W.X. and D.W. prepared the final manuscript with inputs from all authors.
Data Availability
Chloroplast and mitochondrial assembly graphs are available on Figshare at https://doi.org/10.6084/m9.figshare.26362141.v1. TIPPo is available on GitHub (https://github.com/Wenfei-Xian/TIPP). Code to reproduce the results of this paper can be found on GitHub (https://github.com/Wenfei-Xian/Reproducible_for_TIPP_paper).
References
- Aldrich J, Cherney B, Merlin E, Williams C, Mets L. Recombination within the inverted repeat sequences of the Chlamydomonas reinhardii chloroplast genome produces two orientation isomers. Curr Genet. 1985:9(3):233–238. 10.1007/BF00420317.
- Alverson AJ, Wei X, Rice DW, Stern DB, Barry K, Palmer JD. Insights into the evolution of mitochondrial genome size from complete sequences of Citrullus lanatus and Cucurbita pepo (Cucurbitaceae). Mol Biol Evol. 2010:27(6):1436–1448. 10.1093/molbev/msq029.
- Ankenbrand MJ, Hohlfeld S, Hackl T, Förster F. AliTV—interactive visualization of whole genome comparisons. PeerJ Comput Sci. 2017:3:e116. 10.7717/peerj-cs.116.
- Axtell MJ. ShortStack: comprehensive annotation and quantification of small RNA genes. RNA. 2013:19(6):740–751. 10.1261/rna.035279.112.
- Bi C, Shen F, Han F, Qu Y, Hou J, Xu K, Xu L-A, He W, Wu Z, Yin T. PMAT: an efficient plant mitogenome assembly toolkit using low-coverage HiFi sequencing data. Hortic Res. 2024:11(3):uhae023. 10.1093/hr/uhae023.
- Cao J, Schneeberger K, Ossowski S, Günther T, Bender S, Fitz J, Koenig D, Lanz C, Stegle O, Lippert C, et al. Whole-genome sequencing of multiple Arabidopsis thaliana populations. Nat Genet. 2011:43(10):956–963. 10.1038/ng.911.
- Cheng H, Concepcion GT, Feng X, Zhang H, Li H. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nat Methods. 2021:18(2):170–175. 10.1038/s41592-020-01056-5.
- Darwin Tree of Life Project Consortium. Sequence locally, think globally: the Darwin Tree of Life Project. Proc Natl Acad Sci U S A. 2022:119(4):e2115642118. 10.1073/pnas.2115642118.
- Dierckxsens N, Mardulyn P, Smits G. NOVOPlasty: de novo assembly of organelle genomes from whole genome data. Nucleic Acids Res. 2017:45(4):e18. 10.1093/nar/gkw955.
- Drouin G, Daoud H, Xia J. Relative rates of synonymous substitutions in the mitochondrial, chloroplast and nuclear genomes of seed plants. Mol Phylogenet Evol. 2008:49(3):827–831. 10.1016/j.ympev.2008.09.009.
- Exposito-Alonso M, Becker C, Schuenemann VJ, Reiter E, Setzer C, Slovak R, Brachi B, Hagmann J, Grimm DG, Chen J, et al. The rate and potential relevance of new mutations in a colonizing plant lineage. PLoS Genet. 2018:14(2):e1007155. 10.1371/journal.pgen.1007155.
- Fields PD, Waneka G, Naish M, Schatz MC, Henderson IR, Sloan DB. Complete sequence of a 641-kb insertion of mitochondrial DNA in the Arabidopsis thaliana nuclear genome. Genome Biol Evol. 2022:14(5):evac059. 10.1093/gbe/evac059.
- Hazkani-Covo E, Zeller RM, Martin W. Molecular poltergeists: mitochondrial DNA copies (numts) in sequenced nuclear genomes. PLoS Genet. 2010:6(2):e1000834. 10.1371/journal.pgen.1000834.
- He W, Xiang K, Chen C, Wang J, Wu Z. Master graph: an essential integrated assembly model for the plant mitogenome based on a graph-based framework. Brief Bioinform. 2022:24(1):bbac522. 10.1093/bib/bbac522.
- Heger A, Webber C, Goodson M, Ponting CP, Lunter G. GAT: a simulation framework for testing the association of genomic intervals. Bioinformatics. 2013:29(16):2046–2048. 10.1093/bioinformatics/btt343.
- Huang CY, Grünheit N, Ahmadinejad N, Timmis JN, Martin W. Mutational decay and age of chloroplast and mitochondrial genomes transferred recently to angiosperm nuclear chromosomes. Plant Physiol. 2005:138(3):1723–1733. 10.1104/pp.105.060327.
- Igolkina A, Vorbrugg S, Rabanal F, Liu H-J, Ashkenazy H, Kornienko A, Fitz J, Collenberg M, Kubica C, Morales AM, et al. Towards an unbiased characterization of genetic polymorphism. bioRxiv 596703. 10.1101/2024.05.30.596703, 30 May 2024, preprint: not peer reviewed.
- Jin J-J, Yu W-B, Yang J-B, Song Y, dePamphilis CW, Yi T-S, Li D-Z. GetOrganelle: a fast and versatile toolkit for accurate de novo assembly of organelle genomes. Genome Biol. 2020:21(1):241. 10.1186/s13059-020-02154-5.
- Karlicki M, Antonowicz S, Karnkowska A. Tiara: deep learning-based classification system for eukaryotic sequences. Bioinformatics. 2022:38(2):344–350. 10.1093/bioinformatics/btab672.
- Karnkowska A, Vacek V, Zubáčová Z, Treitli SC, Petrželková R, Eme L, Novák L, Žárský V, Barlow LD, Herman EK, et al. A eukaryote without a mitochondrial organelle. Curr Biol. 2016:26(10):1274–1284. 10.1016/j.cub.2016.03.053.
- Kokot M, Dlugosz M, Deorowicz S. KMC 3: counting and manipulating k-mer statistics. Bioinformatics. 2017:33(17):2759–2761. 10.1093/bioinformatics/btx304.
- Kolmogorov M, Yuan J, Lin Y, Pevzner PA. Assembly of long, error-prone reads using repeat graphs. Nat Biotechnol. 2019:37(5):540–546. 10.1038/s41587-019-0072-8.
- Ladoukakis ED, Zouros E. Evolution and inheritance of animal mitochondrial DNA: rules and exceptions. J Biol Res (Thessalon). 2017:24(1):1–7. 10.1186/s40709-017-0060-4.
- Law JA, Jacobsen SE. Establishing, maintaining and modifying DNA methylation patterns in plants and animals. Nat Rev Genet. 2010:11(3):204–220. 10.1038/nrg2719.
- Lewin HA, Richards S, Lieberman Aiden E, Allende ML, Archibald JM, Bálint M, Barker KB, Baumgartner B, Belov K, Bertorelle G, et al. The Earth BioGenome Project 2020: starting the clock. Proc Natl Acad Sci U S A. 2022:119(4):e2115635118. 10.1073/pnas.2115635118.
- Li D. rtrees: an R package to assemble phylogenetic trees from megatrees. Ecography. 2023:2023(7):06643. 10.1111/ecog.06643.
- Li H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 2018:34(18):3094–3100. 10.1093/bioinformatics/bty191.
- Lunardon A, Johnson NR, Hagerott E, Phifer T, Polydore S, Coruh C, Axtell MJ. Integrated annotations and analyses of small RNA-producing loci from 47 diverse plants. Genome Res. 2020:30(3):497–513. 10.1101/gr.256750.119.
- Maumus F, Quesneville H. Ancestral repeats have shaped epigenome and genome composition for millions of years in Arabidopsis thaliana. Nat Commun. 2014:5(1):4104. 10.1038/ncomms5104.
- McGinnis S, Madden TL. BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res. 2004:32(Web Server):W20–W25. 10.1093/nar/gkh435.
- Michalovova M, Vyskot B, Kejnovsky E. Analysis of plastid and mitochondrial DNA insertions in the nucleus (NUPTs and NUMTs) of six plant species: size, relative age and chromosomal localization. Heredity (Edinb). 2013:111(4):314–320. 10.1038/hdy.2013.51.
- Naish M, Alonge M, Wlodzimierz P, Tock AJ, Abramson BW, Schmücker A, Mandáková T, Jamge B, Lambing C, Kuo P, et al. The genetic and epigenetic landscape of the Arabidopsis centromeres. Science. 2021:374(6569):eabi7489. 10.1126/science.abi7489.
- Noutsos C, Richly E, Leister D. Generation and evolutionary fate of insertions of organelle DNA in the nuclear genomes of flowering plants. Genome Res. 2005:15(5):616–628. 10.1101/gr.3788705.
- Nurk S, Koren S, Rhie A, Rautiainen M, Bzikadze AV, Mikheenko A, Vollger MR, Altemose N, Uralsky L, Gershman A, et al. The complete sequence of a human genome. Science. 2022:376(6588):44–53. 10.1126/science.abj6987.
- Ossowski S, Schneeberger K, Lucas-Lledó JI, Warthmann N, Clark RM, Shaw RG, Weigel D, Lynch M. The rate and molecular spectrum of spontaneous mutations in Arabidopsis thaliana. Science. 2010:327(5961):92–94. 10.1126/science.1180677.
- Palmer JD. Chloroplast DNA exists in two orientations. Nature. 1983:301(5895):92–93. 10.1038/301092a0.
- Phillips AL, Ferguson S, Burton RA, Watson-Haigh NS. CLAW: an automated Snakemake workflow for the assembly of chloroplast genomes from long-read data. PLoS Comput Biol. 2024:20(2):e1011870. 10.1371/journal.pcbi.1011870.
- Putintseva YA, Bondar EI, Simonov EP, Sharov VV, Oreshkova NV, Kuzmin DA, Konstantinov YM, Shmakov VN, Belkov VI, Sadovsky MG, et al. Siberian larch (Larix sibirica Ledeb.) mitochondrial genome assembled using both short and long nucleotide sequence reads is currently the largest known mitogenome. BMC Genomics. 2020:21(1):654. 10.1186/s12864-020-07061-4.
- Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010:26(6):841–842. 10.1093/bioinformatics/btq033.
- Rabanal FA, Gräff M, Lanz C, Fritschi K, Llaca V, Lang M, Carbonell-Bejerano P, Henderson I, Weigel D. Pushing the limits of HiFi assemblies reveals centromere diversity between two Arabidopsis thaliana genomes. Nucleic Acids Res. 2022:50(21):12309–12327. 10.1093/nar/gkac1115.
- Rautiainen M, Marschall T. GraphAligner: rapid and versatile sequence-to-graph alignment. Genome Biol. 2020:21(1):253. 10.1186/s13059-020-02157-2.
- Rhie A, McCarthy SA, Fedrigo O, Damas J, Formenti G, Koren S, Uliano-Silva M, Chow W, Fungtammasan A, Kim J, et al. Towards complete and error-free genome assemblies of all vertebrate species. Nature. 2021:592(7856):737–746. 10.1038/s41586-021-03451-0.
- Richly E, Leister D. NUPTs in sequenced eukaryotes and their genomic organization in relation to NUMTs. Mol Biol Evol. 2004a:21(10):1972–1980. 10.1093/molbev/msh210.
- Richly E, Leister D. NUMTs in sequenced eukaryotic genomes. Mol Biol Evol. 2004b:21(6):1081–1084. 10.1093/molbev/msh110.
- Rousseau-Gueutin M, Ayliffe MA, Timmis JN. Conservation of plastid sequences in the plant nuclear genome for millions of years facilitates endosymbiotic evolution. Plant Physiol. 2011:157(4):2181–2193. 10.1104/pp.111.185074.
- Sanchez-Puerta MV, García LE, Wohlfeiler J, Ceriotti LF. Unparalleled replacement of native mitochondrial genes by foreign homologs in a holoparasitic plant. New Phytol. 2017:214(1):376–387. 10.1111/nph.14361.
- Seppey M, Manni M, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness. Methods Mol Biol. 2019:1962:227–245. 10.1007/978-1-4939-9173-0_14.
- Sereika M, Kirkegaard RH, Karst SM, Michaelsen TY, Sørensen EA, Wollenberg RD, Albertsen M. Oxford Nanopore R10.4 long-read sequencing enables the generation of near-finished bacterial genomes from pure cultures and metagenomes without short-read or reference polishing. Nat Methods. 2022:19(7):823–826. 10.1038/s41592-022-01539-7.
- Shang L, He W, Wang T, Yang Y, Xu Q, Zhao X, Yang L, Zhang H, Li X, Lv Y, et al. A complete assembly of the rice Nipponbare reference genome. Mol Plant. 2023:16(8):1232–1236. 10.1016/j.molp.2023.08.003.
- Sigman MJ, Slotkin RK. The first rule of plant transposable element silencing: location, location, location. Plant Cell. 2016:28(2):304–313. 10.1105/tpc.15.00869.
- Sloan DB, Alverson AJ, Chuckalovcak JP, Wu M, McCauley DE, Palmer JD, Taylor DR. Rapid evolution of enormous, multichromosomal genomes in flowering plant mitochondria with exceptionally high mutation rates. PLoS Biol. 2012:10(1):e1001241. 10.1371/journal.pbio.1001241.
- Soorni A, Haak D, Zaitlin D, Bombarely A. Organelle_PBA, a pipeline for assembling chloroplast and mitochondrial genomes from PacBio DNA sequencing data. BMC Genomics. 2017:18(1):49. 10.1186/s12864-016-3412-9.
- Štorchová H, Krüger M. The overview of methods for assembling complex mitochondrial genomes in land plants. J Exp Bot. 2024:75(17):5169–5174. 10.1093/jxb/erae034.
- Uliano-Silva M, Ferreira JGRN, Krasheninnikova K; Darwin Tree of Life Consortium, Formenti G, Abueg L, Torrance J, Myers EW, Durbin R, Blaxter M, et al. MitoHiFi: a python pipeline for mitochondrial genome assembly from PacBio high fidelity reads. BMC Bioinformatics. 2023:24(1):288. 10.1186/s12859-023-05385-y.
- Wang J. Plant organellar genomes: much done, much more to do. Trends Plant Sci. 2024:29(7):754–769. 10.1016/j.tplants.2023.12.014.
- Wang W, Lanfear R. Long-reads reveal that the chloroplast genome exists in two distinct versions in most plants. Genome Biol Evol. 2019:11(12):3372–3381. 10.1093/gbe/evz256.
- Wang X, Weigel D, Smith LM. Transposon variants and their effects on gene expression in Arabidopsis. PLoS Genet. 2013:9(2):e1003255. 10.1371/journal.pgen.1003255.
- Weng M-L, Becker C, Hildebrandt J, Neumann M, Rutter MT, Shaw RG, Weigel D, Fenster CB. Fine-grained analysis of spontaneous mutation spectrum and frequency in Arabidopsis thaliana. Genetics. 2019:211(2):703–714. 10.1534/genetics.118.301721.
- Wenger AM, Peluso P, Rowell WJ, Chang P-C, Hall RJ, Concepcion GT, Ebler J, Fungtammasan A, Kolesnikov A, Olson ND, et al. Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome. Nat Biotechnol. 2019:37(10):1155–1162. 10.1038/s41587-019-0217-9.
- Wick RR, Schultz MB, Zobel J, Holt KE. Bandage: interactive visualization of de novo genome assemblies. Bioinformatics. 2015:31(20):3350–3352. 10.1093/bioinformatics/btv383.
- Wlodzimierz P, Rabanal FA, Burns R, Naish M, Primetis E, Scott A, Mandáková T, Gorringe N, Tock AJ, Holland D, et al. Cycles of satellite and transposon evolution in Arabidopsis centromeres. Nature. 2023:618(7965):557–565. 10.1038/s41586-023-06062-z.
- Wolfe KH, Li WH, Sharp PM. Rates of nucleotide substitution vary greatly among plant mitochondrial, chloroplast, and nuclear DNAs. Proc Natl Acad Sci U S A. 1987:84(24):9054–9058. 10.1073/pnas.84.24.9054.
- Xiong X, Gou J, Liao Q, Li Y, Zhou Q, Bi G, Li C, Du R, Wang X, Sun T, et al. The Taxus genome provides insights into paclitaxel biosynthesis. Nat Plants. 2021:7(8):1026–1036. 10.1038/s41477-021-00963-5.
- Yoon HS, Hackett JD, Ciniglia C, Pinto G, Bhattacharya D. A molecular timeline for the origin of photosynthetic eukaryotes. Mol Biol Evol. 2004:21(5):809–818. 10.1093/molbev/msh075.
- Yoshida T, Furihata HY, Kawabe A. Patterns of genomic integration of nuclear chloroplast DNA fragments in plant species. DNA Res. 2014:21(2):127–140. 10.1093/dnares/dst045.
- Yoshida T, Furihata HY, To TK, Kakutani T, Kawabe A. Genome defense against integrated organellar DNA fragments from plastids into plant nuclear genomes through DNA methylation. Sci Rep. 2019:9(1):2060. 10.1038/s41598-019-38607-6.
- Zhang G-J, Dong R, Lan L-N, Li S-F, Gao W-J, Niu H-X. Nuclear integrants of organellar DNA contribute to genome structure and evolution in plants. Int J Mol Sci. 2020:21(3):707. 10.3390/ijms21030707.
- Zhou C, Brown M, Blaxter M; The Darwin Tree of Life Project Consortium, McCarthy SA, Durbin R. Oatk: a de novo assembly tool for complex plant organelle genomes. bioRxiv 619857. 10.1101/2024.10.23.619857, 28 October 2024, preprint: not peer reviewed.
- Zhou W, Armijos CE, Lee C, Lu R, Wang J, Ruhlman TA, Jansen RK, Jones AM, Jones CD. Plastid genome assembly using long-read data. Mol Ecol Resour. 2023:23(6):1442–1457. 10.1111/1755-0998.13787.
- Zimorski V, Ku C, Martin WF, Gould SB. Endosymbiotic theory for organelle origins. Curr Opin Microbiol. 2014:22:38–48. 10.1016/j.mib.2014.09.008.