Comparative genomics reveals insight into the evolutionary origin of massively scrambled genomes

Yi Feng; Rafik Neme; Leslie Y Beh; Xiao Chen; Jasper Braun; Michael W Lu; Laura F Landweber

doi:10.7554/eLife.82979

eLife. 2022 Nov 24;11:e82979. doi: 10.7554/eLife.82979

Editors: Detlef Weigel⁵, Detlef Weigel⁶

PMCID: PMC9797194 PMID: 36421078

Abstract

Ciliates are microbial eukaryotes that undergo extensive programmed genome rearrangement, a natural genome editing process that converts long germline chromosomes into smaller gene-rich somatic chromosomes. Three well-studied ciliates include Oxytricha trifallax, Tetrahymena thermophila, and Paramecium tetraurelia, but only the Oxytricha lineage has a massively scrambled genome, whose assembly during development requires hundreds of thousands of precisely programmed DNA joining events, representing the most complex genome dynamics of any known organism. Here we study the emergence of such complex genomes by examining the origin and evolution of discontinuous and scrambled genes in the Oxytricha lineage. This study compares six genomes from three species, the germline and somatic genomes for Euplotes woodruffi, Tetmemena sp., and the model ciliate O. trifallax. We sequenced, assembled, and annotated the germline and somatic genomes of E. woodruffi, which provides an outgroup, and the germline genome of Tetmemena sp. We find that the germline genome of Tetmemena is as massively scrambled and interrupted as Oxytricha’s: 13.6% of its gene loci require programmed translocations and/or inversions, with some genes requiring hundreds of precise gene editing events during development. This study revealed that the earlier diverged spirotrich, E. woodruffi, also has a scrambled genome, but only roughly half as many loci (7.3%) are scrambled. Furthermore, its scrambled genes are less complex, together supporting the position of Euplotes as a possible evolutionary intermediate in this lineage, in the process of accumulating complex evolutionary genome rearrangements, all of which require extensive repair to assemble functional coding regions. Comparative analysis also reveals that scrambled loci are often associated with local duplications, supporting a gradual model for the origin of complex, scrambled genomes via many small events of DNA duplication and decay.

Research organism: Other, Oxytricha trifallax, Tetmemena, Euplotes woodruffi

Introduction

Organisms do not always contain a single, static genome. Programmed genome editing is a naturally occurring and essential part of development in many organisms, including ciliates (Chen et al., 2014), nematodes (Mitreva et al., 2005), lampreys (Smith et al., 2012), and zebra finches (Biederman et al., 2018). Most of these events involve precise removal and rejoining of large regions of DNA during postzygotic differentiation of a somatic genome from a germline genome. Ciliates are microbial eukaryotes with two types of nuclei: a somatic macronucleus (MAC) that differentiates from a germline micronucleus (MIC). In the model ciliate Oxytricha, the MAC is entirely active chromatin (Beh et al., 2019) and the hub of transcription. The three species that we compare are all spirotrichs, which have gene-sized ‘nanochromosomes’ in the MAC, present at high copy number (Swart et al., 2013; Chen et al., 2015; Wang et al., 2016; Chen et al., 2019; Lindblad et al., 2019; Vinogradov et al., 2012). The diploid MIC participates in sexual reproduction, but its megabase-sized chromosomes are mostly transcriptionally silent.

Gene loci are often arranged discontinuously in the MIC, with short genic segments called macronuclear destined sequences (MDSs), interrupted by stretches of non-coding DNA called internally eliminated sequences (IESs) (Figure 1A). During sexual development, a new MAC genome rearranges from a copy of the zygotic MIC genome. MDSs join in the correct order and orientation, whereas MIC-limited genomic regions undergo programmed deletion, including repetitive elements, intergenic regions, and IESs (Figure 1A). Though analogous to intron splicing, these events occur on DNA. The MDSs for some MAC chromosomes are scrambled if they require translocation or inversion during MAC development (Figure 1A). Pairs of short repeats, called pointers, are present at MDS-IES junctions in both scrambled and nonscrambled loci (Mitcham et al., 1992; Prescott, 1994). Pointer sequences are present twice in the MIC, at the end of MDS n and the beginning of MDS n+1. One copy of the repeat is retained at each MDS-MDS junction in a mature MAC chromosome (Figure 1A). These microhomologous regions help guide MDS recombination, but most are non-unique, and the shortest pointers are just 2 bp. Thousands of long, noncoding template RNAs collectively program MDS joining (Nowacki et al., 2008; Lindblad et al., 2017; Yerlici and Landweber, 2014).
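
The joining logic described above can be illustrated with a toy sketch. It assumes a simplified, hypothetical record for each MDS (its order in the MAC, its coordinates and orientation on a MIC contig, and the length of the pointer it shares with the preceding MDS); this is not the format of any annotation tool used in this study, and the sequences are invented. The sketch calls a locus nonscrambled when its MDSs lie in MAC order on a single strand, and it keeps one copy of each pointer when joining.

```python
from typing import List, NamedTuple


class MDS(NamedTuple):
    mac_index: int    # position of the segment in the mature MAC chromosome (1-based)
    mic_start: int    # 0-based start on the MIC contig
    mic_end: int      # end on the MIC contig (exclusive)
    inverted: bool    # True if the segment is reverse-complemented in the MIC
    pointer_len: int  # length of the pointer shared with the previous MDS (0 for MDS 1)


_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_scrambled(mdss: List[MDS]) -> bool:
    """A locus is nonscrambled if its MDSs keep MAC order on one strand of the MIC."""
    by_mic = sorted(mdss, key=lambda m: m.mic_start)
    order = [m.mac_index for m in by_mic]
    forward = order == sorted(order) and not any(m.inverted for m in mdss)
    reverse = order == sorted(order, reverse=True) and all(m.inverted for m in mdss)
    return not (forward or reverse)


def assemble_mac(mic: str, mdss: List[MDS]) -> str:
    """Join MDSs in MAC order, retaining a single copy of each pointer."""
    product = ""
    for m in sorted(mdss, key=lambda m: m.mac_index):
        segment = mic[m.mic_start:m.mic_end]
        if m.inverted:
            segment = revcomp(segment)
        # The pointer at the start of this MDS is already present
        # at the end of the previous MDS, so it is skipped here.
        product += segment[m.pointer_len:]
    return product


# Toy MIC locus: MDS 1, IES, MDS 3, IES, inverted MDS 2 (pointers "TA" and "GT").
mic = "ATGCCTA" + "CCCC" + "GTTTGA" + "AAAA" + "ACTGCCTA"
locus = [
    MDS(1, 0, 7, False, 0),
    MDS(3, 11, 17, False, 2),
    MDS(2, 21, 29, True, 2),
]
print(is_scrambled(locus))        # True
print(assemble_mac(mic, locus))   # ATGCCTAGGCAGTTTGA
```

In this toy locus, MDS 2 must be both translocated and inverted, so the locus counts as scrambled; removing the two IESs without reordering would not yield the MAC product.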

Figure 1. — (A) Diagram of genome rearrangement in *Oxytricha*. Each ciliate cell contains a somatic macronucleus (MAC) and a germline micronucleus (MIC). During development, the MAC genome rearranges from a copy of the MIC genome. (1) Nonscrambled genes rearrange simply by joining consecutive macronuclear destined sequences (MDSs, blue boxes) and removing internal eliminated sequences (IESs, thin lines). (2) Rearrangement of scrambled genes requires MDS translocation and/or inversion. Pointers are microhomologous sequences (colored vertical bars) present in two copies in the MIC and only one copy in the MAC where consecutive MDSs recombine. (B) Comparison of genome rearrangement features of representative ciliates and the non-ciliate *Plasmodium falciparum* as an outgroup (phylogenetic information is based on Parfrey et al., 2011; Bracht et al., 2013). Conclusions from this study are shown in bold. * indicates that some scrambled pointers in *Euplotes woodruffi* are much longer, as discussed in the results. Statistics for pointers ≤30 bp in *E. woodruffi* are shown. Table information derives from the following sources: 1 - Swart et al., 2013; 2 - Lindblad et al., 2019; 3 - Chen et al., 2014; 4 - Chen et al., 2015; 5 - Sheng et al., 2020; 6 - Eisen et al., 2006; 7 - Hamilton et al., 2016; 8 - Aury et al., 2006; 9 - Guérin et al., 2017; 10 - Arnaiz et al., 2012; 11 - Riley and Katz, 2001; 12 - Maurer-Alcalá et al., 2018a; 13 - Katz and Kovner, 2010; 14 - Gao et al., 2014.

Numerous studies have inferred the possible scope of genome rearrangement in different ciliate species using partial genome surveys. In Paramecium, PiggyMac-depleted cells fail to remove MIC-limited regions properly, which provided a resource to annotate ~45,000 IESs prior to assembly of a draft MIC genome (Arnaiz et al., 2012). The use of single-cell sequencing has allowed pilot studies to sample partial MIC genomes of diverse species (Chen et al., 2019; Maurer-Alcalá et al., 2018a; Maurer-Alcalá et al., 2018b; Smith et al., 2020). Alignment of tentative MIC reads to either assembled MAC genomes or single-cell transcriptome data predicts over 20 candidate scrambled loci in two basal ciliates, Loxodes sp. and Blepharisma americanum (Maurer-Alcalá et al., 2018b) and hundreds of candidate loci in the tintinnid Schmidingerella arcuata (Smith et al., 2020). Nearly one-third (31%) of approximately 5000 surveyed transcripts may be scrambled in Chilodonella uncinata (Maurer-Alcalá et al., 2018a, Figure 1B), which has four confirmed cases of scrambled genes (Katz and Kovner, 2010; Gao et al., 2014). Transcriptome-based surveys offer less precise estimates and cannot distinguish RNA splicing. Several computational pipelines have been developed to facilitate the inference of genome rearrangement features by split-read mapping in the absence of complete MIC or MAC reference genomes (Denby Wilkes et al., 2016; Zheng et al., 2020; Feng et al., 2020; Seah et al., 2021). By surveying lighter genome coverage prior to full sequencing, these tools provide partial insight into germline architecture. This helps guide selection of species for full genome sequencing and subsequent construction of complete rearrangement maps between the MIC and MAC genomes. High-quality MIC genome reference assemblies are only currently available for three ciliate genera: Oxytricha (Chen et al., 2014), Tetrahymena (Hamilton et al., 2016), and Paramecium (Guérin et al., 2017; Sellis et al., 2021).

Programmed genome rearrangements in Oxytricha exhibit the highest accuracy and largest scale of any known natural gene-editing system, with exquisite control over hundreds of thousands of precise DNA cleavage/joining events. Accordingly, its germline genome structure is arguably the most complex of any model organism (Chen et al., 2014), requiring programmed deletion of over 90% of the germline DNA during development and massive descrambling of the resulting fragments to construct a new MAC genome of over 18,000 chromosomes (Lindblad et al., 2019). This differs from the distantly related Tetrahymena and Paramecium that both eliminate ~30% of the germline genome (Hamilton et al., 2016; Guérin et al., 2017). Paramecium uses exclusively 2 bp pointers and lacks evidence of any scrambled loci. A small number of scrambled loci (4 confirmed out of 2711 candidates) have been reported in Tetrahymena (Sheng et al., 2020, Figure 1B). Tetrahymena and Paramecium diverged from Oxytricha over 1 billion years ago (Parfrey et al., 2011; Bracht et al., 2013), which leaves a large gap in our understanding of the emergence of complex DNA rearrangements in the Oxytricha lineage.

Open questions include how the Oxytricha germline genome acquired its high number of IES insertions and how scrambled loci arise and evolve. Three previous studies tackled these questions at the level of single genes and orthologs, including DNA polymerase α, actin I, and Telomere end-binding protein subunit α (Hogan et al., 2001; Chang et al., 2005; Wong and Landweber, 2006; DuBois and Prescott, 1995). Here, we provide the first comparative genomic analysis of Oxytricha trifallax and two other spirotrichous ciliates, Tetmemena sp. and Euplotes woodruffi. Tetmemena sp. is a hypotrich similar to Tetmemena pustulata, formerly Stylonychia pustulata (Chen et al., 2015), in the same family as O. trifallax (Figure 1B; Chen et al., 2014; Chen et al., 2015). Hypotrichs are noted for the presence of scrambled genes, based on previous ortholog comparisons (Chen et al., 2015; Hogan et al., 2001; Chang et al., 2005; DuBois and Prescott, 1995; Figure 1B). E. woodruffi, together with the hypotrichous ciliates, belongs to the class Spirotrichea (Figure 1B). Like hypotrichs, Euplotes also has gene-sized nanochromosomes in the MAC genome (Wang et al., 2016; Chen et al., 2019; Chen et al., 2021), but this outgroup uses a different genetic code (UGA is reassigned to cysteine; Meyer et al., 1991), and little is known about its MIC genome. A partial MIC genome of Euplotes vannus was previously assembled, and it contains highly conserved TA pointers (Chen et al., 2019), consistent with previous observations in Euplotes crassus (Klobutcher and Herrick, 1995). This differs from O. trifallax, which uses longer pointers of varying lengths, with scrambled pointers typically longer than nonscrambled ones (Chen et al., 2014, Figure 1B). This observation suggests that longer pointers may supply more information to facilitate MDS descrambling, sometimes over great distances.
Therefore, the preponderance of 2 bp pointers in the other Euplotes species could indicate limited capacity to support scrambled genes, and a partial genome survey of E. vannus concluded that at least 97% of loci are nonscrambled (Chen et al., 2019). Early studies of Euplotes octocarinatus, on the other hand, demonstrated its use of longer pointers (that usually contain TA) (Tan et al., 1999; Wang et al., 2005), suggesting that some members of the Euplotes genus may have the capacity to support complex genome reorganization. To investigate the origin of scrambled genomes, we chose E. woodruffi as an outgroup, because it is closely related to E. octocarinatus (Syberg-Olsen et al., 2016) and feasible to culture in the lab.

This study includes the de novo assemblies of the micronuclear genome of Tetmemena sp. and both genomes of E. woodruffi. The availability of MIC and MAC genomes for both species allows us to annotate and compare their genome rearrangement maps and other key features to each other and to O. trifallax. The MIC genome of Tetmemena is extremely interrupted, like that of Oxytricha. While the E. woodruffi MIC genome is much more IES-sparse, it contains thousands of scrambled genes, whose architecture we compare to orthologous loci in the other species. We infer that the evolutionary origin of scrambled genes is associated with local duplications, providing strong support for a previously proposed simple evolutionary model requiring only duplication and decay (Gao et al., 2015) that allows for the evolutionary expansion of extremely rearranged chromosome architectures.

Results

Germline genome expansion via repetitive elements

Tetmemena sp. and E. woodruffi were both propagated in laboratory culture from single cells. The E. woodruffi MAC genome was sequenced and assembled from paired-end Illumina reads from whole cell DNA, which is mostly MAC-derived. For comparative analysis, the MAC genome of E. woodruffi was assembled using the same pipeline previously used for Tetmemena sp. (Chen et al., 2015). Because MIC DNA is significantly more sparse than MAC DNA in individual cells (Prescott, 1994), MIC DNA was enriched before sequencing (see Methods); however, this leads to much lower sequence coverage of the MIC than the MAC. Third-generation long reads (Pacific Biosciences and Oxford Nanopore Technologies) were combined with Illumina paired-end reads (Methods, see genome coverage in Supplementary file 1) to construct hybrid genome assemblies for Tetmemena sp. and E. woodruffi. Though the final genome assemblies are still fragmented, often due to transposon or other repetitive insertions at boundaries (Figure 2—figure supplement 1), the current draft assemblies cover most (>90%) MDSs for 89.1% of MAC nanochromosomes in Tetmemena, and for 90.0% of MAC nanochromosomes in E. woodruffi. This allowed us to establish near-complete rearrangement maps for the newly assembled genomes of Tetmemena and E. woodruffi, at a level comparable to the published reference for O. trifallax (Chen et al., 2014), which is appropriate for comparative analysis.

Table 1 shows a comparison of genome features for the three species. The three MAC genomes are similar in size, with most nanochromosomes bearing only one gene. The size distributions of MAC chromosomes are similar for the three species, though slightly shorter for E. woodruffi, consistent with prior observation via gel electrophoresis (Prescott, 1994, Figure 2—figure supplement 2). As in O. trifallax (Swart et al., 2013), the maximum number of genes encoded on one chromosome is 7–8 (Table 1). Surprisingly, the MIC genome sizes differ substantially: the Tetmemena MIC genome assembly is 237 Mbp, nearly half that of Oxytricha. The E. woodruffi MIC genome assembly is even smaller, approximately 172 Mbp (Table 1).

Table 1. Statistics of somatic macronucleus (MAC) and germline micronucleus (MIC) genomes in three species.

	Oxytricha trifallax		Tetmemena sp.		Euplotes woodruffi
	MAC^a,*	MIC^b	MAC^c	MIC^†	MAC^†	MIC^†
Genome size (Mbp)	67.1	496	60.6	237	72.2	172
N50 (bp)	3745	27,807	3339	14,722	2702	44,656
GC%	31.36	28.44	37.05	32.17	36.56	35.31
Number of contigs^‡	22,426	25,720	25,206	28,446	35,099	17,655
Two-telomere contigs	14,225	-	15,802	-	19,061	-
Telomeric contigs	20,336	-	21,165	-	28,294	-
Single-gene telomeric contigs	76.1%	-	75.5%	-	68.5%	-
Maximum number of genes on a telomeric contig	8	-	7	-	8	-

a - Swart et al., 2013; b - Chen et al., 2014; c - Chen et al., 2015.

^* This study used the MAC genome of Oxytricha from Swart et al., 2013 instead of the long-read assembly in Lindblad et al., 2019, because the short MAC genomes in the present study were primarily assembled from Illumina reads, as in Swart et al., 2013. Lindblad et al., 2019 updated Swart et al., 2013 by including nanochromosomes captured in single long reads, which are currently not available for the other two species. The MIC genomes of Tetmemena and E. woodruffi were assembled to a similar N50 as the reference O. trifallax genome (Chen et al., 2014) for comparative analysis.

^† Data from this study.

^‡ Telomere-bearing element (TBE) transposon contaminants in MAC contigs were removed (Methods). Therefore, 24 Oxytricha MAC contigs and 13 Tetmemena MAC contigs were removed from the published versions.

The expansion of repetitive elements in the Oxytricha lineage may contribute to the difference in MIC genome sizes (Figure 2A–C). Oxytricha has a variety of transposable elements (TEs) in the MIC, with telomere-bearing elements (TBEs) of the Tc1/mariner family the most abundant (Chen et al., 2014; Chen and Landweber, 2016, Supplementary file 2). A complete TBE transposon contains three open reading frames (ORFs). ORF1 encodes a 42 kD transposase with a DDE-catalytic motif. Though present only in the germline, TBEs are so abundant in hypotrichs that some were partially recovered and assembled from whole cell DNA (Chen and Landweber, 2016). The Oxytricha MIC genome contains ~10,000 complete TBEs and ~24,000 partial TBEs, which occupy approximately 15.20% (75 Mbp) of the genome (Figure 2A, Supplementary file 3; Chen et al., 2014; Chen and Landweber, 2016). Tetmemena, on the other hand, has many fewer TBE ORFs and only 48 complete TBEs (Supplementary file 3), comprising 1.83% (4.3 Mbp) of its MIC genome (Figure 2B). E. crassus has also been reported to have an abundant transposon family called Tec elements (Transposon of Euplotes crassus). Like TBEs, each Tec consists of three ORFs, and ORF1 also encodes a transposase from the Tc1/mariner family (Baird et al., 1989; Krikau and Jahn, 1991; Jahn et al., 1993; Jahn et al., 1989; Klobutcher and Herrick, 1997). The ~57 kD ORF2 encodes a tyrosine-type recombinase (Doak et al., 2003), and the 20 kD ORF3 has unknown function (Jahn et al., 1993). Using the three ORFs of Tec1 and Tec2 as search queries, we identified 74 complete Tec elements in E. woodruffi. Collectively, Tec ORFs occupy 3.6 Mbp, corresponding to only 2.1% of the MIC genome (Figure 2C). Notably, the transposase-encoding ORF1 is more abundant than the other two TBE/Tec ORFs in all three ciliates (Supplementary file 3), consistent with its proposed role in DNA cleavage during genome rearrangement in Oxytricha (Nowacki et al., 2009).

Figure 2. — (**A–C**) MIC genome categories for (A) *Oxytricha trifallax*, (B) *Tetmemena* sp., and (C) *Euplotes woodruffi*. *Oxytricha* displays the greatest proportion of repetitive elements (telomere-bearing elements [TBE], other repeats, and tandem repeats) relative to the other species. *Oxytricha* MIC-specific genes were annotated in Chen et al., 2014; Miller et al., 2021. (**D–F**) Phylogenetic analysis of the three TBE open reading frames (ORFs) in *Oxytricha* and *Tetmemena*: (D) 42 kD, (E) 22 kD, and (F) 57 kD, suggests that TBE3 (green) is the ancestral transposon family in *Oxytricha*. For each ORF, 30 protein sequences from each species were randomly subsampled and maximum likelihood trees constructed using PhyML (Guindon et al., 2010).

Oxytricha contains three families of TBEs. TBE3 appears to be the most ancient among hypotrichs, based on previous analysis of limited MIC genome data (Chen and Landweber, 2016). We constructed phylogenetic trees using randomly subsampled TBE sequences for all three ORFs from Oxytricha and Tetmemena (Figure 2D–F). This confirmed that only TBE3 is present in the Tetmemena MIC genome, as proposed in Chen and Landweber, 2016. This also suggests that TBE1 and TBE2 expanded in Oxytricha after its divergence from other hypotrichous ciliates. As illustrated in Figure 2—figure supplement 1, the MIC genome contexts of TBEs in Oxytricha and Tetmemena are similar, with many TE insertions within IESs, consistent with either IESs as hotspots for TE insertion or with the model (Klobutcher and Herrick, 1997) that some TE insertions may have generated IESs, as demonstrated in Paramecium (Sellis et al., 2021; Feng and Landweber, 2021). Subsequent sequence evolution at the edges of IES/MDS pointers (DuBois and Prescott, 1995) can give rise to boundaries that no longer correspond precisely to TBE ends. For further discussion of the conservation of TBE locations, see the section, ‘Oxytricha and Tetmemena share conserved rearrangement junctions’ below.

Additionally, RepeatModeler/RepeatMasker identified more MIC repeats in the ‘Other’ category in Oxytricha than in Tetmemena or E. woodruffi (Figure 2, subcategories of repeat content in Supplementary file 2). In total, 214 Mbp of the Oxytricha MIC genome (43%, which is greater than the 35.9% reported in Chen et al., 2014, which used earlier versions of the software) is considered repetitive (including TBEs, tandem repeats, and other repeats in Figure 2), versus 31.7 Mbp for Tetmemena (13.4%) and 28.5 Mbp (16.8%) for E. woodruffi. Oxytricha’s additional ~180 Mbp in repeat content partially explains the significantly larger MIC genome size of Oxytricha versus the other spirotrich ciliates.
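
The arithmetic behind this comparison can be checked with a short sketch that uses only the rounded values quoted in the text and in Table 1. The variable names are illustrative, and because the inputs are rounded, the recomputed percentages may differ slightly from the published ones.

```python
# MIC assembly sizes (Table 1) and repeat content quoted in the text, in Mbp.
mic_size = {"Oxytricha": 496, "Tetmemena": 237, "E. woodruffi": 172}
repeats = {"Oxytricha": 214, "Tetmemena": 31.7, "E. woodruffi": 28.5}

# Fraction of each MIC assembly annotated as repetitive.
repeat_pct = {sp: 100 * repeats[sp] / mic_size[sp] for sp in mic_size}

# How much of the Oxytricha/Tetmemena size difference the extra repeats account for.
extra_repeats = repeats["Oxytricha"] - repeats["Tetmemena"]
size_gap = mic_size["Oxytricha"] - mic_size["Tetmemena"]

for species, pct in repeat_pct.items():
    print(f"{species}: {pct:.1f}% repetitive")
print(f"extra repeats in Oxytricha vs. Tetmemena: {extra_repeats:.1f} Mbp")
print(f"share of the {size_gap} Mbp size gap: {100 * extra_repeats / size_gap:.0f}%")
```

With these inputs the additional repeat content covers most, but not all, of the size difference between the two hypotrich MIC assemblies, in line with the statement that repeats only partially explain it.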

The E. woodruffi genome has fewer IESs

We used the genome rearrangement annotation tool, Scrambled DNA Rearrangement Annotation Protocol (SDRAP, Braun et al., 2022) to annotate the MIC genomes of Oxytricha, Tetmemena, and E. woodruffi (Methods). Consistent with their close genetic distance, the genomes of O. trifallax and Tetmemena have similarly high levels of discontinuity (Figure 3A). We annotated over 215,299 MDSs in Oxytricha and over 215,624 in Tetmemena with similar MDS length distributions (Figure 3A). By contrast, E. woodruffi MDSs are typically longer, which indicates a less interrupted genome (Figure 3A). We compared the number of MDSs between single-copy orthologs for single-gene MAC chromosomes across the three species and found that the orthologs have similar coding sequence (CDS) lengths (Figure 3—figure supplement 1A–B). There is a strong positive correlation between the numbers of MDSs for orthologous genes in Oxytricha and Tetmemena (R²=0.75, Figure 3B). There is no correlation between the numbers of MDSs in orthologs of E. woodruffi and Oxytricha (R²=0.003, Figure 3C), since E. woodruffi orthologs typically contain fewer MDSs.
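
For readers unfamiliar with the R² values quoted here, the following is a minimal sketch of a least-squares fit between per-ortholog MDS counts in two species. The function name and the five data points are hypothetical and are not data from this study; the real comparison uses roughly 900 ortholog pairs per species pair.

```python
from typing import Sequence, Tuple


def linear_fit_r2(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Return (slope, intercept, R^2) of the least-squares line y = slope * x + intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For simple linear regression, R^2 equals the squared Pearson correlation.
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical MDS counts for five ortholog pairs (illustration only).
mds_species_a = [1, 2, 3, 4, 5]
mds_species_b = [2, 4, 5, 4, 5]
slope, intercept, r2 = linear_fit_r2(mds_species_a, mds_species_b)
print(f"slope={slope:.2f} intercept={intercept:.2f} R2={r2:.2f}")
# slope=0.60 intercept=2.20 R2=0.60
```

An R² near 0.75 means that MDS counts in one species predict most of the variation in the other, whereas an R² near zero means that they carry essentially no such information.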

Figure 3. — (A) Macronuclear destined sequences (MDSs) of *Euplotes woodruffi* are longer compared to *Oxytricha* or *Tetmemena*. (B) Positive correlation between the numbers of MDSs for orthologous genes in *Tetmemena* and in *Oxytricha* for 903 single-gene orthologs. Black line is the function of linear regression (R²=0.75). Red line is y=x. (C) Orthologs in *E. woodruffi* have fewer MDSs compared to *Oxytricha*, with no correlation (R²=0.003). Note that many highly discontinuous genes in *Oxytricha* are IES-less in *E. woodruffi* (present on one MDS). 917 single-gene orthologs are shown. (D) Distribution of pointers on single-gene somatic macronucleus (MAC) chromosomes in *Oxytricha* vs. (E) *E. woodruffi*, with MAC chromosomes oriented in gene direction. Pointers significantly accumulate at the 5’ end of single-gene MAC chromosomes in *E. woodruffi*. (F) Pointer positions on 3684 two-MDS MAC chromosomes demonstrate a preference upstream of the start codon.

The E. woodruffi genome is generally much less interrupted than that of Oxytricha or Tetmemena. In E. woodruffi, 39.9% of MAC nanochromosomes lack IESs (IES-less nanochromosomes), compared to only 4.1% and 4.4% in Oxytricha and Tetmemena, respectively. The sparse IES distribution (as measured by plotting pointer distributions) in E. woodruffi displays a curious 5’ end bias on single-gene MAC chromosomes, oriented in gene direction (Figure 3E). A weak 5’ bias is also present in Oxytricha (Figure 3D) and Tetmemena (Figure 3—figure supplement 1C). In addition, E. woodruffi IESs preferentially accumulate in the 5’ UTR, a short distance upstream of start codons (Figure 3F). Notably, the median distance between the 5’ telomere addition site and the start codon in E. woodruffi is just 54 bp for single-gene chromosomes, approximately half that of Oxytricha (Swart et al., 2013).

E. woodruffi has an intermediate level of genome scrambling

Scrambled genome rearrangements exist in all three species, which we report here for the first time in Tetmemena and the early diverged E. woodruffi. Previous studies have described scrambled genes with confirmed MIC-MAC rearrangement maps for a limited number of hypotrich species (Chen et al., 2014; Chen et al., 2015; Hogan et al., 2001; Chang et al., 2005; Wong and Landweber, 2006; DuBois and Prescott, 1995) and Chilodonella (Katz and Kovner, 2010; Gao et al., 2014) but not in Euplotes. Consistent with the phylogenetic placement of Euplotes as an earlier diverged outgroup to hypotrichs (Lynn, 2008; Gao et al., 2016), the E. woodruffi genome is scrambled, but it contains approximately half as many scrambled genes (2429 genes encoded on 1913 chromosomes, or 7.3% of genes), versus 15.6% scrambled in O. trifallax (3613 genes encoded on 2852 chromosomes) and 13.6% in Tetmemena (3371 genes encoded on 2556 chromosomes). The E. woodruffi lineage may therefore reflect an evolutionary intermediate stage between ancestral genomes with only modest levels of genome scrambling and the more massively scrambled genomes of hypotrichs.

We infer that many genes were likely scrambled in the last common ancestor of Oxytricha and Tetmemena, because these two species share approximately half of their scrambled genes (Supplementary file 4). Furthermore, most scrambled genes are not new genes, since they possess at least one ortholog in other ciliate species (Supplementary file 4, Supplementary file 5).

Scrambled genes are associated with local paralogy

Notably, scrambled genes in all three species generally have more paralogs (Figure 4). We identified orthogroups containing genes derived from the same gene in the last common ancestor of the three species (Methods). For each species, orthogroups with at least one scrambled gene are significantly larger than those containing no scrambled genes (p-value <1e−5, Mann-Whitney U test, Figure 4A–C). This association suggests a possible role of gene duplication in the origin of scrambled genes.
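
The orthogroup-size comparison above can be sketched with a dependency-free Mann-Whitney U test. This is an illustrative implementation using midranks and the tie-corrected normal approximation, not the code used in this study, and the orthogroup sizes below are invented for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def midranks(values: Sequence[float]) -> List[float]:
    """1-based ranks in which tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (U statistic for x, two-sided p-value from the normal approximation)."""
    n1, n2 = len(x), len(y)
    pooled = list(x) + list(y)
    ranks = midranks(pooled)
    u_x = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    # Variance of U with the standard correction for tied values.
    n = n1 + n2
    counts: Dict[float, int] = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:  # all observations identical
        return u_x, 1.0
    z = (u_x - n1 * n2 / 2) / math.sqrt(variance)
    return u_x, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical orthogroup sizes (number of genes per orthogroup), for illustration only.
scrambled_groups = [3, 5, 8, 4, 6, 9, 7]
nonscrambled_groups = [1, 2, 1, 3, 2, 1, 2, 4]
u, p = mann_whitney_u(scrambled_groups, nonscrambled_groups)
print(f"U={u}, two-sided p={p:.4f}")
```

A rank-based test suits this comparison because orthogroup sizes are small integers with many ties and a long right tail, so a comparison of means alone could be misleading.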

Figure 4. — Orthogroups containing at least one scrambled gene (‘scrambled’) are larger than orthogroups that lack scrambled genes (‘nonscrambled’) in (A) *Oxytricha*, (B) *Tetmemena*, and (C) *Euplotes woodruffi*.

Scrambled pointers are generally longer than nonscrambled ones in all three species (Figure 3—figure supplement 2), consistent with prior observations (Chen et al., 2014) and the possibility that longer pointers participate in more complex rearrangements, including recombination between MDSs separated by greater distances (Landweber et al., 2000). Scrambled and nonscrambled IESs also differ in their length distribution (Figure 3—figure supplement 2). Curiously, scrambled ‘pointers’ in E. woodruffi can be as long as several hundred base pairs (median 48 bp, average 212 bp) unlike the more typical 2–20 bp canonical pointers. These long ‘pointers’ in E. woodruffi are more likely partial MDS duplications (Figure 4—figure supplement 1A). We also identified MDSs that map to two or more paralogous regions within the same MIC contig (Supplementary file 6), therefore representing MDS duplications and not alleles. Such paralogous regions could be alternatively incorporated into the rearranged MAC product. Moreover, we find that, for all three species, there are significantly more scrambled chromosomes than nonscrambled MAC chromosomes that contain at least one paralogous MDS (chi-square test, p-value <1e−10; Supplementary file 6). An example is shown in Figure 4—figure supplement 1A (MDS 7 and 7').
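
The chi-square comparison of scrambled and nonscrambled chromosomes can be written as a 2×2 contingency test. The sketch below is a minimal illustration: it computes the Pearson statistic without continuity correction, and the counts are hypothetical placeholders rather than the values in Supplementary file 6.

```python
import math
from typing import Tuple


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square test (1 degree of freedom) for the table [[a, b], [c, d]].

    Rows: scrambled / nonscrambled MAC chromosomes.
    Columns: with / without at least one paralogous MDS.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, chi2 is a squared standard normal deviate,
    # so the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts, for illustration only.
chi2, p = chi_square_2x2(300, 700, 400, 8600)
print(f"chi2={chi2:.1f}, p={p:.3g}")
```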

The presence of paralogous MDSs can contribute to the origin of scrambled rearrangements, as proposed in an elegant model by Gao et al., 2015; illustrated in Figure 4—figure supplement 1B. The model proposes that initial MDS duplications permit alternative use of either MDS copy into the mature MAC chromosome. As mutations accumulate in redundant paralogs, cells that incorporate the least decayed MDS regions into the MAC gene would have both a fitness advantage and a better match to the template RNA (Nowacki et al., 2008) that guides rearrangement, thus increasing the likelihood of incorporation into the MAC chromosome. The paralogous regions containing more mutations would gradually decay into IESs, and scrambled pointers would eventually be reduced to a shorter length. The extended-length ‘pointers’ that we identified in E. woodruffi may reflect an intermediate stage in the origin of scrambled genes (Figure 4—figure supplement 1B).

This model may generally explain the abundance and expansion of ‘odd-even’ patterns in ciliate scrambled genes (Landweber et al., 2000; Burns et al., 2016). As illustrated in Figure 4—figure supplement 1A, the even- and odd-numbered MDSs for many scrambled genes derive from different MIC genome clusters. The model predicts that the IES between MDS n−1 and n+1 often derives from ancestral duplication of a region containing MDS n (Figure 4—figure supplement 2A). To test this hypothesis explicitly, we extracted from all odd-even scrambled loci in the three species all sets of corresponding MDS/IES pairs that are flanked by identical pointers on both sides, i.e., all pairs of scrambled MDSs and IESs, where the IES between MDS n−1 and n+1 is directly exchanged for MDS n during DNA rearrangement (S1 and S2 in Figure 4—figure supplement 2A). To exclude the possibility of alleles confounding this analysis, MDS and IES pairs were only considered if they map to the same MIC contig. In E. woodruffi, the lengths of these MDS/IES pairs strongly correlate (Spearman correlation ρ=0.755, p<1e−5, Figure 4—figure supplement 2B). Moreover, many MDS and IES sequence pairs also share sequence similarity, consistent with paralogy: for 248 MDS-IES pairs of similar length, 90.3% share a core sequence with ~97.5% identity across 8–100% of both the IES and MDS length. The lowest end of these observations is also compatible with an alternative model (Chang et al., 2005) in which direct recombination between IESs and MDSs at short repeats can lead to expansion of odd-even patterns. For Oxytricha and Tetmemena, the MDS and IES lengths for such MDS/IES pairs also display a weakly positive correlation (p-values and Spearman correlation ρ shown in Figure 4—figure supplement 2D–E). Remarkably, the odd-even-containing loci that are species-specific, and therefore became scrambled more recently, have the strongest length correlation (Figure 4—figure supplement 2C–E) and more pairs that display sequence similarity (Supplementary file 7) relative to older loci (scrambled in two or more species). This result is consistent with an evolutionary process in which mutations accumulate in one copy of the MDS, gradually obscuring its sequence homology and ability to be incorporated as a functional MDS, and eventually its ability to be recognized by the template RNAs that guide DNA rearrangement. This analysis also suggests that most of the odd-even scrambled loci in E. woodruffi arose recently, because there is greater sequence similarity between MDSs and the corresponding IESs that they replace. Conversely, we infer that most loci that are scrambled in both Oxytricha and Tetmemena became scrambled earlier in evolution, since they display weaker sequence similarity between exchanged MDS and IES regions.
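The length comparison above rests on Spearman's rank correlation. As an illustration only, the following minimal Python sketch computes ρ for paired MDS and IES lengths by ranking both lists (ties receive the mean rank) and taking the Pearson correlation of the ranks. The function names and the example lengths are hypothetical and are not data or code from this study.

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of tied values that starts at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical lengths (bp) of exchanged MDS/IES pairs; not data from the study.
mds_lengths = [52, 110, 87, 240, 33]
ies_lengths = [60, 95, 101, 215, 41]
rho = spearman_rho(mds_lengths, ies_lengths)
```

A significance test (the p-values reported above) would require an additional step, such as a permutation test, that this sketch omits.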

Scrambled and nonscrambled genes display nearly identical expression support (the presence of at least one read in all three replicates) in both Oxytricha (Supplementary file 8) and Tetmemena. E. woodruffi has slightly more expression support for nonscrambled vs. scrambled genes (Figure 4—figure supplement 3), which could be explained by more recent acquisition of thousands of scrambled loci in E. woodruffi. In some of those cases the nonscrambled paralogs may still contribute the major function. The distribution of expression levels is similar for scrambled vs. nonscrambled genes in all three species, supporting their authenticity (Figure 4—figure supplement 3), although in a Mann-Whitney U test, the average expression level of three replicates is significantly higher in nonscrambled genes for Oxytricha and E. woodruffi, but not significant for Tetmemena.

Oxytricha and Tetmemena share conserved DNA rearrangement junctions

To understand the conservation of genome rearrangement patterns, we developed a pipeline guided by protein sequence alignment to compare pointer positions for orthologous genes between any two species (Methods, Figure 5A). We compared pointers for 2503 three-species single-copy orthologs. In total, 4448 pointer locations are conserved between Oxytricha and Tetmemena across 1345 ortholog pairs (Supplementary file 9), representing 38.3% of pointers in these orthologs in Oxytricha and 30.9% in Tetmemena. For Oxytricha/E. woodruffi and Tetmemena/E. woodruffi comparisons, 56 and 58 pointer pairs are conserved, respectively. We also identified 23 pointer locations shared among all three species (Supplementary file 9, Figure 5B, Figure 5—source data 1).
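The core idea of comparing pointer positions on a shared alignment can be sketched as follows. This is a simplified illustration, not the pipeline used in the study: it assumes that pointers are given as 0-based positions on the ungapped CDS, that the two CDSs are already aligned with `-` as the gap character, and that a pointer pair counts as conserved when both pointers fall on the same alignment column. The `tolerance` parameter and all names are hypothetical.

```python
def cds_to_alignment_columns(aligned_seq, gap="-"):
    """Map each ungapped CDS position (0-based) to its alignment column."""
    return [col for col, base in enumerate(aligned_seq) if base != gap]

def conserved_pointers(aln_a, pointers_a, aln_b, pointers_b, tolerance=0):
    """Return pointer pairs whose alignment columns differ by <= tolerance."""
    cols_a = cds_to_alignment_columns(aln_a)
    cols_b = cds_to_alignment_columns(aln_b)
    hits = []
    for pa in pointers_a:
        for pb in pointers_b:
            if abs(cols_a[pa] - cols_b[pb]) <= tolerance:
                hits.append((pa, pb))
    return hits

# Toy, hypothetical codon-aware alignment of two orthologous CDSs.
aln_a = "ATGGCT---AAATTT"
aln_b = "ATG---GCTAAATTC"
hits = conserved_pointers(aln_a, [3, 7], aln_b, [7, 10])
```

In this toy case, position 7 of each CDS maps to alignment column 10, so the single pair `(7, 7)` is reported even though the gaps fall in different places in the two sequences.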

Figure 5. — (A) Pipeline for comparison of pointer positions in orthologs. Orthologs are first grouped by OrthoFinder (Emms and Kelly, 2019), and protein sequences of single-copy orthologs aligned by Clustal Omega (Sievers et al., 2011). Then the protein alignments are reverse translated to coding sequence (CDS) alignments by a modified script of pal2nal (105, Methods). Pointers are annotated on the CDS alignments for comparison between any two orthologs. (B) Two examples of pointer conservation across three species. Gray lines represent the alignment of orthologous CDS regions, and boxes show magnified regions containing conserved pointers. The top panel shows a conserved scrambled pointer (*Oxytricha*: Contig889.1.g68; *Tetmemena*: LASU02015390.1.g1; *Euplotes woodruffi*: EUPWOO_MAC_30105.g1). The bottom panel shows a conserved nonscrambled pointer (*Oxytricha*: Contig19750.0.g98; *Tetmemena*: LASU02002033.1.g1; *E. woodruffi*: EUPWOO_MAC_31621.g1). Pointer sequences are noted, and commas indicate reading frame. Protein domains detected by HMMER (Finn et al., 2011) are marked in purple. (C) Examples of telomere-bearing element (TBE) insertions in nonscrambled internally eliminated sequences. The upper pair of sequences shows an *Oxytricha* TBE pointer (orange insertion of an incomplete TBE2 transposon containing the 42-kD and 57-kD open reading frames) conserved with a *Tetmemena* non-TBE pointer (*Oxytricha*: Contig736.1.g130; *Tetmemena*: LASU02012221.1.g1). Both species have a TA pointer at this junction. The bottom pair of sequences illustrates a case of nonconserved TBE pointers (*Oxytricha*: Contig17579.0.g71; *Tetmemena*: LASU02007616.1.g1).

Figure 5—source data 1. Pointers conserved in all three species.

elife-82979-fig5-data1.xlsx (12.3 KB, xlsx)

Figure 5—source data 2. The telomere-bearing element (TBE) pointers in *Oxytricha* that are conserved with non-TBE pointers in *Tetmemena*.

elife-82979-fig5-data2.xlsx (14.7 KB, xlsx)

To test whether these pointer locations are genuinely conserved rather than matching by chance, we performed a Monte Carlo simulation, as also used to study intron conservation (Rogozin et al., 2003). We randomly shuffled pointer positions on CDS regions 1000 times and counted the number of conserved pointer pairs expected for each simulation (Methods). Of the 1000 simulations, none exceeded the observed number of conserved pointer pairs between Oxytricha and Tetmemena (p-value <0.001), suggesting evolutionary conservation of pointer positions (Supplementary file 9). A similar result was obtained for pointers conserved in all three species (Supplementary file 9). However, the numbers of pointer pairs conserved between Oxytricha/E. woodruffi and Tetmemena/E. woodruffi are similar to the expectations by chance (Supplementary file 9). The low level of pointer conservation between either hypotrich and E. woodruffi may reflect the smaller number of IESs in E. woodruffi; hence, most pointers would have arisen in the hypotrich lineage. Furthermore, E. woodruffi is genetically more distant from the two hypotrichs; hence, the accumulation of substitutions would obscure protein sequence homology, which we used to compare pointer locations. For ortholog pairs between Oxytricha and Tetmemena, scrambled pointers are significantly more conserved than nonscrambled ones (chi-square test, p-value <1e−10, Supplementary file 10). We also find that most pointer sequences differ even if the positions are conserved (Figure 5B, Figure 5—source data 1, Supplementary file 11), suggesting that substitutions may accumulate in pointers without substantially altering rearrangement boundaries.
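The logic of such a shuffling test can be illustrated with a short Python sketch. It is a simplification under stated assumptions and not the procedure described in the Methods: pointers of both species are assumed to be expressed in one shared coordinate system per ortholog pair, shuffled positions are drawn uniformly along the CDS, and a pair is conserved only when the positions are identical. The function names, the toy data, and the fixed random seed are hypothetical.

```python
import random

def count_shared(positions_a, positions_b):
    """Number of pointer positions present in both species."""
    return len(set(positions_a) & set(positions_b))

def shuffle_positions(n_pointers, cds_length, rng):
    """Draw n distinct pointer positions uniformly along a CDS."""
    return rng.sample(range(cds_length), n_pointers)

def monte_carlo_p(orthologs, n_sim=1000, seed=0):
    """orthologs: list of (cds_length, positions_a, positions_b).

    Returns (observed count, simulated counts, empirical p-value), where the
    p-value is the fraction of simulations reaching the observed count.
    """
    rng = random.Random(seed)
    observed = sum(count_shared(a, b) for _, a, b in orthologs)
    simulated = []
    for _ in range(n_sim):
        total = 0
        for length, a, b in orthologs:
            total += count_shared(shuffle_positions(len(a), length, rng),
                                  shuffle_positions(len(b), length, rng))
        simulated.append(total)
    n_extreme = sum(1 for s in simulated if s >= observed)
    return observed, simulated, n_extreme / n_sim

# Toy, hypothetical ortholog pairs: (CDS length, pointers in A, pointers in B).
orthologs = [
    (900, [120, 300, 450, 700], [120, 300, 455, 810]),
    (600, [90, 200], [90, 333]),
]
observed, simulated, p_value = monte_carlo_p(orthologs)
```

When no simulation reaches the observed count, the empirical p-value computed here is 0 and is conventionally reported as smaller than 1/`n_sim`, which corresponds to the p-value <0.001 quoted for 1000 simulations.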

Oxytricha and Tetmemena both contain a high copy number of TBE transposons (Chen et al., 2014; Chen and Landweber, 2016; Supplementary file 3). We investigated the level of TBE conservation between these two species. To identify orthologous insertions, we focused on TBE insertions in nonscrambled IESs on single-copy orthologs, which include 1706 Oxytricha TBEs inserted in 1296 nonscrambled IESs (multiple TBEs can be inserted into an IES) and 180 Tetmemena TBEs inserted into 170 nonscrambled IESs. We refer to the pointer flanking a TBE-containing IES as a TBE pointer. No TBE pointer locations are conserved between the two species. This suggests that TBEs might have invaded the genomes of Oxytricha and Tetmemena independently, or may still be actively mobile in the genome. Only 27 Oxytricha TBE pointers (containing 36 TBEs) are conserved with non-TBE pointers in Tetmemena (Figure 5—source data 2, Figure 5C). No Tetmemena TBE pointer is conserved with an Oxytricha non-TBE pointer. This suggests that TBE insertions may preferentially produce new rearrangement junctions instead of inserting into an existing IES.

Intron locations sometimes coincide with DNA rearrangement junctions

Ciliate genomes are generally intron-poor. Oxytricha averages 1.7 introns/gene, Tetmemena has 1.1, and E. woodruffi has 2.2. Among three-species orthologs, intron locations sometimes map near pointer positions (within a 20-bp window, Figure 5B, Figure 5—figure supplement 1). IESs and introns are both noncoding regions that are removed from mature transcripts, though at different stages. A previous single-gene study observed that an IES in Paraurostyla overlaps the position of an intron in Uroleptus, Urostyla, and also the human homolog (Chang et al., 2005). This observation suggested an intron-IES conversion model in which the ability to eliminate non-CDS regions as either DNA or RNA provides a potential backup mechanism. Such interconversion has also been observed between two strains of Stylonychia (Möllenbeck et al., 2006). In the present study, we identified 174 potential cases of intron-IES conversion in the three species (Figure 5—figure supplement 1, Supplementary file 12): 103 (59.2%) E. woodruffi introns map near Oxytricha/Tetmemena pointers. We used a 20-bp window for this analysis, since one would only expect the boundaries of introns and IESs to coincide precisely if they were recent evolutionary conversions. A Monte Carlo simulation for these intron-IES comparisons (Supplementary file 12) revealed that p<0.001 for most three-species comparisons. For two-species comparisons, we identify 306 cases where an intron boundary in one species precisely coincides with a pointer sequence in another species, with strongest statistical support for the comparison between Oxytricha intron positions and Tetmemena IES junctions (p=0.008) (Supplementary file 13). Notably, Tetmemena intron locations rarely coincide with Oxytricha IESs (Supplementary file 13), suggesting a possible bias in the direction of intron-IES conversion during evolution.
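The windowed comparison of intron and pointer positions can be sketched as below. The sketch assumes that intron and pointer positions are already expressed on the same alignment coordinates and interprets the 20-bp window as a maximum distance of 20 bp between an intron and its nearest pointer; the exact definition used in the study may differ, and the names are hypothetical. Setting `window=0` gives the precise-coincidence case used for the two-species comparisons.

```python
from bisect import bisect_left

def introns_near_pointers(intron_positions, pointer_positions, window=20):
    """Return (intron, nearest pointer) pairs with |distance| <= window."""
    pointers = sorted(pointer_positions)
    matches = []
    for intron in intron_positions:
        i = bisect_left(pointers, intron)
        # Only the neighbours on either side of the insertion point can be nearest.
        candidates = pointers[max(i - 1, 0): i + 1]
        if not candidates:
            continue
        nearest = min(candidates, key=lambda p: abs(p - intron))
        if abs(nearest - intron) <= window:
            matches.append((intron, nearest))
    return matches

# Hypothetical positions on a shared CDS alignment.
near = introns_near_pointers([100, 500, 905], [112, 300, 900])
exact = introns_near_pointers([112, 500], [112, 300, 900], window=0)
```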

The observation that E. woodruffi has the most introns but the smallest number of IESs per gene (Figure 3) is consistent with removal of intragenic non-CDS regions as either DNA or RNA. The intron-sparseness of ciliates is compatible with a hypothesis that it is advantageous to eliminate noncoding regions earlier at the DNA level, with intron deletion sometimes providing an opportunity for repair if they fail to be excised as IESs (Chang et al., 2005).

Evolution of complex genome rearrangements: Russian doll genes

Genome rearrangements in the Oxytricha lineage can include overlapping and nested loci, with MDSs for different MAC loci embedded in each other (Chen et al., 2014; Braun et al., 2018). When multiple gene loci are nested in each other, these have been called Russian doll loci (Braun et al., 2018). Oxytricha contains two loci with five or more layers of nested genes (Braun et al., 2018). Oxytricha and Tetmemena display a high degree of synteny and conservation in both Russian doll loci. In the first Russian doll gene cluster, one nested gene (green) is present in Oxytricha but absent in Tetmemena (Figure 6A, Figure 6—figure supplement 1, Figure 6—figure supplement 2), confirmed by PCR (Methods). Oxytricha also has a complete TBE3 insertion in the green gene (Figure 6A, Figure 6—figure supplement 1A), hinting at a possible link between transposition and new gene insertion. In addition, a two-gene chromosome in Oxytricha (orange) is present as two single-gene chromosomes in Tetmemena (Figure 6A, Figure 6—figure supplement 1). In Oxytricha, seven orange MDSs ligate across two other loci via an 18-bp pointer (TATATCTATACTAAACTT) to form a two-gene nanochromosome. However, in Tetmemena, telomeres are added to the ends of both gene loci instead, forming two independent MAC chromosomes (Figure 6A, Figure 6—figure supplement 1). The second Russian doll locus has an example of a long, conserved pointer (orange dotted line) that bridges three other loci (the green and blue scrambled loci and one nonscrambled locus, Figure 6B). Close to this region is a decayed TBE insertion (769 bp) in Oxytricha. None of the E. woodruffi orthologs of both Russian doll loci maps to the same MIC contig, which suggests that the Russian doll clusters arose after the divergence of Euplotes from the common ancestor of Oxytricha and Tetmemena.

Figure 6. — (A) Schematic comparison of the Russian doll gene cluster on *Oxytricha* germline micronucleus (MIC) contig OXYTRI_MIC_87484 vs. *Tetmemena* MIC contig TMEMEN_MIC_21461. Boxes of the same color represent clusters of macronuclear destined sequences (MDSs) for orthologous genes (detailed map in Figure 6—figure supplement 1 and Figure 6—figure supplement 2). Numbers in brackets indicate the number of MDSs in each cluster, grouped by somatic macronucleus (MAC) chromosome. One nested gene (green) in *Oxytricha* is absent from *Tetmemena*. A two-gene chromosome (orange) that derives from seven MDSs in *Oxytricha* is processed as two single-gene chromosomes in *Tetmemena* instead (indicated by black border around orange boxes). The purple gene in *Oxytricha* has two paralogs in *Tetmemena*. Black triangles represent conserved, orthologous, and nonscrambled gene loci inserted between nested Russian doll genes. Empty triangle represents scrambled MDSs for other loci. Gray triangles, complete nonscrambled MAC loci embedded between gene layers in one species with no orthologous gene detected in the other species. Black star, a complete telomere-bearing element (TBE) transposon insertion. Gray star, a partial TBE insertion. (B) *Oxytricha* MIC contig OXYTRI_MIC_69233 vs. *Tetmemena* MIC contig TMEMEN_MIC_22886. Pointer sequences bridging the nested MDSs of orange and green genes are highlighted. The underlined pointer portions are conserved between species, e.g., the last 8 bp of the *Oxytricha* pointer, TAAGTTCAAAGTAG, is identical to the first 8 bp of *CAAAGTAGCTCAATC* in *Tetmemena*, illustrating pointer sliding (DuBois and Prescott, 1995), or gradual shifting of MDS/IES boundaries. White star indicates a decayed TBE with no open reading frame identified.
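The pointer sliding example in panel B can be checked with a few lines of Python that find the longest suffix of one pointer that is also a prefix of the other. The two sequences are taken from the caption; the function name is illustrative.

```python
def suffix_prefix_overlap(left, right):
    """Longest suffix of `left` that is also a prefix of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left[-k:] == right[:k]:
            return left[-k:]
    return ""

oxytricha_pointer = "TAAGTTCAAAGTAG"
tetmemena_pointer = "CAAAGTAGCTCAATC"
shared = suffix_prefix_overlap(oxytricha_pointer, tetmemena_pointer)
print(shared, len(shared))  # prints: CAAAGTAG 8
```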

Discussion

The highly diverse ciliate clade provides a valuable resource for evolutionary studies of genome rearrangement. However, full assembly and annotation of germline MIC genomes have concentrated on the model ciliates Tetrahymena, Paramecium, and Oxytricha. To provide insight into genome evolution in this lineage, we assembled and compared germline and somatic genomes of Tetmemena sp. and an outgroup, E. woodruffi, to that of O. trifallax. This expands our knowledge of the diversity of ciliate genome structures and the evolutionary origin of complex genome rearrangements.

Dramatic variation in transposon copy number (TBE and Tec elements) from the Tc1/mariner family appears to explain most of the variation in MIC genome size. In many eukaryotic taxa, genome size can differ dramatically even for closely related species, a phenomenon known as the ‘C-value paradox’ (Thomas, 1971). Our present observations are compatible with previous reports that the repeat content of the genome, especially transposon content, positively correlates with genome size (Elliott and Gregory, 2015).

Oxytricha has three TBE families in the MIC genome, but only TBE3 is present in Tetmemena, consistent with our previous conclusion that TBE3 is ancestral to the base of the transposon lineage in hypotrichous ciliates (Chen and Landweber, 2016). Tens of thousands of TBE1/2 transposons then expanded specifically in Oxytricha. Despite a high copy number of TBEs in both Oxytricha and Tetmemena, we find no identical TBE locations in nonscrambled IESs, even among syntenic Russian doll regions. These observations suggest that TBEs may be active in these genomes and contribute to the evolution of genome structure.

In the relatively IES-poor genome of E. woodruffi, IESs accumulate upstream of start codons, similar to the 5’ bias of introns in intron-poor organisms (Mourier and Jeffares, 2003). The simplest model to explain 5’ intron bias is homologous recombination between a reverse transcript of an intron-lacking mRNA and the original DNA locus to erase introns in the coding region (Mourier and Jeffares, 2003). A similar mechanism could simultaneously erase IESs in coding regions via germline recombination between the MIC chromosome and a reverse transcript; however, they are usually in different subcellular locations. More plausibly, a source for DNA recombination could be a MAC nanochromosome, since they are already abundant at high copy number, but another source could be by capture of a reverse transcript of a long non-coding template RNA that guides DNA rearrangement (Nowacki et al., 2008; Lindblad et al., 2017). Either recombination event in the germline would lead to loss of IESs, while retaining introns, but neither would necessarily provide a bias for IES-loss in coding regions. Any of these infrequent events would be meaningful on an evolutionary time scale, even if developmentally rare. The 5’ bias of IESs could also reflect an evolutionary bias for continuous coding regions. Alternatively, upstream IESs might regulate gene expression or cell growth (Sellis et al., 2021), like some introns (Parenteau et al., 2019; Morgan et al., 2019).

This study investigated the evolution of scrambled genes by comparing Oxytricha and Tetmemena to E. woodruffi, as an earlier diverged representative of the spirotrich lineage. While E. woodruffi has approximately half as many scrambled genes as Tetmemena and Oxytricha, its genes are also much more continuous. For example, the most scrambled gene in E. woodruffi, encoding a DNA replication licensing factor (EUPWOO_MAC_28518, 3 kb), has only 20 scrambled junctions. The most scrambled gene in Tetmemena (LASU02015934.1, 14.7 kb, encoding a hydrocephalus-inducing-like protein) has 204 scrambled pointers, and the most scrambled gene in Oxytricha (Contig17454.0, 13.7 kb, encoding a dynein heavy chain family protein, Chen et al., 2014) is similarly complex, with 195 scrambled junctions. Together, these observations are consistent with our interpretation that E. woodruffi reflects an evolutionary intermediate stage, as it contains both fewer scrambled loci and fewer scrambled junctions within its scrambled loci. The observation that the most scrambled locus differs in each species is also consistent with the conclusion that complex gene architectures may continue to elaborate independently.

We observed that scrambled genes in each species tend to have more paralogs than nonscrambled genes. Similarly, in C. uncinata (Maurer-Alcalá et al., 2018a), a distantly related ciliate in the class Phyllopharyngea that also has scrambled genes, scrambled gene families (orthogroups) contain more genes (~2.9) than nonscrambled gene families (~1.3) (Maurer-Alcalá et al., 2018a). Apart from duplications at the gene level, E. woodruffi often contains partial MDS duplications at scrambled junctions, annotated as unusually long ‘pointers’ (Figure 4—figure supplement 1). We also demonstrate that odd-even scrambled patterns could readily arise from local duplications (Figure 4—figure supplement 2). These observations are most consistent with a simple model (Gao et al., 2015) in which local duplications permit combinatorial DNA recombination between paralogous germline regions, and mutation accumulation in either paralog establishes an odd-even scrambled pattern that can propagate by weaving together segments from paralogous sources. Other proposed models include the IES-invasion model of Hoffman and Prescott, 1997, which suggested that pairs of IESs could invade an MDS and subsequently recombine with another IES to yield odd-even scrambled regions; however, a previous examination did not find support for this model (Chang et al., 2005). Prescott et al., 1998 also proposed that some odd-even scrambled loci could arise suddenly via reciprocal recombination with loops of A/T-rich DNA, but this does not exploit paralogy, only the high A/T content in the MIC. We previously proposed a gradual model (Chang et al., 2005; Landweber et al., 2000) in which MDS/IES recombination at short AT-rich repeats (precursors to pointers) could generate and propagate odd-even scrambled patterns. While limited comparisons of orthologs favored the stepwise recombination models (Hogan et al., 2001; Chang et al., 2005; Wong and Landweber, 2006; DuBois and Prescott, 1995), none of the earlier models accounted for the widespread existence of partial paralogy, revealed by genome assemblies.

Local duplications provide a buffer against mutations, allowing paralogous MDSs to repair the MAC locus during assembly of odd/even scrambled genes. Therefore, once an odd/even scrambled locus is established, a consequence is that evolution can only proceed in the direction of accumulating more scrambled junctions, as each new mutation in one paralog necessitates repair via incorporation of the other paralog (Figure 4—figure supplement 1B). This shortens the length of the respective MDSs and increases the number of recombination junctions, creating an evolutionary ratchet that drives the increase in scrambling. The absence of an error-free, continuous version of this locus in the germline reduces the possibility of losing the scrambled pattern from the MIC genome, relative to the trend toward decreasing MDS lengths as more mutations accumulate in either paralog, with a resulting increase in the levels of scrambling and fragmentation (Landweber, 2007; Speijer, 2008). The only opportunity to repair a scrambled locus in the MIC would be a rare event that replaces the locus via recombination with a continuous version from the parental MAC, with the source being either parental MAC DNA or a reverse transcript of a template RNA (Nowacki et al., 2008; Lindblad et al., 2017), as discussed above.

Recent exciting reports have also described scrambled genomes in metazoa, including cephalopods (Schmidbaur et al., 2022; Albertin et al., 2022), but those events entail primarily evolutionary shuffling of gene order, without accompanying genome editing or repair. The ciliate lineage is remarkable in having evolved a sophisticated mechanism of RNA-guided genome editing that allows accurate and precise DNA repair of translocations and inversions. The future opportunity to harness this system to develop novel tools for genome editing outside of Oxytricha offers exciting directions.

Methods

DNA collection and sequencing of Tetmemena sp.

Tetmemena sp. (strain SeJ-2015; Chen et al., 2015) was isolated as a single cell from a stock culture and propagated as a clonal strain via vegetative (asexual) cell culture. Cells were cultured in Pringsheim media (0.11 mM Na₂HPO₄, 0.08 mM MgSO₄, 0.85 mM Ca(NO₃)₂, 0.35 mM KCl, pH 7.0) and fed with Chlamydomonas reinhardtii, together with 0.1% (v/v) of an overnight culture of non-virulent Klebsiella pneumoniae. Macronuclei and micronuclei were isolated using sucrose gradient centrifugation (Lauth et al., 1976). Genomic DNA was subsequently purified using the Nucleospin Tissue Kit (Takara Bio USA, Inc). Macronuclear DNA was sequenced and assembled in Chen et al., 2015. Micronuclear DNA was further size-selected via BluePippin (Sage Science) for PacBio sequencing, or via 0.6% (w/v) SeaKem Gold agarose electrophoresis (Lonza) for Illumina sequencing. Micronuclear DNA purification and sequencing protocols are described in Chen et al., 2014.

DNA collection and sequencing for E. woodruffi

E. woodruffi (strain Iz01) was cultured in Volvic water at room temperature and fed with green algae every 2–3 days. We fed cells with C. reinhardtii for MAC DNA collection, and switched to Chlorogonium capillatum for MIC DNA collection. In order to remove algal contamination, cells were starved for at least 2–3 days before collection. Cells were washed and concentrated as in Chen et al., 2014. Because MAC DNA is predominant in whole cell DNA, we used whole cell DNA (purified via NucleoSpin Tissue kit, Takara Bio USA, Inc) for MAC genome sequencing. Paired-end sequencing was performed on an Illumina Hiseq2000 at the Princeton University Genomics Core Facility.

MIC DNA was enriched from whole cell DNA and sequenced via three sequencing platforms (Illumina, Pacific Biosciences, and Oxford Nanopore Technologies). We used conventional and pulsed-field gel electrophoresis (PFGE) to enrich MIC DNA:

(1) High-molecular-weight DNA was separated from whole cell DNA by gel electrophoresis (0.25% agarose gel at 4°C, 120 V for 4 hr). The top band was cut from the gel and purified with the QIAGEN QIAquick kit. The purified high-molecular-weight DNA was directly sent to the group of Dr. Robert Sebra at the Icahn School of Medicine at Mount Sinai for library construction and sequencing. BluePippin (Sage Science) separation was used before sequencing to select DNA >10 kb. DNA was sequenced on two platforms: Illumina HiSeq2500 (150 bp paired-end reads) and PacBio Sequel (SMRT reads).

(2) High-molecular-weight DNA was also enriched by PFGE. E. woodruffi cells were mixed with 1% low-melt agarose to form plugs according to Akematsu et al., 2017, with the addition of a 1 hr incubation with 50 μg/ml RNase (Invitrogen AM2288) in 10 mM Tris-HCl (pH 7.5) at 37°C for RNA depletion. After three washes of 1 hr with 1× TE buffer, the DNA plugs were incubated in 1 mM phenylmethylsulfonyl fluoride (PMSF) to inactivate proteinase K, followed by MspJI (New England Biolabs) digestion at ^mCNNR(9/13) sites to remove contaminant DNA (^mC indicates C5-methylation or C5-hydroxymethylation). Previous reports have shown that no methylcytosine is detectable in vegetative cells of Oxytricha (Bracht et al., 2012), Tetrahymena (Gorovsky et al., 1973), and Paramecium (Cummings et al., 1974), suggesting that C5-methylation and C5-hydroxymethylation are rarely involved in the vegetative growth of the ciliate lineage. We also validated by qPCR that the quantity of two randomly selected MIC loci is unchanged after the MspJI digestion. In contrast, algal genomic DNA is significantly digested by MspJI. Based on these results, we conclude that MspJI digestion can be used to remove bacterial and algal DNA with C5-methylation and C5-hydroxymethylation, leaving E. woodruffi MIC DNA intact. The agarose plugs containing digested DNA were then inserted into wells of 1.0% Certified Megabase agarose gel (Bio-Rad) for PFGE (CHEF-DR II System, Bio-Rad). The DNA was separated at 6 V, 14°C with 0.5× TBE buffer at a 120° angle for 24 hr with switch time of 60–120 s. We validated by qPCR that the E. woodruffi MIC chromosomes were not mobilized from the well, while the MAC DNA migrated into the gel. The MIC DNA was then extracted by phenol-chloroform purification. Library preparation and sequencing were performed at Oxford Nanopore Technologies (New York, NY).

MAC genome assembly of E. woodruffi

We assembled the MAC genome of E. woodruffi using the same pipeline as for Tetmemena sp. (Chen et al., 2015) for comparative analysis: two draft genomes were assembled by SPAdes (Bankevich et al., 2012) and Trinity (Grabherr et al., 2011), and were then merged by CAP3 (Huang and Madan, 1999). Trinity, software developed for de novo transcriptome assembly (Grabherr et al., 2011), has been used to assemble hypotrich MAC genomes (Chen et al., 2015) because their nanochromosome genome structure is similar to transcriptomes, including properties such as variable copy number and alternative isoforms (Lindblad et al., 2019). Telomeric reads were mapped to contigs by BLAT (Kent, 2002), and custom Python scripts further extended contigs and capped them with telomeres when at least five reads piled up at a position near the contig ends (https://github.com/yifeng-evo/Oxytricha_Tetmemena_Euplotes/tree/main/MAC_genome_telomere_capping) (Feng, 2022a). Mitochondrial contigs were removed if they had a TBLASTX (Camacho et al., 2009) hit on the Oxytricha mitochondrial genome (Genbank accession JN383842.1 and JN383843.1) or two Euplotes mitochondrial genomes (Euplotes minuta GQ903130.1, E. crassus GQ903131.1). Algal contigs were removed by BLASTN to all C. reinhardtii nucleotide sequences downloaded from Genbank. Non-telomeric contigs were mapped to bacterial NR by BLASTX to remove bacterial contamination. The genome was further compressed by CD-HIT (Fu et al., 2012) in two steps: (1) contigs <500 bp were removed if 90% of the short contig could be aligned to a contig ≥500 bp with 90% similarity (-c 0.9 -aS 0.9 -uS 0.1); (2) the genome was then compressed at 95% similarity (-c 0.95 -aS 0.9 -uS 0.1). Contigs shorter than 500 bp without telomeres were removed. Nine contigs, likely Tec contaminants from the MIC genome, were also excluded (TBLASTN, ‘-db_gencode 10 -evalue 1e-5’); they could be assembled because of the high copy number of Tec elements in the MIC genome (Genbank accessions of Tec ORFs are AAA62601.1, AAA62602.1, AAA62603.1, AAA91339.1, AAA91340.1, AAA91341.1, AAA91342.1).
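The read pile-up criterion used for telomere capping can be illustrated with a short Python sketch. This is not the authors' script (which is available at the repository cited above); the function name `find_capping_sites`, the input format (one contig coordinate per telomeric read), and the 100-bp `end_window` are assumptions made for illustration. Only the minimum of five reads comes from the text.

```python
from collections import Counter


def find_capping_sites(telomere_positions, contig_length, min_reads=5, end_window=100):
    """Return (position, read_count) pairs that qualify as telomere capping sites.

    telomere_positions: contig coordinates (0-based) at which a telomeric read
        places a telomere-addition site (hypothetical input format).
    end_window: how close to either contig end a site must be (assumed value).
    """
    counts = Counter(telomere_positions)
    sites = []
    for pos, n_reads in sorted(counts.items()):
        near_end = pos < end_window or pos >= contig_length - end_window
        if n_reads >= min_reads and near_end:
            sites.append((pos, n_reads))
    return sites


# Example: five reads pile up at position 10, seven at an internal position,
# and only four near the right end of a 1000-bp contig.
positions = [10] * 5 + [500] * 7 + [990] * 4
print(find_capping_sites(positions, contig_length=1000))
```

In this toy example only position 10 qualifies: the internal pile-up is not near a contig end, and the right-end site has fewer than five supporting reads.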

RNA sequencing of E. woodruffi and Tetmemena sp.

Three biological replicates of total RNA were isolated from asexually growing E. woodruffi and Tetmemena sp. cells using TRIzol reagent (Thermo Fisher Scientific) and enriched for the poly(A)+ fraction using the NEBNext Poly(A) mRNA Magnetic Isolation Module (New England Biolabs). Stranded RNA-seq libraries were constructed using the ScriptSeq v2 RNA-seq library preparation kit (Epicentre) and sequenced on an Illumina NextSeq 500 at the Columbia Genome Center. For E. woodruffi, the transcriptome was assembled by Trinity (Grabherr et al., 2011), and transcript alignments to the MAC genome were generated by PASA (Haas et al., 2003).

Gene prediction of the E. woodruffi MAC genome and validation of MAC genome completeness

We followed the gene prediction pipeline developed by the Broad Institute (https://github.com/PASApipeline/PASApipeline/wiki), using EVidenceModeler (EVM; Haas et al., 2008) to generate the final gene predictions. EVM produced gene structures by a weighted combination of evidence from three sources: ab initio prediction, protein alignments, and transcript alignments (weights of 3, 3, and 10, respectively). Ab initio predictions were generated by the BRAKER2 pipeline (Brůna et al., 2021). Protein alignments for EVM were generated by mapping Oxytricha proteins to the E. woodruffi MAC genome with Exonerate (Slater and Birney, 2005). EVM predicted 33,379 genes on MAC chromosomes with at least one telomere.

We assessed MAC genome completeness using three methods: (1) 28,294 (80.6%) of the 35,099 E. woodruffi MAC contigs have at least one telomere. (2) Among the E. woodruffi genes predicted on telomeric contigs, 88.8% of BUSCO (Simão et al., 2015; Manni et al., 2021) genes in the lineage database alveolata_odb10 were identified as complete. Of the 171 BUSCO genes, 135 are complete and single-copy, 17 are complete and duplicated, 7 are fragmented, and 12 are missing. By this measure, this is the most complete Euplotes MAC genome assembly available. (3) We identified 51 tRNA genes covering all 20 amino acids by tRNAscan-SE (Lowe and Eddy, 1997) in the MAC genome, including two suppressor tRNAs for UAA and UAG.

MIC genome assembly of Tetmemena sp.

The MIC genome of Tetmemena was assembled with a hybrid approach that combines reads from different sequencing platforms. Tetmemena Illumina reads were first assembled by SPAdes (Bankevich et al., 2012, parameters ‘-k 21,33,55,77,99,127 --careful’). PacBio reads were error corrected by FMLRC (Wang et al., 2018) using Illumina reads with default parameters. Corrected PacBio reads were aligned to both the MAC genome and the Illumina MIC assembly with BLASTN. Reads were removed if they started or ended with telomeres or aligned better to the MAC. The remaining reads were assembled with wtdbg2 (Ruan and Li, 2020, parameters ‘-x rs’). The PacBio assembly was polished by Pilon (Walker et al., 2014) with the ‘--diploid’ option. The Illumina and PacBio assemblies were merged by quickmerge (Chakraborty et al., 2016) with the ‘-l 5000’ option.

MIC genome assembly of E. woodruffi

The MIC genome of E. woodruffi was assembled using a procedure similar to that described above for Tetmemena. E. woodruffi reads were filtered to remove bacterial contamination, including abundant high-GC-content contaminants, possibly endosymbionts (Boscaro et al., 2019). Nanopore reads with GC content ≥55% were assembled by Flye (Kolmogorov et al., 2019) with the parameter ‘--meta’ for metagenomic assembly of bacterial contigs. We used kaiju (Menzel et al., 2016) to identify bacterial taxa for these contigs. Nine of the ten highest-coverage contigs derive from Proteobacteria, from which many Euplotes symbionts derive (Boscaro et al., 2019). Illumina reads were removed as bacterial contamination if they mapped perfectly to these metagenomic contigs by Bowtie2 (Langmead and Salzberg, 2012). The cleaned Illumina reads were then assembled by SPAdes with ‘-k 21,33,55,77,99,127’ (Bankevich et al., 2012). PacBio raw reads and Nanopore raw reads with GC content <55% were aligned to a concatenated database containing both the MAC genome and the Illumina MIC assembly with BLASTN. Reads were removed if they started or ended with telomeres or aligned better to the MAC. The remaining PacBio/Nanopore reads were assembled by Flye in ‘--meta’ mode. The PacBio-Nanopore assembly was polished by Pilon with the ‘--diploid’ option. The Illumina and PacBio-Nanopore assemblies were merged by quickmerge with the ‘-l 10000’ option. Contigs shorter than 1 kb were removed.
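The GC-content split applied to the Nanopore reads can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the function names `gc_fraction` and `split_reads_by_gc` and the in-memory dictionary of reads are assumptions; only the 55% threshold is taken from the text.

```python
def gc_fraction(seq):
    """Return the fraction of G/C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def split_reads_by_gc(reads, threshold=0.55):
    """Partition reads into (high_gc, low_gc) dictionaries.

    Reads with GC fraction >= threshold go to high_gc (candidate bacterial
    reads); all others go to low_gc (candidate ciliate reads).
    """
    high_gc, low_gc = {}, {}
    for name, seq in reads.items():
        if gc_fraction(seq) >= threshold:
            high_gc[name] = seq
        else:
            low_gc[name] = seq
    return high_gc, low_gc


# Toy reads (hypothetical names and sequences)
reads = {
    "read1": "GCGCGCGCAT",  # 8 of 10 bases are G or C
    "read2": "ATATATATGC",  # 2 of 10 bases are G or C
}
high_gc, low_gc = split_reads_by_gc(reads)
print(sorted(high_gc), sorted(low_gc))
```

For real data the reads would be streamed from FASTQ files rather than held in a dictionary; the partitioning logic stays the same.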

MIC genome decontamination

The draft MIC genome of Tetmemena was first mapped to telomeric MAC contigs by BLASTN. MIC contigs containing MDSs were included in the final assembly. The rest of the MIC contigs were filtered by a decontamination pipeline: (1) contigs were aligned to the K. pneumoniae genome, C. reinhardtii genome, and the Oxytricha mitochondrial genome by BLASTN to remove contaminants; (2) the remaining contigs were then searched against the bacterial NR database and a ciliate protein database (including protein sequences annotated in Tetrahymena thermophila: http://www.ciliate.org/system/downloads/tet-latest/4-Protein%20fasta.fasta; Paramecium tetraurelia: http://paramecium.cgm.cnrs-gif.fr; and O. trifallax: https://oxy.ciliate.org) by BLASTX. Contigs with a higher bit score to bacterial NR or with G+C >45% were removed. The E. woodruffi MIC genome was decontaminated similarly, with the addition of all Chlorogonium sequences (the algal food source) on NCBI and the two Euplotes mitochondrial genomes (E. minuta GQ903130.1, E. crassus GQ903131.1) to filter contaminants.
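The final filtering rule in step (2) can be written as a short predicate. The sketch below is illustrative only and applies to contigs that do not contain MDSs (MDS-containing contigs were retained regardless). The function name `is_contaminant`, the record layout, and the convention that a missing BLASTX hit is scored as 0.0 are assumptions; the 45% G+C cutoff and the bit-score comparison come from the text.

```python
def is_contaminant(bacterial_bitscore, ciliate_bitscore, gc_percent, gc_cutoff=45.0):
    """Flag a contig whose best bacterial hit outscores its best ciliate hit,
    or whose G+C content exceeds the cutoff.

    Bit scores are the best BLASTX bit scores against each database;
    0.0 stands for "no hit" (an assumed convention).
    """
    return bacterial_bitscore > ciliate_bitscore or gc_percent > gc_cutoff


# Hypothetical contigs with their best bit scores and G+C percentages
contigs = [
    {"name": "contig_a", "bacterial": 310.0, "ciliate": 95.0, "gc": 38.0},
    {"name": "contig_b", "bacterial": 0.0, "ciliate": 220.0, "gc": 33.5},
    {"name": "contig_c", "bacterial": 0.0, "ciliate": 0.0, "gc": 61.2},
]

kept = [
    c["name"]
    for c in contigs
    if not is_contaminant(c["bacterial"], c["ciliate"], c["gc"])
]
print(kept)
```

Here contig_a is removed for its stronger bacterial hit and contig_c for its high G+C content, even though it has no database hit at all.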

Repeat identification

The repeat content in the MIC genomes was identified by RepeatModeler 1.0.10 (Smit and Hubley, 2008) and RepeatMasker 4.0.7 (Smit et al., 2013) with default parameters.

TBE/Tec detection

Representative Oxytricha TBE ORFs (Genbank accession AAB42034.1, AAB42016.1, and AAB42018.1) were used as queries to search for TBEs in the Oxytricha and Tetmemena MIC genomes by TBLASTN (-db_gencode 6 -evalue 1e-7 -max_target_seqs 30000). Tec ORFs were similarly detected by using E. crassus Tec1 and Tec2 ORFs as queries (-db_gencode 10 -evalue 1e-5 -max_target_seqs 30000, Genbank accessions of Tec ORFs are AAA62601.1, AAA62602.1, AAA62603.1, AAA91339.1, AAA91340.1, AAA91341.1, AAA91342.1). Custom Python scripts identified a TBE/Tec as complete when its three ORFs were within 2000 bp of each other and in the correct orientation (https://github.com/yifeng-evo/Oxytricha_Tetmemena_Euplotes/tree/main/TBE_ORFs/TBE_to_oxy_genome_tblastn_parse.py, Chen and Landweber, 2016). Thirty TBE ORFs with >70% completeness were subsampled from each species for phylogenetic analysis (except for the 57 kD ORF in Tetmemena, for which 21 were subsampled). The subsampled TBE ORFs were aligned using MUSCLE (Edgar, 2004), and the alignments were trimmed by trimAl ‘-automated1’ (Capella-Gutiérrez et al., 2009). Phylogenetic trees were constructed using PhyML 3.3 (Guindon et al., 2010).

Rearrangement annotations

SDRAP (Braun et al., 2022) was used to annotate MDSs, pointers, and MIC-specific regions (minimum percent identity for preliminary match annotation = 95, minimum percent identity for additional match annotation = 90, minimum length of pointer annotation = 2). SDRAP requires MAC and MIC genomes as input. For the SDRAP annotation of Oxytricha, we used the MAC genome from Swart et al., 2013 instead of the latest hybrid assembly that incorporated PacBio reads (Lindblad et al., 2019), because the former version was primarily based on Illumina reads, similar to the MAC genomes of Tetmemena (Chen et al., 2015, Genbank GCA_001273295.2) and E. woodruffi, which are also Illumina assemblies. Oxytricha and Tetmemena MAC genomes were preprocessed by removing MAC contigs with TBE ORFs, which were considered MIC contaminants (Chen and Landweber, 2016). SDRAP is a new program, and its rearrangement annotations differ in minor ways from those of Chen et al., 2014, but most annotations are robust (Figure 3—figure supplement 2). Scrambled and nonscrambled junctions/IESs were annotated by custom Python scripts (https://github.com/yifeng-evo/Oxytricha_Tetmemena_Euplotes/tree/main/scrambled_nonscrambled_IES_pointer).

MIC genome categories

Each MIC genome region is assigned to only one category in Figure 2A–C, even if it belongs to more than one category. The assignment is based on the following priority: MDS, TBE/Tec, MIC genes (only available for Oxytricha, which has developmental RNA-seq data), IES, tandem repeats, other repeats, and non-coding non-repetitive regions. For example, an MIC region can be a TBE in an IES, and it is only considered as TBE in Figure 2A–C.
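The priority rule described above amounts to picking the first matching category from an ordered list. The following Python sketch illustrates this; the function name `assign_category` and the representation of a region as a set of labels are assumptions made for illustration, while the order of categories follows the text.

```python
# Categories in decreasing priority, as listed in the text
PRIORITY = [
    "MDS",
    "TBE/Tec",
    "MIC genes",
    "IES",
    "tandem repeats",
    "other repeats",
    "non-coding non-repetitive",
]


def assign_category(labels):
    """Return the single highest-priority category among a region's labels.

    Regions with no recognized label fall into the last category.
    """
    for category in PRIORITY:
        if category in labels:
            return category
    return PRIORITY[-1]


# A TBE located inside an IES is counted only as TBE/Tec
print(assign_category({"IES", "TBE/Tec"}))
```

The example reproduces the case given in the text: a region that is both a TBE and part of an IES is reported as TBE/Tec.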

Ortholog comparison pipeline and Monte Carlo simulations

Orthogroups of genes on telomeric MAC contigs were detected by OrthoFinder with ‘-S blast’ (Emms and Kelly, 2019). Single-copy orthologs were aligned by Clustal Omega (Sievers et al., 2011). Protein alignments were back-translated to CDS alignments by a modified pal2nal script (Suyama et al., 2006, https://github.com/yifeng-evo/Oxytricha_Tetmemena_Euplotes/tree/main/Ortholog_comparison/pal2nal.pl). Two modifications were made to the script: (1) the modified script allows pal2nal to use a different genetic code for each of the three sequences (-codontable 6,6,10); (2) it fixes an error in the original pal2nal script in which codontable 10 for the Euplotid nuclear code was the same as the universal code. Visualization of pointer positions and intron locations on orthologs was implemented by a custom Python script (https://github.com/yifeng-evo/Oxytricha_Tetmemena_Euplotes/blob/main/Ortholog_comparison/visualization_of_ortholog_comparison.py). Pointer positions or intron locations are considered conserved if they are within a 20-bp alignment window on the CDS alignment. Protein domains were annotated by HMMER (Finn et al., 2011). We performed Monte Carlo simulations by randomly shuffling pointer locations on the CDS while keeping their original position distribution. This was implemented by a custom Python script, which transforms the CDS into a circle, rotates pointer positions on the circle, and outputs the shuffled positions on the re-linearized CDS (https://github.com/yifeng-evo/Oxytricha_Tetmemena_Euplotes/blob/main/Ortholog_comparison/shuffle_simulation.py). The null hypothesis of the Monte Carlo test is that pointer positions are conserved by chance. The p-value of the Monte Carlo test is given by N_{expected>observed}/N_{total}, where N_{expected>observed} is the number of simulations in which more pointers are conserved than in the real data, and N_{total} = 1000 in this study.
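The circular shuffle and the resulting p-value can be sketched as follows. This is a simplified illustration, not the authors' shuffle_simulation.py: it assumes that the pointer positions of two orthologs are already expressed in a shared alignment coordinate system, that only one ortholog's pointers are rotated, and that "conserved" means an absolute distance of at most 20 bp. The function names and the fixed random seed are likewise assumptions.

```python
import random


def rotate_positions(positions, cds_length, rng):
    """Treat the CDS as a circle and rotate all pointers by one random offset.

    The circular spacing between pointers is preserved.
    """
    offset = rng.randrange(cds_length)
    return sorted((p + offset) % cds_length for p in positions)


def count_conserved(positions_a, positions_b, window=20):
    """Count pointers in A that have a pointer in B within `window` bp."""
    return sum(
        any(abs(a - b) <= window for b in positions_b) for a in positions_a
    )


def monte_carlo_p_value(positions_a, positions_b, cds_length, n_sim=1000, seed=0):
    """Fraction of simulations with more conserved pointers than observed."""
    rng = random.Random(seed)
    observed = count_conserved(positions_a, positions_b)
    exceed = 0
    for _ in range(n_sim):
        shuffled = rotate_positions(positions_a, cds_length, rng)
        if count_conserved(shuffled, positions_b) > observed:
            exceed += 1
    return exceed / n_sim


# Hypothetical pointer positions on a 1000-bp CDS alignment
pointers_a = [100, 300, 640]
pointers_b = [105, 310, 900]
print(count_conserved(pointers_a, pointers_b))
print(monte_carlo_p_value(pointers_a, pointers_b, cds_length=1000))
```

Rotating all pointers by a single offset, rather than redrawing each position independently, is what keeps the spacing between pointers intact while randomizing their placement along the CDS.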

PCR validation of Russian doll locus

The complex Russian doll locus on MIC contig TMEMEN_MIC_21461 in Tetmemena was validated by PCR to confirm the Tetmemena MIC genome assembly. Tetmemena micronuclear DNA was purified as described previously and used as template for PCR using PrimeSTAR Max DNA polymerase (Takara Bio). Eleven primer sets (Supplementary file 14) were designed to amplify products between 3 kb and 6 kb in length, with overlapping regions between consecutive primer pairs. The resulting PCR products were visualized by agarose gel electrophoresis, and bands of the expected size were extracted using a Monarch DNA Gel Extraction Kit (New England Biolabs). The purified gel bands were cloned using a TOPO XL-2 Complete PCR Cloning Kit (Invitrogen) and transformed into One Shot OmniMAX 2 T1R E. coli cells (Invitrogen); individual clones were grown and their plasmids harvested with a QIAprep Spin Miniprep Kit (QIAGEN). The plasmid ends, as well as the region where the Oxytricha MIC assembly contains inserted MDSs, were Sanger sequenced (Genewiz). Sanger sequencing reads were mapped to the Tetmemena MIC contig TMEMEN_MIC_21461 and visualized using Geneious Prime 2021.1.1 (https://www.geneious.com).

Availability of data and materials

Custom scripts are publicly available at https://github.com/yifeng-evo/Oxytricha_Tetmemena_Euplotes (Feng, 2022b; copy archived at swh:1:rev:fd66a0efeaf9feb2d79e183313192d641b4e5400). DNA-seq reads and genome assemblies are available at GenBank under Bioprojects PRJNA694964 (Tetmemena sp.) and PRJNA781979 (E. woodruffi). Genbank accession numbers for genomes are JAJKFJ000000000 (Tetmemena sp. Micronucleus genome), JAJLLS000000000 (E. woodruffi Micronucleus genome), and JAJLLT000000000 (E. woodruffi Macronucleus genome).

Three replicates of RNA-seq reads for vegetative cells are available at GenBank under accession numbers SRR21815378, SRR21815379, and SRR21815380 for E. woodruffi and SRR21817702, SRR21817703, and SRR21817704 for Tetmemena sp.

MDS annotations for three species are available at https://doi.org/10.5061/dryad.5dv41ns96 and https://knot.math.usf.edu/mds_ies_db/2022/downloads.html (please select species from the drop-down menu).

Acknowledgements

We thank Toshinobu Suzaki (Kobe University) for the gift of E. woodruffi (strain Iz01 from Shizuoka Prefecture) and Chlorogonium capillatum. We thank Sheela George for laboratory support and help with cell collection. We thank David Dai, Eoghan Harrington, John Beaulaurier, and Sissel Juul at Oxford Nanopore Technologies in New York for providing sequencing and advice. We thank Robert Sebra and Melissa Smith for advice and PacBio sequencing. We thank Takahiko Akematsu, Lorraine Symington, and Lea Marie for helping with PFGE. We thank Kaiyi Zhu, Shaojie He, Molly Przeworski, Harmen Bussemaker, and Nataša Jonoska for advice on Monte Carlo simulations. We also thank Scott Roy, Samuel Sternberg, Bill Jack, and all current and past Landweber lab members for discussion about the origin of scrambled genes, as well as David Prescott and Klaus Heckmann for inspiration, and Margarita T Angelova, Sindhuja Devanapally, Danylo Villano and Kehan Bao for comments on the manuscript. This work was supported by the National Institutes of Health, R35GM122555, and National Science Foundation, DMS1764366, and the National Center for Genome Analysis Support computing resources (supported by National Science Foundation DBI1062432, ABI1458641, and ABI1759906 to Indiana University). Rafik Neme was supported by the Pew Latin American Fellows Program.

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Contributor Information

Laura F Landweber, Email: Laura.Landweber@columbia.edu.

Detlef Weigel, Max Planck Institute for Biology Tübingen, Germany.

Funding Information

This paper was supported by the following grants:

National Institutes of Health R35GM122555 to Laura F Landweber, Yi Feng, Leslie Y Beh, Michael W Lu.
National Science Foundation DMS1764366 to Yi Feng.
Pew Latin American Fellows Program to Rafik Neme.
National Science Foundation DBI1062432 to Yi Feng.
National Science Foundation ABI1458641 to Yi Feng.
National Science Foundation ABI1759906 to Yi Feng.

Additional information

Competing interests

No competing interests declared.

The author is currently employed by Illumina.

employed by Pacific Biosciences.

Author contributions

Conceptualization, Data curation, Formal analysis, Investigation, Visualization, Methodology, Writing - original draft, Writing – review and editing.

Conceptualization, Methodology.

Investigation.

Software.

Investigation, Validation.

Conceptualization, Supervision, Funding acquisition, Investigation, Project administration, Writing – review and editing.

Additional files

Supplementary file 1. Sequencing depth statistics for germline micronucleus (MIC) genome assemblies.

*Sequencing data from Chen et al., 2014.

**Raw reads were mapped to the MIC genome assembly by Minimap2 and Bowtie2 (Langmead and Salzberg, 2012). Average coverage was calculated with BBmap (sourceforge.net/projects/bbmap/) pileup.sh for macronuclear destined sequence-containing contigs in the MIC genome assembly.

Supplementary file 2. Subcategories of repeat content in the three species.

Repeat content of the three genomes, as annotated by RepeatMasker (Smit et al., 2013) with additional manual annotation of Telomere-Bearing Element (TBE)/Transposon of Euplotes crassus (TEC) elements. The numbers may differ from Figure 2A–C because some repeats are assigned to other germline micronucleus (MIC) categories in the pie charts (Methods). For example, a MIC region that is both an internally eliminated sequence (IES) and a satellite is assigned as IES in Figure 2A–C, but is counted as a satellite in this table.

Supplementary file 3. Telomere-bearing elements (TBE)/transposon of Euplotes crassus (TEC) elements open reading frames in three species.

* Differs from 10,109 in Chen and Landweber, 2016 because we used different versions of BLAST and custom Python scripts to identify complete TBEs (see Methods).

Supplementary file 4. Orthology among scrambled and nonscrambled genes in the three species.

* The ciliate database was generated by extracting all protein sequences in phylum Ciliophora (taxid: 5878) from the NR database.

Supplementary file 5. Summary of orthologs in each pair of species.

The (i,j) cell shows the number of genes in species i with an ortholog in species j.

* Genes with no ortholog detected by OrthoFinder (Emms and Kelly, 2019) in the other two species.

Supplementary file 6. More scrambled somatic macronucleus (MAC) contigs contain at least one paralogous macronuclear destined sequence that may be involved in alternative rearrangement.

Supplementary file 7. Macronuclear destined sequence (MDS)-internally eliminated sequence (IES) pairs share homologous sequences in the three species (related to Figure 4—figure supplement 2).

Supplementary file 8. Genes with expression support in the three species.

Supplementary file 9. Presence of conserved pointers in three species, with Monte Carlo simulations.

Supplementary file 10. Scrambled pointers are more conserved than nonscrambled pointers.

Supplementary file 11. Most pointers conserved in position are different in sequence.

Supplementary file 12. Intron-IES conversion comparison in three species and Monte Carlo simulations.

Supplementary file 13. Pairwise intron-IES conversion comparisons and Monte Carlo simulations.

Supplementary file 14. PCR primers for validation of the Russian doll region in Tetmemena DNA (Figure 6A).

MDAR checklist

Data availability

The following datasets were generated:

Feng Y, Neme R, Beh LY, Chen X, Braun J, Lu MW, Landweber LF. 2022. Euplotes woodruffi genome sequencing and assembly. NCBI BioProject. PRJNA781979

Feng Y, Neme R, Beh LY, Chen X, Braun J, Lu MW, Landweber LF. 2022. Tetmemena sp. micronucleus genome sequencing and assembly. NCBI BioProject. PRJNA694964

Feng Y, Neme R, Beh LY, Chen X, Braun J, Lu MW, Landweber LF. 2022. Euplotes woodruffi strain:Iz01. NCBI BioProject. PRJNA781602

Feng Y, Neme R, Beh LY, Chen X, Braun J, Lu MW, Landweber LF. 2022. RNA-seq of Tetmemena sp. NCBI BioProject. PRJNA887426

Channagiri T, Braun J, Feng Y, Landweber LF. 2022. MDS-IES database. MDSIESDB. db/2022/downloads

Feng Y, Neme R, Beh L, Chen X, Braun J, Lu M, Landweber L. 2022. MDS and IES annotations for Euplotes woodruffi, Tetmemena sp. and Oxytricha trifallax. Dryad Digital Repository.

Feng Y, Neme R, Beh LY, Chen X, Braun J, Lu MW, Landweber LF. 2022. Tetmemena sp. Micronucleus genome. NCBI GenBank. JAJKFJ000000000

Feng Y, Neme R, Beh LY, Chen X, Braun J, Lu MW, Landweber LF. 2022. Euplotes woodruffi Micronucleus genome. NCBI GenBank. JAJLLS000000000

Feng Y, Neme R, Beh LY, Chen X, Braun J, Lu MW, Landweber LF. 2022. Euplotes woodruffi Macronucleus genome. NCBI GenBank. JAJLLT000000000

The following previously published datasets were used:

Chen X, Bracht JR, Goldman AD, Dolzhenko E, Clay DM, Swart EC, Perlman DH, Doak TG, Stuart A, Amemiya CT, Sebra RP, Landweber LF. 2014. Oxytricha trifallax micronucleus genome. NCBI Assembly. GCA_000711775.1

Swart EC, Bracht JR, Magrini V, Minx P, Chen X, Zhou Y, Khurana JS, Goldman AD, Nowacki M, Schotanus K, Jung S, Ly A, McGrath S, Haub K, Wiggins JL, Storton D, Matese JC, Parsons L, Chang WJ, Bowen MS, Stover NA, Jones TA, Eddy SR, Herrick GA, Doak TG, Wilson RK, Mardis ER, Landweber LF. 2013. Oxytricha trifallax macronucleus genome. NCBI Assembly. GCA_000295675.1

Chen X, Jung S, Beh LY, Eddy SR, Landweber LF. 2015. Tetmemena sp. macronucleus genome. NCBI Assembly. GCA_001273295.2

Beh LY, Debelouchina GT, Clay DM, Thompson RE, Lindblad KA, Hutton ER, Bracht JR, Sebra RP, Muir TW, Landweber LF. 2019. Genome-wide analysis of chromatin and transcription in the ciliates Oxytricha trifallax and Tetrahymena thermophila. NCBI Gene Expression Omnibus. GSE94421

References

Akematsu T, Fukuda Y, Garg J, Fillingham JS, Pearlman RE, Loidl J. Post-meiotic DNA double-strand breaks occur in Tetrahymena, and require topoisomerase II and SPO11. eLife. 2017;6:e26176. doi: 10.7554/eLife.26176. [DOI] [PMC free article] [PubMed] [Google Scholar]
Albertin CB, Medina-Ruiz S, Mitros T, Schmidbaur H, Sanchez G, Wang ZY, Grimwood J, Rosenthal JJC, Ragsdale CW, Simakov O, Rokhsar DS. Genome and transcriptome mechanisms driving cephalopod evolution. Nature Communications. 2022;13:2427. doi: 10.1038/s41467-022-29748-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
Arnaiz O, Mathy N, Baudry C, Malinsky S, Aury JM, Denby Wilkes C, Garnier O, Labadie K, Lauderdale BE, Le Mouël A, Marmignon A, Nowacki M, Poulain J, Prajer M, Wincker P, Meyer E, Duharcourt S, Duret L, Bétermier M, Sperling L. The Paramecium germline genome provides a niche for intragenic parasitic DNA: evolutionary dynamics of internal eliminated sequences. PLOS Genetics. 2012;8:e1002984. doi: 10.1371/journal.pgen.1002984. [DOI] [PMC free article] [PubMed] [Google Scholar]
Aury JM, Jaillon O, Duret L, Noel B, Jubin C, Porcel BM, Ségurens B, Daubin V, Anthouard V, Aiach N, Arnaiz O, Billaut A, Beisson J, Blanc I, Bouhouche K, Câmara F, Duharcourt S, Guigo R, Gogendeau D, Katinka M, Keller AM, Kissmehl R, Klotz C, Koll F, Le Mouël A, Lepère G, Malinsky S, Nowacki M, Nowak JK, Plattner H, Poulain J, Ruiz F, Serrano V, Zagulski M, Dessen P, Bétermier M, Weissenbach J, Scarpelli C, Schächter V, Sperling L, Meyer E, Cohen J, Wincker P. Global trends of whole-genome duplications revealed by the ciliate Paramecium tetraurelia. Nature. 2006;444:171–178. doi: 10.1038/nature05230. [DOI] [PubMed] [Google Scholar]
Baird SE, Fino GM, Tausta SL, Klobutcher LA. Micronuclear genome organization in Euplotes crassus: a transposonlike element is removed during macronuclear development. Molecular and Cellular Biology. 1989;9:3793–3807. doi: 10.1128/mcb.9.9.3793-3807.1989. [DOI] [PMC free article] [PubMed] [Google Scholar]
Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, Pyshkin AV, Sirotkin AV, Vyahhi N, Tesler G, Alekseyev MA, Pevzner PA. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology. 2012;19:455–477. doi: 10.1089/cmb.2012.0021. [DOI] [PMC free article] [PubMed] [Google Scholar]
Beh LY, Debelouchina GT, Clay DM, Thompson RE, Lindblad KA, Hutton ER, Bracht JR, Sebra RP, Muir TW, Landweber LF. Identification of a DNA N6-adenine methyltransferase complex and its impact on chromatin organization. Cell. 2019;177:1781–1796. doi: 10.1016/j.cell.2019.04.028. [DOI] [PMC free article] [PubMed] [Google Scholar]
Biederman MK, Nelson MM, Asalone KC, Pedersen AL, Saldanha CJ, Bracht JR. Discovery of the first germline-restricted gene by subtractive transcriptomic analysis in the zebra finch, Taeniopygia guttata. Current Biology. 2018;28:1620–1627. doi: 10.1016/j.cub.2018.03.067. [DOI] [PMC free article] [PubMed] [Google Scholar]
Boscaro V, Husnik F, Vannini C, Keeling PJ. Symbionts of the ciliate Euplotes: diversity, patterns and potential as models for bacteria-eukaryote endosymbioses. Proceedings of the Royal Society B: Biological Sciences. 2019;286:20190693. doi: 10.1098/rspb.2019.0693. [DOI] [PMC free article] [PubMed] [Google Scholar]
Bracht JR, Perlman DH, Landweber LF. Cytosine methylation and hydroxymethylation mark DNA for elimination in Oxytricha trifallax. Genome Biology. 2012;13:1–23. doi: 10.1186/gb-2012-13-10-r99. [DOI] [PMC free article] [PubMed] [Google Scholar]
Bracht JR, Fang W, Goldman AD, Dolzhenko E, Stein EM, Landweber LF. Genomes on the edge: programmed genome instability in ciliates. Cell. 2013;152:406–416. doi: 10.1016/j.cell.2013.01.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
Braun J, Nabergall L, Neme R, Landweber LF, Saito M, Jonoska N. Russian doll genes and complex chromosome rearrangements in Oxytricha trifallax. G3: Genes, Genomes, Genetics. 2018;8:1669–1674. doi: 10.1534/g3.118.200176. [DOI] [PMC free article] [PubMed] [Google Scholar]
Braun J, Neme R, Feng Y, Landweber LF, Jonoska N. SDRAP for Annotating Scrambled or Rearranged Genomes. bioRxiv. 2022 doi: 10.1101/2022.10.24.513505. [DOI] [PMC free article] [PubMed]
Brůna T, Hoff KJ, Lomsadze A, Stanke M, Borodovsky M. BRAKER2: automatic eukaryotic genome annotation with genemark-EP+ and AUGUSTUS supported by a protein database. NAR Genomics and Bioinformatics. 2021;3:lqaa108. doi: 10.1093/nargab/lqaa108. [DOI] [PMC free article] [PubMed] [Google Scholar]
Burns J, Kukushkin D, Chen X, Landweber LF, Saito M, Jonoska N. Recurring patterns among scrambled genes in the encrypted genome of the ciliate Oxytricha trifallax. Journal of Theoretical Biology. 2016;410:171–180. doi: 10.1016/j.jtbi.2016.08.038. [DOI] [PMC free article] [PubMed] [Google Scholar]
Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10:1–9. doi: 10.1186/1471-2105-10-421. [DOI] [PMC free article] [PubMed] [Google Scholar]
Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T. TrimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 2009;25:1972–1973. doi: 10.1093/bioinformatics/btp348. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chakraborty M, Baldwin-Brown JG, Long AD, Emerson JJ. Contiguous and accurate de novo assembly of metazoan genomes with modest long read coverage. Nucleic Acids Research. 2016;44:e147. doi: 10.1093/nar/gkw654. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chang WJ, Bryson PD, Liang H, Shin MK, Landweber LF. The evolutionary origin of a complex scrambled gene. PNAS. 2005;102:15149–15154. doi: 10.1073/pnas.0507682102. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chen X, Bracht JR, Goldman AD, Dolzhenko E, Clay DM, Swart EC, Perlman DH, Doak TG, Stuart A, Amemiya CT, Sebra RP, Landweber LF. The architecture of a scrambled genome reveals massive levels of genomic rearrangement during development. Cell. 2014;158:1187–1198. doi: 10.1016/j.cell.2014.07.034. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chen X, Jung S, Beh LY, Eddy SR, Landweber LF. Combinatorial DNA rearrangement facilitates the origin of new genes in ciliates. Genome Biology and Evolution. 2015;7:2859–2870. doi: 10.1093/gbe/evv172. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chen X, Landweber LF. Phylogenomic analysis reveals genome-wide purifying selection on TBE transposons in the ciliate Oxytricha. Mobile DNA. 2016;7:2. doi: 10.1186/s13100-016-0057-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chen X, Jiang Y, Gao F, Zheng W, Krock TJ, Stover NA, Lu C, Katz LA, Song W. Genome analyses of the new model protist Euplotes vannus focusing on genome rearrangement and resistance to environmental stressors. Molecular Ecology Resources. 2019;19:1292–1308. doi: 10.1111/1755-0998.13023. [DOI] [PMC free article] [PubMed] [Google Scholar]
Chen W, Zuo C, Wang C, Zhang T, Lyu L, Qiao Y, Zhao F, Miao M. The hidden genomic diversity of ciliated protists revealed by single-cell genome sequencing. BMC Biology. 2021;19:264. doi: 10.1186/s12915-021-01202-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
Cummings DJ, Tait A, Goddard JM. Methylated bases in DNA from Paramecium aurelia. Biochimica et Biophysica Acta - Nucleic Acids and Protein Synthesis. 1974;374:1–11. doi: 10.1016/0005-2787(74)90194-4. [DOI] [PubMed] [Google Scholar]
Denby Wilkes C, Arnaiz O, Sperling L. ParTIES: a toolbox for Paramecium interspersed DNA elimination studies. Bioinformatics. 2016;32:599–601. doi: 10.1093/bioinformatics/btv691.
Doak TG, Witherspoon DJ, Jahn CL, Herrick G. Selection on the genes of Euplotes crassus Tec1 and Tec2 transposons: evolutionary appearance of a programmed frameshift in a Tec2 gene encoding a tyrosine family site-specific recombinase. Eukaryotic Cell. 2003;2:95–102. doi: 10.1128/EC.2.1.95-102.2003.
DuBois MI, Prescott DM. Scrambling of the actin I gene in two Oxytricha species. PNAS. 1995;92:3888–3892. doi: 10.1073/pnas.92.9.3888.
Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research. 2004;32:1792–1797. doi: 10.1093/nar/gkh340.
Eisen JA, Coyne RS, Wu M, Wu D, Thiagarajan M, Wortman JR, Badger JH, Ren Q, Amedeo P, Jones KM, Tallon LJ, Delcher AL, Salzberg SL, Silva JC, Haas BJ, Majoros WH, Farzad M, Carlton JM, Smith RK, Garg J, Pearlman RE, Karrer KM, Sun L, Manning G, Elde NC, Turkewitz AP, Asai DJ, Wilkes DE, Wang Y, Cai H, Collins K, Stewart BA, Lee SR, Wilamowska K, Weinberg Z, Ruzzo WL, Wloga D, Gaertig J, Frankel J, Tsao CC, Gorovsky MA, Keeling PJ, Waller RF, Patron NJ, Cherry JM, Stover NA, Krieger CJ, del Toro C, Ryder HF, Williamson SC, Barbeau RA, Hamilton EP, Orias E. Macronuclear genome sequence of the ciliate Tetrahymena thermophila, a model eukaryote. PLOS Biology. 2006;4:e286. doi: 10.1371/journal.pbio.0040286.
Elliott TA, Gregory TR. What’s in a genome? The C-value enigma and the evolution of eukaryotic genome content. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences. 2015;370:20140331. doi: 10.1098/rstb.2014.0331.
Emms DM, Kelly S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biology. 2019;20:238. doi: 10.1186/s13059-019-1832-y.
Feng Y, Beh LY, Chang WJ, Landweber LF. SIGAR: inferring features of genome architecture and DNA rearrangements by split-read mapping. Genome Biology and Evolution. 2020;12:1711–1718. doi: 10.1093/gbe/evaa147.
Feng Y, Landweber LF. Transposon debris in ciliate genomes. PLOS Biology. 2021;19:e3001354. doi: 10.1371/journal.pbio.3001354.
Feng Y. MAC genome telomere capping script. GitHub. 871eb00. 2022a. https://github.com/yifeng-evo/Oxytricha_Tetmemena_Euplotes/tree/main/MAC_genome_telomere_capping
Feng Y. Oxytricha_Tetmemena_Euplotes. Software Heritage. swh:1:rev:fd66a0efeaf9feb2d79e183313192d641b4e5400. 2022b. https://archive.softwareheritage.org/swh:1:dir:8ad132d58c3073da701bdde6700a37e2cdc01509;origin=https://github.com/yifeng-evo/Oxytricha_Tetmemena_Euplotes;visit=swh:1:snp:3e53ca9f9f0b0bc48a5c56d379e0def68cce596f;anchor=swh:1:rev:fd66a0efeaf9feb2d79e183313192d641b4e5400
Finn RD, Clements J, Eddy SR. HMMER web server: interactive sequence similarity searching. Nucleic Acids Research. 2011;39:W29–W37. doi: 10.1093/nar/gkr367.
Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28:3150–3152. doi: 10.1093/bioinformatics/bts565.
Gao F, Song W, Katz LA. Genome structure drives patterns of gene family evolution in ciliates, a case study using Chilodonella uncinata (Protista, Ciliophora, Phyllopharyngea). Evolution; International Journal of Organic Evolution. 2014;68:2287–2295. doi: 10.1111/evo.12430.
Gao F, Roy SW, Katz LA. Analyses of alternatively processed genes in ciliates provide insights into the origins of scrambled genomes and may provide a mechanism for speciation. MBio. 2015;6:e01998-14. doi: 10.1128/mBio.01998-14.
Gao F, Warren A, Zhang Q, Gong J, Miao M, Sun P, Xu D, Huang J, Yi Z, Song W. The all-data-based evolutionary hypothesis of ciliated protists with a revised classification of the phylum Ciliophora (Eukaryota, Alveolata). Scientific Reports. 2016;6:24874. doi: 10.1038/srep24874.
Gorovsky MA, Hattman S, Pleger GL. (6 N) methyl adenine in the nuclear DNA of a eucaryote, Tetrahymena pyriformis. The Journal of Cell Biology. 1973;56:697–701. doi: 10.1083/jcb.56.3.697.
Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, Chen Z, Mauceli E, Hacohen N, Gnirke A, Rhind N, di Palma F, Birren BW, Nusbaum C, Lindblad-Toh K, Friedman N, Regev A. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nature Biotechnology. 2011;29:644–652. doi: 10.1038/nbt.1883.
Guérin F, Arnaiz O, Boggetto N, Denby Wilkes C, Meyer E, Sperling L, Duharcourt S. Flow cytometry sorting of nuclei enables the first global characterization of Paramecium germline DNA and transposable elements. BMC Genomics. 2017;18:327. doi: 10.1186/s12864-017-3713-7.
Guindon S, Dufayard JF, Lefort V, Anisimova M, Hordijk W, Gascuel O. New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Systematic Biology. 2010;59:307–321. doi: 10.1093/sysbio/syq010.
Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK, Hannick LI, Maiti R, Ronning CM, Rusch DB, Town CD, Salzberg SL, White O. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Research. 2003;31:5654–5666. doi: 10.1093/nar/gkg770.
Haas BJ, Salzberg SL, Zhu W, Pertea M, Allen JE, Orvis J, White O, Buell CR, Wortman JR. Automated eukaryotic gene structure annotation using EVidenceModeler and the program to assemble spliced alignments. Genome Biology. 2008;9:1–22. doi: 10.1186/gb-2008-9-1-r7.
Hamilton EP, Kapusta A, Huvos PE, Bidwell SL, Zafar N, Tang H, Hadjithomas M, Krishnakumar V, Badger JH, Caler EV, Russ C. Structure of the germline genome of Tetrahymena thermophila and relationship to the massively rearranged somatic genome. eLife. 2016;5:e19090. doi: 10.7554/eLife.19090.
Hoffman DC, Prescott DM. Evolution of internal eliminated segments and scrambling in the micronuclear gene encoding DNA polymerase alpha in two Oxytricha species. Nucleic Acids Research. 1997;25:1883–1889. doi: 10.1093/nar/25.10.1883.
Hogan DJ, Hewitt EA, Orr KE, Prescott DM, Müller KM. Evolution of IESs and scrambling in the actin I gene in hypotrichous ciliates. PNAS. 2001;98:15101–15106. doi: 10.1073/pnas.011578598.
Huang X, Madan A. CAP3: a DNA sequence assembly program. Genome Research. 1999;9:868–877. doi: 10.1101/gr.9.9.868.
Jahn CL, Krikau MF, Shyman S. Developmentally coordinated en masse excision of a highly repetitive element in E. crassus. Cell. 1989;59:1009–1018. doi: 10.1016/0092-8674(89)90757-5.
Jahn CL, Doktor SZ, Frels JS, Jaraczewski JW, Krikau MF. Structures of the Euplotes crassus Tec1 and Tec2 elements: identification of putative transposase coding regions. Gene. 1993;133:71–78. doi: 10.1016/0378-1119(93)90226-s.
Katz LA, Kovner AM. Alternative processing of scrambled genes generates protein diversity in the ciliate Chilodonella uncinata. Journal of Experimental Zoology. Part B, Molecular and Developmental Evolution. 2010;314:480–488. doi: 10.1002/jez.b.21354.
Kent WJ. BLAT -- the BLAST-like alignment tool. Genome Research. 2002;12:656–664. doi: 10.1101/gr.229202.
Klobutcher LA, Herrick G. Consensus inverted terminal repeat sequence of Paramecium IESs: resemblance to termini of Tc1-related and Euplotes Tec transposons. Nucleic Acids Research. 1995;23:2006–2013. doi: 10.1093/nar/23.11.2006.
Klobutcher LA, Herrick GL. Developmental genome reorganization in ciliated protozoa: the transposon link. Progress in Nucleic Acid Research and Molecular Biology. 1997;56:1–62. doi: 10.1016/s0079-6603(08)61001-6.
Kolmogorov M, Yuan J, Lin Y, Pevzner PA. Assembly of long, error-prone reads using repeat graphs. Nature Biotechnology. 2019;37:540–546. doi: 10.1038/s41587-019-0072-8.
Krikau MF, Jahn CL. Tec2, a second transposon-like element demonstrating developmentally programmed excision in Euplotes crassus. Molecular and Cellular Biology. 1991;11:4751–4759. doi: 10.1128/mcb.11.9.4751-4759.1991.
Landweber LF, Kuo TC, Curtis EA. Evolution and assembly of an extremely scrambled gene. PNAS. 2000;97:3298–3303. doi: 10.1073/pnas.97.7.3298.
Landweber LF. Why genomes in pieces? Science. 2007;318:405–407. doi: 10.1126/science.1150280.
Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nature Methods. 2012;9:357–359. doi: 10.1038/nmeth.1923.
Lauth MR, Spear BB, Heumann J, Prescott DM. DNA of ciliated protozoa: DNA sequence diminution during macronuclear development of Oxytricha. Cell. 1976;7:67–74. doi: 10.1016/0092-8674(76)90256-7.
Lindblad KA, Bracht JR, Williams AE, Landweber LF. Thousands of RNA-cached copies of whole chromosomes are present in the ciliate Oxytricha during development. RNA. 2017;23:1200–1208. doi: 10.1261/rna.058511.116.
Lindblad KA, Pathmanathan JS, Moreira S, Bracht JR, Sebra RP, Hutton ER, Landweber LF. Capture of complete ciliate chromosomes in single sequencing reads reveals widespread chromosome isoforms. BMC Genomics. 2019;20:1037. doi: 10.1186/s12864-019-6189-9.
Lowe TM, Eddy SR. tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Research. 1997;25:955–964. doi: 10.1093/nar/25.5.955.
Lynn D. The Ciliated Protozoa: Characterization, Classification, and Guide to the Literature. Springer Science & Business Media; 2008.
Manni M, Berkeley MR, Seppey M, Simão FA, Zdobnov EM. BUSCO update: novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes. Molecular Biology and Evolution. 2021;38:4647–4654. doi: 10.1093/molbev/msab199.
Maurer-Alcalá XX, Knight R, Katz LA. Exploration of the germline genome of the ciliate Chilodonella uncinata through single-cell omics (transcriptomics and genomics). MBio. 2018a;9:e01836-17. doi: 10.1128/mBio.01836-17.
Maurer-Alcalá XX, Yan Y, Pilling OA, Knight R, Katz LA. Twisted tales: insights into genome diversity of ciliates using single-cell ’omics. Genome Biology and Evolution. 2018b;10:1927–1939. doi: 10.1093/gbe/evy133.
Menzel P, Ng KL, Krogh A. Fast and sensitive taxonomic classification for metagenomics with Kaiju. Nature Communications. 2016;7:11257. doi: 10.1038/ncomms11257.
Meyer F, Schmidt HJ, Plümper E, Hasilik A, Mersmann G, Meyer HE, Engström A, Heckmann K. UGA is translated as cysteine in pheromone 3 of Euplotes octocarinatus. PNAS. 1991;88:3758–3761. doi: 10.1073/pnas.88.9.3758.
Miller RV, Neme R, Clay DM, Pathmanathan JS, Lu MW, Yerlici VT, Khurana JS, Landweber LF. Transcribed germline-limited coding sequences in Oxytricha trifallax. G3. 2021;11:jkab092. doi: 10.1093/g3journal/jkab092.
Mitcham JL, Lynn AJ, Prescott DM. Analysis of a scrambled gene: the gene encoding alpha-telomere-binding protein in Oxytricha nova. Genes & Development. 1992;6:788–800. doi: 10.1101/gad.6.5.788.
Mitreva M, Blaxter ML, Bird DM, McCarter JP. Comparative genomics of nematodes. Trends in Genetics. 2005;21:573–581. doi: 10.1016/j.tig.2005.08.003.
Möllenbeck M, Cavalcanti ARO, Jönsson F, Lipps HJ, Landweber LF. Interconversion of germline-limited and somatic DNA in a scrambled gene. Journal of Molecular Evolution. 2006;63:69–73. doi: 10.1007/s00239-005-0166-4.
Morgan JT, Fink GR, Bartel DP. Excised linear introns regulate growth in yeast. Nature. 2019;565:606–611. doi: 10.1038/s41586-018-0828-1.
Mourier T, Jeffares DC. Eukaryotic intron loss. Science. 2003;300:1393. doi: 10.1126/science.1080559.
Nowacki M, Vijayan V, Zhou Y, Schotanus K, Doak TG, Landweber LF. RNA-mediated epigenetic programming of a genome-rearrangement pathway. Nature. 2008;451:153–158. doi: 10.1038/nature06452.
Nowacki M, Higgins BP, Maquilan GM, Swart EC, Doak TG, Landweber LF. A functional role for transposases in a large eukaryotic genome. Science. 2009;324:935–938. doi: 10.1126/science.1170023.
Parenteau J, Maignon L, Berthoumieux M, Catala M, Gagnon V, Abou Elela S. Introns are mediators of cell response to starvation. Nature. 2019;565:612–617. doi: 10.1038/s41586-018-0859-7.
Parfrey LW, Lahr DJG, Knoll AH, Katz LA. Estimating the timing of early eukaryotic diversification with multigene molecular clocks. PNAS. 2011;108:13624–13629. doi: 10.1073/pnas.1110633108.
Prescott DM. The DNA of ciliated protozoa. Microbiological Reviews. 1994;58:233–267. doi: 10.1128/mr.58.2.233-267.1994.
Prescott JD, DuBois ML, Prescott DM. Evolution of the scrambled germline gene encoding alpha-telomere binding protein in three hypotrichous ciliates. Chromosoma. 1998;107:293–303. doi: 10.1007/s004120050311.
Riley JL, Katz LA. Widespread distribution of extensive chromosomal fragmentation in ciliates. Molecular Biology and Evolution. 2001;18:1372–1377. doi: 10.1093/oxfordjournals.molbev.a003921.
Rogozin IB, Wolf YI, Sorokin AV, Mirkin BG, Koonin EV. Remarkable interkingdom conservation of intron positions and massive, lineage-specific intron loss and gain in eukaryotic evolution. Current Biology. 2003;13:1512–1517. doi: 10.1016/s0960-9822(03)00558-x.
Ruan J, Li H. Fast and accurate long-read assembly with wtdbg2. Nature Methods. 2020;17:155–158. doi: 10.1038/s41592-019-0669-3.
Schmidbaur H, Kawaguchi A, Clarence T, Fu X, Hoang OP, Zimmermann B, Ritschard EA, Weissenbacher A, Foster JS, Nyholm SV, Bates PA, Albertin CB, Tanaka E, Simakov O. Emergence of novel cephalopod gene regulation and expression through large-scale genome reorganization. Nature Communications. 2022;13:2172. doi: 10.1038/s41467-022-29694-7.
Seah BKB, Swart EC, Alkan C. BleTIES: annotation of natural genome editing in ciliates using long read sequencing. Bioinformatics. 2021;37:3929–3931. doi: 10.1093/bioinformatics/btab613.
Sellis D, Guérin F, Arnaiz O, Pett W, Lerat E, Boggetto N, Krenek S, Berendonk T, Couloux A, Aury JM, Labadie K, Malinsky S, Bhullar S, Meyer E, Sperling L, Duret L, Duharcourt S. Massive colonization of protein-coding exons by selfish genetic elements in Paramecium germline genomes. PLOS Biology. 2021;19:e3001309. doi: 10.1371/journal.pbio.3001309.
Sheng Y, Duan L, Cheng T, Qiao Y, Stover NA, Gao S. The completed macronuclear genome of a model ciliate Tetrahymena thermophila and its application in genome scrambling and copy number analyses. Science China. Life Sciences. 2020;63:1534–1542. doi: 10.1007/s11427-020-1689-4.
Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Söding J, Thompson JD, Higgins DG. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology. 2011;7:539. doi: 10.1038/msb.2011.75.
Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31:3210–3212. doi: 10.1093/bioinformatics/btv351.
Slater GSC, Birney E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics. 2005;6:31. doi: 10.1186/1471-2105-6-31.
Smit AF, Hubley R. RepeatModeler Open-1.0. 2008. [September 23, 2020]. http://www.repeatmasker.org
Smit AF, Hubley R, Green P. RepeatMasker Open-4.0. 2013. [September 23, 2020]. http://www.repeatmasker.org
Smith JJ, Baker C, Eichler EE, Amemiya CT. Genetic consequences of programmed genome rearrangement. Current Biology. 2012;22:1524–1529. doi: 10.1016/j.cub.2012.06.028.
Smith SA, Maurer-Alcalá XX, Yan Y, Katz LA, Santoferrara LF, McManus GB. Combined genome and transcriptome analyses of the ciliate Schmidingerella arcuata (Spirotrichea) reveal patterns of DNA elimination, scrambling, and inversion. Genome Biology and Evolution. 2020;12:1616–1622. doi: 10.1093/gbe/evaa185.
Speijer D. Making sense of scrambled genomes. Science. 2008;319:901–902. doi: 10.1126/science.319.5865.901a.
Suyama M, Torrents D, Bork P. PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Research. 2006;34:W609–W612. doi: 10.1093/nar/gkl315.
Swart EC, Bracht JR, Magrini V, Minx P, Chen X, Zhou Y, Khurana JS, Goldman AD, Nowacki M, Schotanus K, Jung S, Fulton RS, Ly A, McGrath S, Haub K, Wiggins JL, Storton D, Matese JC, Parsons L, Chang WJ, Bowen MS, Stover NA, Jones TA, Eddy SR, Herrick GA, Doak TG, Wilson RK, Mardis ER, Landweber LF. The Oxytricha trifallax macronuclear genome: a complex eukaryotic genome with 16,000 tiny chromosomes. PLOS Biology. 2013;11:e1001473. doi: 10.1371/journal.pbio.1001473.
Syberg-Olsen MJ, Irwin NAT, Vannini C, Erra F, Di Giuseppe G, Boscaro V, Keeling PJ, Schubert M. Biogeography and character evolution of the ciliate genus Euplotes (Spirotrichea, Euplotia), with description of Euplotes curdsi sp. nov. PLOS ONE. 2016;11:e0165442. doi: 10.1371/journal.pone.0165442.
Tan M, Brünen-Nieweler C, Heckmann K. Isolation of micronuclei from Euplotes octocarinatus and identification of an internal eliminated sequence in the micronuclear gene encoding γ-tubulin 2. European Journal of Protistology. 1999;35:208–216. doi: 10.1016/S0932-4739(99)80039-X.
Thomas CA. The genetic organization of chromosomes. Annual Review of Genetics. 1971;5:237–256. doi: 10.1146/annurev.ge.05.120171.001321.
Vinogradov DV, Tsoĭ OV, Zaika AV, Lobanov AV, Turanov AA, Gladyshev VN, Gel’fand MS. Draft macronuclear genome of a ciliate Euplotes crassus. Molekuliarnaia Biologiia. 2012;46:361–366. doi: 10.1134/S0026893312020197.
Walker BJ, Abeel T, Shea T, Priest M, Abouelliel A, Sakthikumar S, Cuomo CA, Zeng Q, Wortman J, Young SK, Earl AM. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLOS ONE. 2014;9:e112963. doi: 10.1371/journal.pone.0112963.
Wang W, Zhi H, Chai B, Liang A. Cloning and sequence analysis of the micronuclear and macronuclear gene encoding Rab protein of Euplotes octocarinatus. Bioscience, Biotechnology, and Biochemistry. 2005;69:649–652. doi: 10.1271/bbb.69.649.
Wang R, Xiong J, Wang W, Miao W, Liang A. High frequency of +1 programmed ribosomal frameshifting in Euplotes octocarinatus. Scientific Reports. 2016;6:21139. doi: 10.1038/srep21139.
Wang JR, Holt J, McMillan L, Jones CD. FMLRC: hybrid long read error correction using an FM-index. BMC Bioinformatics. 2018;19:50. doi: 10.1186/s12859-018-2051-3.
Wong LC, Landweber LF. Evolution of programmed DNA rearrangements in a scrambled gene. Molecular Biology and Evolution. 2006;23:756–763. doi: 10.1093/molbev/msj089.
Yerlici VT, Landweber LF. Programmed genome rearrangements in the ciliate Oxytricha. Microbiology Spectrum. 2014;2:2–6. doi: 10.1128/microbiolspec.MDNA3-0025-2014.
Zheng W, Chen J, Doak TG, Song W, Yan Y. ADFinder: accurate detection of programmed DNA elimination using NGS high-throughput sequencing data. Bioinformatics. 2020;36:3632–3636. doi: 10.1093/bioinformatics/btaa226.

eLife. doi: 10.7554/eLife.82979.sa0

Editor's evaluation

Detlef Weigel ¹

The study marks a significant advance in the field of evolutionary genomics of ciliates, an ancient and highly diverse eukaryotic phylum with many idiosyncrasies that teach us valuable lessons, inter alia, about sex and the plasticity of genomes. By focusing on two species from the same family, plus a more distant outgroup within the same class, this valuable study provides new and compelling information on evolutionary trends of genome rearrangement among different species of this interesting group of organisms. The work will be of interest to anyone interested in genome dynamics.

eLife. doi: 10.7554/eLife.82979.sa1

Decision letter

Editor: Detlef Weigel¹

In the interests of transparency, eLife publishes the most substantive revision requests and the accompanying author responses.

[Editors' note: this paper was reviewed by Review Commons.]

Thank you for submitting your article "Comparative genomics reveals insight into the evolutionary origin of massively scrambled genomes" for consideration by eLife. Your article has been reviewed by 2 peer reviewers at Review Commons, and the evaluation at eLife has been overseen by Detlef Weigel as Deputy Editor in discussion with several Senior and Reviewing Editors and an outside expert.

Based on your manuscript, the reviews and your responses, we invite you to submit a revised version incorporating the revisions as outlined in your response to the reviews.

When preparing your revisions, please also address the following points:

A major concern is that the genome assemblies are poor. Much of this is probably because of the technical challenges associated with this system, but BUSCO scores of 76% and N50 <30 kb for the MICs are red flags. This does make one wonder how confident one can be that some of the conclusions, such as those pertaining to paralogs, are not at least in part due to assembly artifacts. It is striking that most genes undergoing unscrambling have no orthologs in other species, and that it is thus unclear what these genes actually do. Please provide additional evidence that the conclusions are unaffected by the limitations of the assemblies. Please also provide more statistics regarding the assemblies (e.g., expected length distribution of nanochromosomes in MACs, number of chromosomes in MICs vs contigs, etc.), and annotations (gene families -- how similar are they between species, how many "species-specific" genes [which the genomics community considers to be largely, though not entirely artifactual] etc.).

Please spell out more clearly what you consider the evolutionary forces that have increased scrambling. My reading is that you consider many of the scrambled genes not to be functionally very important, which would be in line with there being so few orthologs in other species. To this end, please provide some basic information as to how many of the scrambled genes have expression support, what the distribution of expression quantiles is etc. -- this could help to shore up these inferences.

Finally, the manuscript is overly long, and despite, or perhaps because of, this, it is not presented in the most accessible manner. For example, the abstract still struggles to explain the importance or relevance of the work; e.g., the last sentence is "Scrambled loci are more often associated with local duplications, supporting a simple model for the origin of scrambled genes via DNA duplication and decay". This would need some motivation, i.e., why should we care about scrambled genes per se? Similarly, why does the Introduction have an almost 1-page summary of the results?

eLife. 2022 Nov 24;11:e82979. doi: 10.7554/eLife.82979.sa2

Author response

Reviewer #1 (Evidence, reproducibility and clarity (Required)):

Summary:

Ciliates extensively rearrange their somatic genome every time a new somatic nucleus develops from the zygotic germline nucleus. In this manuscript, Feng et al. report the sequencing, assembly and annotation of the germline and somatic genomes of Euplotes woodruffi and the germline genome of Tetmemena sp. (whose somatic genome was sequenced and assembled by the same lab in 2015). They present a comparative analysis of developmentally programmed genome rearrangements in these two species and in the model ciliate Oxytricha trifallax. Their major findings are that:

(i) E. woodruffi and Tetmemena sp. eliminate a smaller fraction of their germline genome (~54%) from their somatic macronucleus (MAC) than O. trifallax (>80%)

(ii) Transposable elements (TE) represent a smaller fraction of the germline genome (~2%) in the first two ciliates than in O. trifallax (~15%). TEs are mainly located at the boundaries of germline chromosomes and in intergenic regions, but can also be found inside IESs

(iii) Several thousands of genes are scrambled in the germline genome of all three species

The authors have also addressed the possible origin of gene scrambling. They report an interesting association with local paralogy and propose a model for the emergence of the odd-even pattern of gene unscrambling between two paralogous copies.

Major comments:

1. Based on the statistics presented in Table 1, genome assemblies are of good quality, with a reasonable N50 size of germline (MIC) contigs. It seems, however, that no entire MIC chromosome could be assembled, since no two-telomere contig is mentioned in the list. As proposed by the authors (p.7) the presence of numerous TEs at the boundaries of MIC contigs (Figure S1) may have hindered the assembly of MIC chromosome ends. I would have appreciated more information on the "other repeats" (which seem to differ from tandem repeats according to Figure 2) and their location along MIC contigs.

Subcategories of “other repeats” were included in Table S2 based on RepeatMasker annotations. We have now analyzed the locations of other repeats in MIC contigs and include those as well in the new Figure S1B. About 30% of “other” transposable elements are present at the boundaries of MIC contigs, which may also hinder the assembly. Notably, 35-45% of “other TEs” are in assembled, intergenic regions.

2. The definition of "Internal Eliminated Sequences" (IES) is not clear. The authors make a distinction between IESs and TEs. I understand that IESs are DNA segments that separate two macronuclear-destined sequences (MDS) in the germline genome. Thus they appear to be restricted to those regions that eventually yield gene-sized MAC chromosomes. IESs are eliminated between two pointers that may not be identical on both sides in case of scrambled genes. Some clarification is needed here.

To illustrate my point: I found the statement "with many TE insertions within IESs, suggesting that TE insertions may have generated IESs" particularly confusing (p. 9 lines 5-6). Does this mean that IESs extend beyond the ends of inserted TEs? The legend of Figure S1 should also be clarified.

We clarified the text and legend. IESs can extend beyond the ends of inserted TEs, even if the original IES is a decayed TE, due to subsequent sequence evolution at the boundaries or if the original insertion was into an existing IES. David Prescott referred to sequence evolution at the edges of IESs as “pointer sliding” (ref.36).

3. p. 10 lines 2-4 and Figure S2: Could the authors explain the difference they make between MDS (in the text) and CDS (in Figure S2)? My understanding is that a CDS is the entire gene coding sequence and may be made of multiple MDSs. If this is correct, the sentence should read "We compared the number of MDSs between single-copy orthologs for single-gene MAC chromosomes across the three species and found that the orthologs have similar CDS lengths".

Yes, we made the correction.

4. p. 12 lines 10-15: the discovery that paralogous MDSs can be found in scrambled genomic loci is interesting. If the two paralogs can be distinguished based on the number of substitutions, it would be informative to go back to individual reads and check whether each of the two copies can be incorporated in the unscrambled CDS (and at which frequency). Would the pointers be compatible with this?

The paralogous MDSs in the MIC are often not identical. The copy with the highest similarity is assigned as “preliminary match” by SDRAP (ref. 52), and others are assigned as “additional matches”. To validate SDRAP assignments, we did pairwise BLASTN alignments (“-task megablast”) of paralogous MIC MDSs and their corresponding MAC MDSs. We confirmed that in the three species, the preliminary match has the best or equally best pid (percentage of identity) in most cases. Therefore, the MDS assigned as preliminary match is more likely the paralog incorporated into the MAC chromosome.

We used genome assemblies of Euplotes woodruffi, which had the highest Nanopore coverage, to further investigate the frequency of MDS incorporation. We followed the reviewer’s suggestion and called SNP variants on both MAC and MIC genomes. For MAC SNP calling, we used Illumina reads as input for freebayes (ref a). For MIC SNP calling, we used Nanopore reads, instead of Illumina reads, to avoid non-specific short-read mapping on paralogous MDSs and to avoid the presence of any contaminating MAC reads. Variants were called and phased by PEPPER-Margin-DeepVariant (ref b), a new tool published in 2021 in Nature Methods, which has been reported to have similar accuracy to Illumina read variant calling, especially at high read coverage. We used the parameter “--pepper_min_coverage_threshold 20” to call confident variants when at least 20 reads cover the position. Only 92 MIC SNPs in the paralogous MDSs passed all filters of the program. Using this small set of MIC SNPs, we were unfortunately unable to distinguish which paralogous MIC MDS was incorporated into the MAC. Therefore, we cannot infer with what frequency one paralogous MDS is incorporated over another, until they become sufficiently diverged, which is compatible with the model.

a. Garrison E, Marth G. Haplotype-based variant detection from short-read sequencing. arXiv preprint arXiv:1207.3907. 2012.
b. Shafin K, Pesout T, Chang PC, Nattestad M, Kolesnikov A, Goel S, Baid G, Kolmogorov M, Eizenga JM, Miga KH, Carnevali P. Haplotype-aware variant calling with PEPPER-Margin-DeepVariant enables high accuracy in nanopore long-reads. Nature Methods. 2021;18:1322–1332.
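
The validation of SDRAP assignments described in this response (checking that the preliminary match has the best or equally best percentage of identity among paralogous MIC MDSs) can be sketched as follows. This is a minimal illustration and not the authors' script: the function name `preliminary_is_best`, the tuple layout of the hits, and the example identifiers are all assumptions, with hits imagined as (MAC MDS, MIC MDS, percent identity) triples taken from tabular BLASTN output.

```python
from collections import defaultdict

def preliminary_is_best(hits, preliminary, tol=1e-9):
    """Check whether each preliminary match has the best (or tied-best) identity.

    hits        -- iterable of (mac_mds_id, mic_mds_id, pident) tuples
                   (hypothetical layout, e.g. parsed from tabular BLASTN output)
    preliminary -- dict mapping mac_mds_id to the MIC MDS assigned as
                   "preliminary match"
    Returns a dict mac_mds_id -> bool for every MAC MDS that has hits.
    """
    best = defaultdict(float)  # best identity seen per MAC MDS
    pid = {}                   # best identity per (MAC MDS, MIC MDS) pair
    for mac, mic, pident in hits:
        pid[(mac, mic)] = max(pident, pid.get((mac, mic), 0.0))
        best[mac] = max(best[mac], pident)
    return {
        mac: pid.get((mac, mic), 0.0) >= best[mac] - tol
        for mac, mic in preliminary.items()
        if mac in best
    }

# Toy data: m1 has a clear best paralog, m2 is assigned a worse copy, m3 is a tie.
example_hits = [
    ("m1", "a", 99.0), ("m1", "b", 97.5),
    ("m2", "c", 95.0), ("m2", "d", 98.0),
    ("m3", "e", 100.0), ("m3", "f", 100.0),
]
example_preliminary = {"m1": "a", "m2": "c", "m3": "f"}
result = preliminary_is_best(example_hits, example_preliminary)
```

Ties count as agreement here, mirroring the "best or equally best" wording above; a stricter check would require the preliminary match to be uniquely best.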

5. The hypothesis that odd-even scrambled loci have evolved from paralogous genes in E. woodruffi is supported by the existence of paralogous MDSs, length conservation of MDS/IES pairs and sequence similarity between corresponding MDS and IES in a pair. The correlations presented for Oxytricha and Tetmemena are much less convincing (Figure S5D and E). I recommend that the authors be even more cautious in their statement on p.13 ("For Oxytricha and Tetmemena, the MDS and IES lengths for such MDS/IES pairs also correlate positively, but more moderately")

Thank you, we rephrased the text.

6. p. 15 last paragraph: Why did the authors focus only on TBEs inserted in non-scrambled IESs to look for orthologous TBE insertions? Is there a reason to believe that no recent TBE insertion occurred at other genomic loci? Or was it only for practical reasons? It is also not clear to me whether the authors have considered full-length TBEs or the presence of at least one TBE ORF.

This analysis was limited for practical reasons, because we identify position conservation of TBEs by aligning protein sequences of MAC genes. We only consider TBEs inserted in non-scrambled IESs in exons. It would be difficult and less meaningful to align completely non-coding MIC-limited regions.

Partial TBEs are also included if they contain at least one TBE ORF (detected by BLAST).

Furthermore, TE insertion cannot explain the origin of scrambled IESs, and TEs rarely map to scrambled IESs (Figure S1A), but there is a clear evolutionary model for the origin of nonscrambled IESs from decay of TBEs (ref. 49). Initial purifying selection would act on the TE to maintain its ability to self-excise, whereas we advocate for a different model for the origin of scrambled IESs by decay of paralogous MDSs.

7. p. 16: the authors report that some introns of E. woodruffi map "near" Oxytricha/Tetmemena pointers. How near? Based on the information provided by the authors, I don't think this observation necessarily implies that IESs were converted to introns (or reciprocally) during evolution. If this were true, shouldn't at least one intron boundary coincide exactly with a pointer? The authors should clarify this (also in the discussion, on p. 20, top paragraph).

We used a 20 bp window (~7 amino acids), as described in the Methods, and added that to the Results. Full detail is provided in the Methods section, “Ortholog comparison pipeline and Monte Carlo simulations”. 103 E. woodruffi introns are within 20 bp of the midpoint of Oxytricha/Tetmemena pointers. Among these, 43 intron boundaries overlap an Oxytricha or Tetmemena pointer. We observed 306 cases of precisely matching boundaries between any two species, where the exon junction of one species maps inside the MDS/IES pointer of another species. However, we would only expect the boundaries of introns and IESs to coincide so precisely if they were recent conversions; hence we feel that a window analysis is informative.
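
A window analysis of this kind can be sketched as follows. The sketch is illustrative only and is not the pipeline described in the Methods: it assumes that intron boundaries and pointer midpoints have already been projected onto a shared coordinate system (in bp), and the null model (uniform random placement of the same number of introns) as well as the names `count_near` and `monte_carlo_p` are hypothetical.

```python
import random
from bisect import bisect_left


def count_near(intron_positions, pointer_midpoints, window=20):
    """Count intron boundaries lying within `window` bp of any pointer midpoint."""
    mids = sorted(pointer_midpoints)
    hits = 0
    for pos in intron_positions:
        i = bisect_left(mids, pos)
        # Only the closest midpoint on each side can be the nearest one
        neighbours = mids[max(i - 1, 0):i + 1]
        if any(abs(pos - m) <= window for m in neighbours):
            hits += 1
    return hits


def monte_carlo_p(intron_positions, pointer_midpoints, length,
                  window=20, n_iter=1000, seed=0):
    """Estimate how often random intron placement matches or exceeds the observed count."""
    rng = random.Random(seed)
    observed = count_near(intron_positions, pointer_midpoints, window)
    at_least = 0
    for _ in range(n_iter):
        simulated = [rng.randrange(length) for _ in intron_positions]
        if count_near(simulated, pointer_midpoints, window) >= observed:
            at_least += 1
    # Add-one correction keeps the estimated p-value strictly positive
    return observed, (at_least + 1) / (n_iter + 1)


# Hypothetical coordinates: only the intron at 100 is within 20 bp of a midpoint (110)
observed, p_value = monte_carlo_p([100, 250, 400], [110, 300], length=1000)
print(observed, p_value)
```

Comparing the observed count with the simulated distribution indicates whether introns map near pointers more often than expected by chance under the assumed null model.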

8. p. 19 2nd paragraph: the suggested mechanism explaining the 5' bias of IESs in E. woodruffi genes is unclear. How could germline recombination take place between a MIC chromosome and a MAC reverse transcript or nanochromosome? This would imply that DNA could be imported in the MIC. Is there evidence that this might occur?

The ability of TEs to invade the MIC demonstrates that even foreign DNA can be incorporated into the MIC. Since MAC DNA is present at high copy number, it offers a potential source for a recombination template that could erase IESs, as could an errant reverse transcript of one of the long noncoding template RNAs. Any of these would be infrequent events that would matter on an evolutionary time scale even if developmentally rare.

9. According to Figure 1, no scrambled genes have been reported in Paramecium tetraurelia. Within the frame of the proposed model, this is somewhat unexpected because this ciliate went through several whole genome duplications during evolution and harbors many paralogous gene pairs. Is there a reason why no gene scrambling took place in Paramecium?

Paramecium uses only TA dinucleotide pointers for IES elimination, unlike the rich diversity of pointers in spirotrichous ciliates. This limitation in its machinery may explain why no scrambled loci have been observed in Paramecium, despite the abundance of paralogs. Our model suggests that local MIC paralogy is associated with the origin of scrambling. But most of the paralogy reported in Paramecium is at the level of whole chromosomes in the MAC (ref. 104) rather than local MIC paralogy.

Minor comments:

p. 4 (4th bottom line): To my knowledge, ref #28 presents a draft (incomplete) MIC assembly of the Paramecium genome.

Thank you, we added reference 29 and adjusted the wording describing the quality of MIC genome draft assemblies.

p. 7 (last paragraph): "encoding" should be replaced by "carrying"

Thank you, we made the change.

p. 10 (2nd paragraph): insert a missing "o" into "nanochromosomes"

Thank you, corrected.

p. 10 (same paragraph): the weak 5' bias of IES distribution in Tetmemena should be shown (either as an additional panel in Figure 3 or in a Sup Figure).

Thank you, we added it as Figure S2C.

p. 24 2nd paragraph: "a" is missing in "Trinity, which is a software…"

Thank you, we made the correction.

Cross-Consultation Comments

I agree with most comments of reviewer 3.

The authors have actually defined "TE" in the introduction (p. 6). Depending on the journal's rules for abbreviation use, it may not be necessary to define it again in the Results section

Reviewer #1 (Significance (Required)):

Ciliates are unicellular models to study developmentally programmed genome rearrangements at the mechanistic, genome-wide and evolutionary levels. These aspects have so far mostly been addressed in three species: P. tetraurelia and Tetrahymena thermophila on the one hand, the spirotrichous ciliate O. trifallax on the other.

One new piece of information that can be found in the present manuscript is the assembly and annotation of the germline genome of two novel species: Tetmemena sp, closely related to Oxytricha, and the more distant E. woodruffi. Feng et al. establish that, similar to other ciliates, Tetmemena and Euplotes eliminate TEs and other germline-specific sequences during programmed genome rearrangements. They also undergo extensive gene unscrambling, which results in IES removal and MDS reordering to assemble coding sequences.

A TE origin was discussed previously for Paramecium (Arnaiz et al. PLoS Genet; Sellis et al. 2021 PLoS Biol) and Tetrahymena IESs (Hamilton et al. 2016 eLife). While this may also hold true in spirotrichous ciliates, the present manuscript proposes a completely new evolutionary scenario for IESs from scrambled genes. Here, Feng et al. establish that scrambled genes of spirotrichous ciliates tend to be associated with local paralogy. They provide evidence supporting that IESs from scrambled genes may have evolved from paralogous MDSs.

Although I am more an expert in the molecular mechanisms involved in genome rearrangements, I feel that the work reported here should draw the attention of a broader audience interested in genome dynamics and evolution, beyond the specific field of spirotrichous ciliate biology.

Reviewer #3 (Evidence, reproducibility and clarity (Required)):

Feng et al. provide a solid analysis of the evolution of genome rearrangement in spirotrich ciliates. The authors applied a variety of state-of-the-art sequencing and bioinformatic methods to investigate the intriguing and extremely complex patterns of genome architecture in this protist lineage. Methods (including statistical analyses) are adequate and explained in detail. Results and discussions reflect careful, clever analysis of the data and excellent linkage with the literature. Figures and tables complement the text in a compelling way. I have only minor suggestions:

Summary: more gradually introduce Spirotrichea and the phylogenetic relationship among the three species analyzed. This would better position the reader to understand the evolutionary context you are working in. Also, it would be helpful to more clearly differentiate novel vs. existing data. A suggestion: "This study focuses on three spirotrich species: two in the family Oxytrichidae (Oxytricha trifallax and Tetmemena sp) and Euplotes woodruffi as an outgroup. To complement existing data, we sequenced, assembled and annotated the germline and somatic genomes of E. woodruffi and the germline genome of Tetmemena sp."

Thank you, we clarified the summary (abstract).

Introduction, first paragraph: Replace "The species in this study…" with a more precise statement, such as "The three spirotrich species studied here…"

Thank you, we have made this statement more precise.

p. 4: This sentence is unclear: "These useful tools provide partial insight to guide selection of species for full genome sequencing, which allows construction of complete rearrangement maps of a MIC genome onto a MAC genome for a reference species."

Thank you, we have clarified this sentence.

p. 8: define TE on first mention.

Defined on page 6.

Table 1. Indicate which MIC and MAC data are from this study.

References are included for published data and a note has been added to indicate data from this study.

Reviewer #3 (Significance (Required)):

The present work represents a significant advance in the field of evolutionary genomics. The focus of the paper is on ciliates, an ancient (2-billion-year-old) and highly diverse eukaryotic phylum that presents many peculiarities, including sex, nuclear dimorphism, genome rearrangement, high numbers of paralogs and transposons, etc. While some data exist on a few model ciliates of disparate phylogenetic position, this work focuses on two species taxonomically placed in the same family, plus a more distant outgroup within the same class. This gives a novel dimension to this study, which goes beyond exploring genome architecture in a single clade. Instead, it allows exploration of evolutionary trends in genome rearrangement among relatively closely related species. This paper should be of high interest not only for ciliate biologists (like me), but also in relation to comparative genomics of protists/eukaryotes and germ-soma biology. I highly recommend publication.

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

A major concern is that the genome assemblies are poor. Much of this is probably because of the technical challenges associated with this system, but BUSCO scores of 76% and N50 <30kb for the MICs are red flags. This does make one wonder how confident one can be that some of the conclusions, such as those pertaining to paralogs, are not at least in part due to assembly artifacts. It is striking that most genes undergoing unscrambling have no orthologs in other species, and it is thus unclear what these genes actually do. Please provide additional evidence that the conclusions are unaffected by the limitations of the assemblies. Please also provide more statistics regarding the assemblies (e.g., expected length distribution of nanochromosomes in MACs, number of chromosomes in MICs vs contigs, etc.), and annotations (gene families -- how similar are they between species, how many "species-specific" genes [which the genomics community considers to be largely, though not entirely artifactual] etc.).

In this study, we compared both somatic MAC and germline MIC genomes in the three ciliate species. We would like to explain why we think all the genome assemblies are sufficiently high quality for our analysis.

The MAC genome of Euplotes woodruffi was newly assembled in this study, following an assembly pipeline similar to that used for Oxytricha trifallax (6) and Tetmemena sp. (7). The BUSCO score of 76% that we reported in our submitted manuscript was assessed using BUSCO v3 and lineage dataset eukaryota_odb9. Fortunately, BUSCO recently updated their database (88) to sample more species including ciliates (see below). Using the latest BUSCO v5.4.3 and lineage dataset alveolata_odb10, we now find that the MAC genome of E. woodruffi has a BUSCO score of 88.8%, the highest for any published Euplotes genome. Therefore, we are confident in the quality of our MAC genome assembly.

As the authors of BUSCO describe (87), the BUSCO score is just one metric of assessing genome completeness and is strongly influenced by (1) the gene prediction model (the nonmodel ciliate Euplotes uses a different genetic code, UGA=Cys, and has fewer curated genes compared to other well-established model ciliates; therefore it has a less mature gene prediction model) and (2) the genetic distance from the sampled species in the BUSCO dataset (e.g. the C. elegans genome only contains 90% of BUSCO genes, as shown in ref. 87). In our case, the missing detection of BUSCO genes is more likely the result of the evolutionary distance between ciliates and the species used in the previous BUSCO dataset.

We also assessed other ciliate genomes for comparison (see Author response image 1, below). The MAC genomes of four reference hypotrich ciliates (Tetmemena sp., Laurentiella sp., Paraurostyla sp. and Urostyla sp., ref. 7), which were previously assessed as "complete" using core eukaryotic genes (CEG), have current BUSCO scores ranging from 88.3% to 94.1% (Author response image 1). Urostyla sp., which is the most diverged from Oxytricha trifallax (which is included in the current BUSCO reference dataset), has the lowest BUSCO score among those; this is consistent with its greater evolutionary distance and does not reflect a less complete genome. Note that some ciliate genomes have BUSCO scores near 100% because they were included in the BUSCO dataset. Since no Euplotes species were sampled in the BUSCO dataset, we conclude that 88.8% is within the range of a high-quality ciliate MAC genome assembly.

Author response image 1. — The three species in the present manuscript are shown in bold. Species with a * are expected to have high BUSCO scores because they were sampled in the BUSCO reference dataset (alveolata_odb10). MAC genomes included here: Euplotes octocarinatus (8), Euplotes vannus (9), Laurentiella sp., Paraurostyla sp., Urostyla sp. (shown to be complete based on the presence of core eukaryotic genes, 7), Halteria grandinella (Zheng et al., Genbank RRYP01000000; https://journals.asm.org/doi/10.1128/mBio.01964-20), Paramecium tetraurelia (https://paramecium.i2bc.paris-saclay.fr/download/Paramecium/tetraurelia/51/annotations/ptetraurelia_mac_51/), *Tetrahymena thermophila* (http://ciliate.org/index.php/home/downloads).

We also provided in the Methods section two other commonly used metrics to assess MAC genome completeness, as in Swart et al (6) and Chen et al. (7): 80.6% of E. woodruffi MAC contigs contain at least one telomere. Furthermore, the E. woodruffi MAC genome contains tRNAs for all 20 amino acids. Both assessments support the conclusion that the E. woodruffi MAC genome assembly is of high quality.

The short N50 of MAC genomes is expected for ciliate MAC genomes with gene-sized chromosomes. We added Figure 2 - figure supplement 2 to the manuscript to show the length distribution of MAC nanochromosomes (telomere-to-telomere assembled contigs). This is itself a slight underestimate of the expected MAC length distribution, because the longest MAC chromosomes have a lower probability of complete assembly. It was also expected that Euplotes species would have shorter MAC nanochromosomes, compared to Oxytricha and Tetmemena, based on prior gel electrophoresis (Swanton, Greslin, and Prescott. 1980. Chromosoma 77:203-215), which reported that MAC DNA molecules in Euplotes aediculatus appeared to migrate faster than those of other hypotrichs (see Author response image 2, below).

Author response image 2. — Figure 8 in ref. 13 and Swanton, Greslin, and Prescott. 1980. Chromosoma 77:203-215.

MIC genomes, on the other hand, are more challenging to assemble because of the high content of repetitive elements that hinder assembly (Figure 2 - figure supplement 1). In this study, the major use of the MIC genome assembly is to infer MDS and IES annotations. We report that 90% of the MAC chromosomes are well covered (>90%) in the MIC genome (reported in Results). This level of completeness is comparable to that of the Oxytricha trifallax reference genome (1), and therefore is sufficient to permit comparison of rearrangement maps, including scrambled loci.

The gene annotation and ortholog analysis in this study is only performed on MAC genes. The MIC genome is not used for gene or paralog annotation because coding regions are contained in MDSs that are interrupted by IESs and/or scrambled. Therefore, the fragmentation of MIC contigs does not influence the quality of gene annotation or paralog/ortholog assignment. Paralogs were carefully analyzed to include only genes found on telomere-terminating contigs (G₄T₄G₄ or C₄A₄C₄), to be confident that they represent MAC chromosomes. Furthermore, the MAC genome assemblies were clustered to a level of sequence similarity of 95% in order to collapse and therefore exclude allelic differences. For the purpose of studying paralogous MDSs in the MIC genome, we only included paralogous MDSs that map to the same MIC contig. This excludes the possibility of conflating MDSs that map to different contigs. Therefore, the higher levels of paralogy that we report for scrambled genes (both orthogroup size, which entirely derives from the MAC assembly, and numbers of paralogous MDSs identified on MIC contigs) are not likely due to assembly artifacts but meet a stringent quality standard.
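
The telomere-based contig filter mentioned above can be sketched in Python as follows. This is an illustrative sketch rather than the scripts used in the study: the 30 bp terminal search window, the `min_ends` parameter, and the function names are assumptions, and only the two motifs named in the text (G4T4G4 and C4A4C4) are searched.

```python
# C4A4C4 and G4T4G4, the two telomeric motifs named in the text
TELOMERE_MOTIFS = ("CCCCAAAACCCC", "GGGGTTTTGGGG")


def telomere_ends(seq, window=30):
    """Return (head_has_telomere, tail_has_telomere) for one contig sequence."""
    seq = seq.upper()
    head, tail = seq[:window], seq[-window:]
    has_head = any(motif in head for motif in TELOMERE_MOTIFS)
    has_tail = any(motif in tail for motif in TELOMERE_MOTIFS)
    return has_head, has_tail


def telomeric_contigs(contigs, min_ends=2, window=30):
    """Names of contigs carrying a telomeric motif on at least `min_ends` ends."""
    return [name for name, seq in contigs.items()
            if sum(telomere_ends(seq, window)) >= min_ends]


# Hypothetical toy contigs, not sequences from the assemblies
contigs = {
    "full": "CCCCAAAACCCCAAAA" + "ACGT" * 20 + "TTTTGGGGTTTTGGGG",
    "half": "CCCCAAAACCCC" + "ACGT" * 20,
    "none": "ACGT" * 25,
}
print(telomeric_contigs(contigs))              # contigs with telomeres on both ends
print(telomeric_contigs(contigs, min_ends=1))  # contigs with at least one telomere
```

Setting `min_ends=1` corresponds to the "at least one telomere" completeness metric, whereas `min_ends=2` selects telomere-to-telomere contigs.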

Details of gene families identified by OrthoFinder for scrambled and nonscrambled genes were provided in Supplementary File 4. We also provide a new Supplementary File 5 to summarize gene families detected by OrthoFinder for each pair of species. Gene annotations for the three species have been uploaded to https://knot.math.usf.edu//mds_ies_db/2022/downloads.html. Supplementary Files 4 and 5 also provide an estimate of "species-specific" genes. We do not observe more species-specific genes among scrambled vs. nonscrambled loci (see the response to point 2 below).

We do not presently report the number of MIC chromosomes in any species, but this will be possible for Oxytricha trifallax from a new study based on Hi-C from another first author. Chen et al. (1) reported an estimate based on collapsing MIC-telomere-containing PacBio reads, but we find that Hi-C will provide a more accurate estimate.

Please spell out more clearly what you consider the evolutionary forces that have increased scrambling. My reading is that you consider many of the scrambled genes not to be functionally very important, which would be in line with there being so few orthologs in other species. To this end, please provide some basic information as to how many of the scrambled genes have expression support, what the distribution of expression quantiles is etc. -- this could help to shore up these inferences.

We do not suggest that scrambled genes are less important than nonscrambled ones. Supplementary File 4 shows that a similar portion of scrambled vs. nonscrambled genes have orthologs in other ciliates. The lack of orthologs in other ciliates may be due to less investigation of closely related species. This could explain why Oxytricha genes, whether scrambled or nonscrambled, have more orthologs detected in other ciliates compared to Euplotes.

Our model suggests that local duplication provides a buffer against mutations, allowing paralogous MDSs to repair the MAC locus during assembly of the scrambled genes. Explaining the increased levels of scrambling in the Oxytricha lineage does not require invoking a fitness advantage, however. It is more likely that a neutral ratchet drives the increase in scrambling: once a scrambled architecture has been established, it is difficult to "erase" it from the germline MIC genome, whereas MDSs tend to shorten as more mutations accumulate in either paralog, which leads to increased levels of fragmentation (68, 69). We added this to the discussion.

We followed the editor’s suggestion to analyze gene expression data for all species. We collected poly-A enriched mRNA and sequenced three replicates from asexually growing vegetative cells. We find that scrambled and nonscrambled genes have nearly identical levels of expression support (at least one read in all 3 replicates) in both Oxytricha (Supplementary File 8) and Tetmemena. E. woodruffi has more expression support for nonscrambled vs. scrambled genes, on the other hand (Supplementary File 8), which could be due to the more recent acquisition of scrambled loci in E. woodruffi, based on the stronger correlations observed between MDS and IES length and sequence similarity for MDS/IES pairs flanked by the same pointers (Figure 4 - figure supplement 2). Thus, it is possible that their nonscrambled paralogs may still serve the major function.
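
The expression-support criterion (at least one read in all three replicates) can be written as a short Python sketch. The function names and the toy read counts below are hypothetical and are not data from Supplementary File 8.

```python
def has_expression_support(counts, min_reads=1):
    """True if every replicate has at least `min_reads` reads for the gene."""
    return all(count >= min_reads for count in counts)


def support_fraction(read_counts):
    """Fraction of genes with expression support.

    read_counts maps a gene name to its list of read counts, one per replicate.
    """
    if not read_counts:
        return 0.0
    supported = sum(has_expression_support(c) for c in read_counts.values())
    return supported / len(read_counts)


# Hypothetical counts for three replicates per gene
scrambled = {"g1": [5, 3, 1], "g2": [0, 4, 2]}
nonscrambled = {"g3": [1, 1, 1], "g4": [10, 0, 0], "g5": [2, 2, 2], "g6": [7, 8, 9]}
print(support_fraction(scrambled), support_fraction(nonscrambled))
```

A gene with a single replicate lacking reads (such as `g2` or `g4` here) is counted as unsupported, which makes the criterion conservative.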

We also compared the expression levels of scrambled vs. nonscrambled genes. Only genes with a low Coefficient of Variation (CV<1) of TPM (transcripts per million) are included in the comparison (new Figure 4 - figure supplement 3). The distribution of expression levels is similar for scrambled vs. nonscrambled genes (Figure 4 - figure supplement 3). In a Mann-Whitney U test, the average expression level among three replicates is significantly higher in nonscrambled genes for Oxytricha and E. woodruffi, but the difference is not significant for Tetmemena.
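
The CV filter and the rank-based comparison can be sketched in Python using only the standard library. Several choices here are assumptions rather than details of the analysis: the CV uses the sample standard deviation, the Mann-Whitney p-value uses a normal approximation without tie or continuity correction, and all names and TPM values are hypothetical. A library routine such as `scipy.stats.mannwhitneyu` would normally be preferred for real data.

```python
import math
from statistics import mean, stdev


def low_cv_means(tpm, max_cv=1.0):
    """Mean TPM per gene, keeping genes whose coefficient of variation is below max_cv."""
    kept = {}
    for gene, values in tpm.items():
        m = mean(values)
        if m > 0 and stdev(values) / m < max_cv:
            kept[gene] = m
    return kept


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, no tie correction)."""
    n1, n2 = len(x), len(y)
    # U for x: number of (xi, yj) pairs with xi > yj, ties counting one half
    u1 = sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return u1, p


# Hypothetical TPM values for three replicates per gene
tpm = {"a": [10, 11, 12], "b": [0, 0, 30], "c": [5, 5, 5]}
filtered = low_cv_means(tpm)           # gene "b" is dropped (CV > 1)
u_stat, p_value = mann_whitney_u([10, 12, 14], [1, 2, 3])
print(filtered, u_stat, p_value)
```

Filtering on CV first removes genes whose replicate measurements disagree strongly, so that the rank test compares reasonably stable per-gene averages.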

Finally, the manuscript is overly long, and despite -- or perhaps because -- of this, it is not presented in the most accessible manner. For example, the abstract still struggles to explain the importance or relevance of the work. e.g. the last sentence is "Scrambled loci are more often associated with local duplications, supporting a simple model for the origin of scrambled genes via DNA duplication and decay". This would need some motivation, i.e., why should we care about scrambled genes per se? Similarly, why does the Introduction have an almost 1-page summary of the results?

We provide additional motivation for the study of scrambled genomes in the beginning of the abstract and the introduction. Oxytricha's scrambled genome and others in its lineage represent some of the most complex genome architectures known in any organism, with hundreds of thousands of precise, programmed genome editing events required to assemble coding regions. Recent exciting reports have described scrambled genomes in metazoa, including cephalopods, but those evolutionary events entail only shuffling of gene order, with no accompanying genome editing, so they are less complex, and other examples of programmed genome rearrangement in both protists and metazoa are usually simple DNA deletions.

We shortened other portions of the manuscript, eliminating or combining paragraphs where feasible, and added more motivation throughout, particularly emphasizing the compelling ways in which the Oxytricha lineage presents the most complex natural genome editing pathway known to date. Hence it is important to ask how such complex genomes have arisen, and comparative genomics provides fundamental insight into this question.

Associated Data

Data Citations

Feng Y, Neme R, Beh LY, Chen X, Braun J, Lu MW, Landweber LF. 2022. Euplotes woodruffi genome sequencing and assembly. NCBI BioProject. PRJNA781979
Feng Y, Neme R, Beh LY, Chen X, Braun J, Lu MW, Landweber LF. 2022. Tetmemena sp. micronucleus genome sequencing and assembly. NCBI BioProject. PRJNA694964
Feng Y, Neme R, Beh LY, Chen X, Braun J, Lu MW, Landweber LF. 2022. Euplotes woodruffi strain:Iz01. NCBI BioProject. PRJNA781602
Feng Y, Neme R, Beh LY, Chen X, Braun J, Lu MW, Landweber LF. 2022. RNA-seq of Tetmemena sp. NCBI BioProject. PRJNA887426
Channagiri T, Braun J, Feng Y, Landweber LF. 2022. MDS-IES database. MDSIESDB. db/2022/downloads
Feng Y, Neme R, Beh L, Chen X, Braun J, Lu M, Landweber L. 2022. MDS and IES annotations for Euplotes woodruffi, Tetmemena sp. and Oxytricha trifallax. Dryad Digital Repository. [DOI]
Feng Y, Neme R, Beh LY, Chen X, Braun J, Lu MW, Landweber LF. 2022. Tetmemena sp. Micronucleus genome. NCBI GenBank. JAJKFJ000000000
Feng Y, Neme R, Beh LY, Chen X, Braun J, Lu MW, Landweber LF. 2022. Euplotes woodruffi Micronucleus genome. NCBI GenBank. JAJLLS000000000
Feng Y, Neme R, Beh LY, Chen X, Braun J, Lu MW, Landweber LF. 2022. Euplotes woodruffi Macronucleus genome. NCBI GenBank. JAJLLT000000000
Chen X, Bracht JR, Goldman AD, Dolzhenko E, Clay DM, Swart EC, Perlman DH, Doak TG, Stuart A, Amemiya CT, Sebra RP, Landweber LF. 2014. Oxytricha trifallax micronucleus genome. NCBI Assembly. GCA_000711775.1
Swart EC, Bracht JR, Magrini V, Minx P, Chen X, Zhou Y, Khurana JS, Goldman AD, Nowacki M, Schotanus K, Jung S, Ly A, McGrath S, Haub K, Wiggins JL, Storton D, Matese JC, Parsons L, Chang WJ, Bowen MS, Stover NA, Jones TA, Eddy SR, Herrick GA, Doak TG, Wilson RK, Mardis ER, Landweber LF. 2013. Oxytricha trifallax macronucleus genome. NCBI Assembly. GCA_000295675.1
Chen X, Jung S, Beh LY, Eddy SR, Landweber LF. 2015. Tetmemena sp. macronucleus genome. NCBI Assembly. GCA_001273295.2
Beh LY, Debelouchina GT, Clay DM, Thompson RE, Lindblad KA, Hutton ER, Bracht JR, Sebra RP, Muir TW, Landweber LF. 2019. Genome-wide analysis of chromatin and transcription in the ciliates Oxytricha trifallax and Tetrahymena thermophila. NCBI Gene Expression Omnibus. GSE94421

Supplementary Materials

Figure 5—source data 1. Pointers conserved in all three species.

elife-82979-fig5-data1.xlsx (12.3KB, xlsx)

Figure 5—source data 2. The telomere-bearing element (TBE) pointers in Oxytricha that are conserved with non-TBE pointers in Tetmemena.

elife-82979-fig5-data2.xlsx (14.7KB, xlsx)

Supplementary file 1. Sequencing depth statistics for germline micronucleus (MIC) genome assemblies.

*Sequencing data from Chen et al., 2014.

elife-82979-supp1.docx (14.6KB, docx)

Supplementary file 2. Subcategories of repeat content in the three species.

elife-82979-supp2.docx (15KB, docx)

Supplementary file 3. Telomere-bearing elements (TBE)/transposon of Euplotes crassus (TEC) elements open reading frames in three species.

* Differs from 10,109 in Chen et al. (Chen and Landweber, 2016) because we used different versions of BLAST and custom python scripts to identify complete TBEs (see Methods).

elife-82979-supp3.docx (14.2KB, docx)

Supplementary file 4. Orthology among scrambled and nonscrambled genes in the three species.

* Ciliate database is generated by extracting all protein sequences in phylum Ciliophora (taxid: 5878) from NR database.

elife-82979-supp4.docx (15.9KB, docx)

Supplementary file 5. Summary of orthologs in each pair of species.

The (i,j) cell shows the number of genes in species i with an ortholog in species j.

* Genes with no ortholog detected by OrthoFinder (Emms and Kelly, 2019) in the other two species.

elife-82979-supp5.docx (14.6KB, docx)

Supplementary file 6. More scrambled somatic macronucleus (MAC) contigs contain at least one paralogous macronuclear destined sequence that may be involved in alternative rearrangement.

elife-82979-supp6.docx (14.5KB, docx)

Supplementary file 7. Macronuclear destined sequence (MDS)-internally eliminated sequence (IES) pairs share homologous sequences in the three species (related to Figure 4—figure supplement 2).

elife-82979-supp7.docx (15.6KB, docx)

Supplementary file 8. Genes with expression support in the three species.

elife-82979-supp8.docx (14.1KB, docx)

Supplementary file 9. Presence of conserved pointers in three species, with Monte Carlo simulations.

elife-82979-supp9.docx (14.9KB, docx)

Supplementary file 10. Scrambled pointers are more conserved than nonscrambled pointers.

elife-82979-supp10.docx (14.3KB, docx)

Supplementary file 11. Most pointers conserved in position are different in sequence.

elife-82979-supp11.docx (14KB, docx)

Supplementary file 12. Intron-IES conversion comparison in three species and Monte Carlo simulations.

elife-82979-supp12.docx (14.9KB, docx)

Supplementary file 13. Pairwise intron-IES conversion comparisons and Monte Carlo simulations.

elife-82979-supp13.docx (15KB, docx)

Supplementary file 14. PCR primers for validation of the Russian doll region in Tetmemena DNA (Figure 6A).

elife-82979-supp14.docx (14.9KB, docx)

MDAR checklist

elife-82979-mdarchecklist1.docx (106.4KB, docx)

Data Availability Statement