Significance
Endosymbiotic organelles like mitochondria and plastids fundamentally shaped eukaryotic evolution. Genetic integration, the host control of endosymbiont functions via endosymbiotic gene transfer and import of host-encoded proteins, is ubiquitously observed in these classic organelles and was hypothesized to be necessary for establishing permanent endosymbiotic compartments. To investigate their evolution, we report genome analyses of Epithemia diatoms, eukaryotic algae that contain recently acquired, nitrogen-fixing bacterial endosymbionts called diazoplasts. We observed minimal evidence of genetic integration of diazoplasts, even though they have been maintained over millions of years of Epithemia speciation. Our findings demonstrate that genetic integration is not required for establishing permanent endosymbiotic compartments. Epithemia diazoplasts contrast with traditional organelles and provide a more viable path to bioengineering new membrane compartments.
Keywords: endosymbiosis, organellogenesis, diatom, genome, cyanobacteria
Abstract
Endosymbiotic gene transfer (EGT) and import of host-encoded proteins have been considered hallmarks of organelles necessary for stable integration of two cells. However, newer endosymbiotic models have challenged the origin and timing of such genetic integration during organellogenesis. Epithemia diatoms contain diazoplasts, obligate endosymbionts derived from cyanobacteria that are closely phylogenetically related to UCYN-A, a recently described nitrogen-fixing organelle. Diazoplasts function as permanent membrane compartments in Epithemia hosts, but it is unknown if genetic integration has occurred. We report genomic analyses of two Epithemia diatom species, freshwater Epithemia clementina and marine E. pelagica, which are highly divergent but share a common ancestor at the origin of the endosymbiosis <35Mya. We find minimal evidence for genetic integration. Segments of fragmented and rearranged DNA from the diazoplast were detected integrated into the E. clementina nuclear genome, but the transfers that have occurred so far are nonfunctional. No DNA or gene transfers were detected in E. pelagica. In E. clementina, 6 host-encoded proteins of unknown function were identified in the diazoplast proteome, far fewer than detected in recently acquired endosymbiotic organelles. Overall, Epithemia diazoplasts are a valuable counterpoint to existing organelle models, demonstrating that endosymbionts can function as integral compartments—maintained over millions of years of host speciation—absent significant genetic integration. The minimal genetic integration makes diazoplasts valuable blueprints for bioengineering endosymbiotic compartments de novo.
Endosymbiotic organelles are uniquely eukaryotic innovations for the acquisition of complex cellular functions, including aerobic respiration, photosynthesis, and nitrogen fixation. Endosymbioses contributed to expansive eukaryotic diversity (1). An important question in cell evolution and engineering is: How do intermittent, facultative endosymbioses evolve into permanent integral cell compartments? With the recognition of the bacterial origin of mitochondria and chloroplasts in the 1980s, Cavalier-Smith & Lee proposed that the key distinction between a transient endosymbiont and an organelle was that organelles do not synthesize all their own proteins (2, 3). Instead, in organelles, some genes are transferred from the endosymbiont genome to the eukaryotic nucleus in a process called endosymbiotic gene transfer (EGT). These and other gene products, now under the control of host gene expression, are imported back into the endosymbiotic compartment to regulate endosymbiont growth and division. This definition has since been commonly applied (4, 5). However, the underlying hypothesis for organelle evolution—that genetic integration resulting from EGT and/or import of host-encoded gene products is essential for maintaining the endosymbiont as an integral cellular compartment—has not been rigorously tested.
In the decades since, increased sampling of eukaryotic diversity has uncovered evidence that, among microbes, endosymbioses are a common strategy for acquisition of new functions. Based on observations of EGT and host protein import, new organelles have been recognized: the chromatophore in Paulinella chromatophora (6–9) and UCYN-A in Braarudosphaera bigelowii (10). EGT and host protein import have also been observed in obligate, vertically inherited nutritional endosymbionts of the parasite Angomonas deanei and insects, which are not formally recognized as organelles (11, 12). With the benefit of these newer models, our understanding of genetic integration has become more nuanced (13–15). For example, the majority of host proteins imported into the Paulinella chromatophore do not originate from EGT but rather horizontal gene transfer (HGT) from other bacteria or eukaryotic genes (16), showing that a host’s repertoire of preexisting genes may play an outsized role in facilitating genetic integration (17). UCYN-A was initially described as having an unstable relationship with its host as it is often lost from host cells during isolation and in culture, suggesting environmental conditions can affect endosymbiont stability despite genetic integration (18). There have been bigger surprises: Some organisms temporarily acquire plastids from partially digested prey algae (19, 20). The retained chloroplasts, called kleptoplasts, perform photosynthesis and, in several species, depend on imported host proteins to fill gaps in their metabolic pathways. Despite their genetic integration, these kleptoplasts cannot replicate in the host cell and are not required for host cell survival, indicating that genetic integration is not sufficient to achieve stable integration of the endosymbiont (21–23). These findings highlight the importance of studying biodiverse organisms to inform new hypotheses for endosymbiotic evolution.
Among new model systems, Epithemia spp. diatoms offer a unique perspective on organellogenesis. These photosynthetic microalgae contain diazotroph endosymbionts (designated diazoplasts) that perform nitrogen fixation, a biological reaction that converts inert atmospheric nitrogen to bioavailable ammonia (24–29). The ability to fix both carbon and nitrogen fulfills a unique niche in ecosystems. Numerous Epithemia species are globally widespread in freshwater habitats and have recently been isolated from marine environments (30–32). The Epithemia endosymbiosis is very young relative to mitochondria and chloroplasts, having originated ~35 Mya, based on fossil records (33). Nonetheless, diazoplasts are obligate endosymbionts which are coordinately inherited during host cell division and present in all Epithemia species described so far, indicating a single, shared origin of the endosymbiosis and subsequent coevolution of diazoplasts and their host algae. Finally, Epithemia diazoplasts are phylogenetically related to UCYN-A, the diazotroph endosymbiont of B. bigelowii which was recently designated the first nitrogen-fixing organelle, or nitroplast (10, 29, 34). Both Epithemia diazoplasts and UCYN-A evolved from free-living Crocosphaera cyanobacteria relatively recently (35Mya and 91Mya respectively) (33, 35). The independent evolution of free-living Crocosphaera into diazotroph endosymbionts in multiple lineages of host microalgae enables comparisons that can lead to powerful insights.
If the significance of organelles lies in their function as integral cellular compartments, then metabolic and cellular integration with the host cell are paramount (36). By these criteria, diazoplasts show a level of host-symbiont integration comparable to UCYN-A. Nitrogen fixation requires large amounts of ATP and reducing power, energy that can be supplied by photosynthesis. Yet nitrogenase, the enzyme that catalyzes nitrogen fixation, is exquisitely sensitive to oxygen produced during oxygenic photosynthesis. In free-living Crocosphaera, photosynthesis and nitrogen fixation are temporally separated such that fixed carbon from daytime photosynthesis is stored as glycogen to fuel exclusively nighttime nitrogen fixation. Diazoplasts have lost all photosystem genes and depend entirely on host photosynthesis for fixed carbon (28, 37). Recently, we showed that host and diazoplast metabolism are tightly coupled to support continuous nitrogenase activity throughout the day-night cycle in Epithemia clementina: Diatom photosynthesis is required for daytime nitrogenase activity in the diazoplast, while nighttime nitrogenase activity also depends on diatom, rather than diazoplast, carbon stores (29). In comparison, UCYN-A has lost only photosystem II and is dependent on both host photosynthesis and, likely, its own photosystem I, restricting it to daytime nitrogen fixation (38, 39).
Epithemia diazoplasts also show robust cellular integration: host cells typically contain 1 to 2 diazoplasts that are vertically inherited during asexual cell division (26, 29, 31). Diazoplasts are also uniparentally inherited during sexual reproduction, similar to mitochondria and chloroplasts (40). Coordinated replication of UCYN-A with host cell division has been observed to maintain a single endosymbiont per cell (10). Similar mechanisms are likely in place to coordinate diazoplast inheritance with diatom division. In fact, the presence of diazoplasts in diverse Epithemia species globally widespread in freshwater and marine ecosystems demonstrates that the mechanisms of inheritance are robust through speciation events. Diazoplasts effectively serve as dedicated nitrogen-fixing compartments in Epithemia, whether or not genetic integration has occurred.
An important question emerges from these observations: Is EGT and/or host protein import required to achieve the level of host-symbiont integration observed between diazoplasts and host Epithemia? Based on the similarity of diazoplasts to UCYN-A, the assumption is yes. However, there is evidence that metabolite exchange via endosymbiont-encoded transporters (41) and division coordinated by host proteins outside the endosymbiotic compartment (11, 42–44) could form a stable compartment without genetic integration. We previously established freshwater E. clementina as a laboratory model for functional studies and herein performed de novo assembly and annotation of its genome. The genome sequence for E. pelagica, a recently discovered marine species, was publicly released by the Wellcome Sanger Institute (31, 45). To facilitate comparison between these species, we also performed de novo genome annotation of E. pelagica. Notably, no genome of B. bigelowii (which hosts UCYN-A) or of the eukaryotic host in any other diazotroph endosymbiosis has been available. We report genome and transcriptome analyses of these two Epithemia species as well as proteome analyses of E. clementina with the goals of 1) providing a necessary resource to accelerate investigation of this model and 2) elucidating the role of genetic integration in this very young, stably integrated endosymbiont.
Results
The Genome of E. clementina Is Larger and More Repetitive than E. pelagica.
Epithemia spp. are raphid, pennate diatoms composed of at least 50 freshwater species and 2 reported marine species (30, 31). Isolation and characterization of freshwater E. clementina was previously reported (29) (SI Appendix, Fig. S1A). We isolated high molecular weight DNA from axenic E. clementina cultures, performed sequencing by long-read Nanopore and short-read Illumina and assembled a 418 Mbp haploid assembly with a high level of heterozygosity of 1.48% (SI Appendix, Fig. S1B). The final reported haploid assembly is complete, contiguous, and of high sequence quality (Table 1 and SI Appendix, Fig. S1C). A chromosome-level 60 Mbp genome assembly (GCA_946965045) of E. pelagica, a marine species, was reported by the Sanger Institute (45). Whole-genome alignments of E. clementina and E. pelagica did not show significant syntenic blocks in their nuclear genomes (SI Appendix, Fig. S1E). In contrast, their diazoplast genomes showed 5 major and 2 minor syntenic blocks (SI Appendix, Fig. S1F), similar to the synteny reported between diazoplast genomes of other Epithemia species (34, 37).
Table 1.
Epithemia genome assembly statistics
| | E. clementina | E. pelagica |
|---|---|---|
| Genome size (bp) | 418,007,894 | 60,195,788 |
| GC | 44.3% | 48.19% |
| QV | 38.52 | – |
| Contig/chromosome # | 642 | 15 |
| N50 | 1,108,441 | – |
| L90 | 412 | – |
| Gene # | 26,453 | 20,203 |
| Repeat % | 80% | 27.36% |
| BUSCOgenome | 100% | 100% |
| BUSCOprotein | 99% | 94% |
| Diazoplast genome size (bp) | 3,072,807 | 2,483,960 |
| Diazoplast gene # | 1,910 | 1,679 |
Summary of assembly statistics for E. clementina and, where applicable, E. pelagica. Quality value (QV) represents a log-scaled estimate of the base accuracy across the genome, where a QV of 40 is 99.99% accurate. N50 and L90 are measures of genome contiguity. N50 represents the contig length (bp) such that 50% of the genome is contained in contigs ≥N50. L90 represents the minimum number of contigs required to contain 90% of the genome. Finally, BUSCO (Benchmarking Universal Single-Copy Orthologs) is an estimate of completeness of the genome (BUSCOgenome) and proteome (BUSCOprotein) of E. clementina and E. pelagica. “–” indicates no statistic.
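The contiguity and accuracy statistics defined in the caption can be illustrated with a minimal Python sketch. It assumes contig lengths are already available as a list of integers; the function names and the toy lengths are hypothetical and are not taken from the assembly pipeline used in this study.

```python
def qv_to_accuracy(qv):
    """Convert a Phred-scaled quality value to an estimated per-base accuracy."""
    return 1.0 - 10 ** (-qv / 10.0)


def n50(lengths):
    """Contig length such that contigs at least this long hold >= 50% of the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no contigs supplied")


def l90(lengths):
    """Minimum number of contigs (largest first) needed to hold >= 90% of the assembly."""
    total = sum(lengths)
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running * 10 >= total * 9:
            return count
    raise ValueError("no contigs supplied")


# Toy contig lengths (hypothetical, arbitrary units).
toy_contigs = [50, 40, 30, 20, 10]
toy_n50 = n50(toy_contigs)   # 50 + 40 = 90 >= 75, so N50 is 40
toy_l90 = l90(toy_contigs)   # 50 + 40 + 30 + 20 = 140 >= 135, so L90 is 4
```

Under this definition, the QV of 38.52 reported for E. clementina corresponds to an estimated base accuracy of roughly 99.986%.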
Both E. clementina and E. pelagica genomes were annotated using evidence from protein orthology and transcriptome profiling. The nuclear genomes were predicted to contain 20,203 genes in E. pelagica and 26,453 genes in E. clementina (Fig. 1A). The completeness of their predicted proteomes was assessed based on the presence of known single-copy orthologs in stramenopiles, yielding BUSCOprotein scores of 99% for E. clementina and 94% for E. pelagica (Table 1 and SI Appendix, Fig. S1D). While the gene numbers between E. clementina and E. pelagica are similar and typical for diatoms, the genome of E. clementina is 7 times larger (Fig. 1 A and B and Table 1). The increased genome size is due to a substantial repeat expansion unique to E. clementina (Fig. 1 B and C). Notably, the differences in genome size observed among diatoms are largely due to repeat content (Fig. 1B). Multiple LTR families and DNA transposons show expansions that contribute to the high repeat percent in E. clementina (Fig. 1C and SI Appendix, Table S2). This cross-family expansion may indicate a history of relaxed selection on the repeatome of E. clementina, perhaps related to a species bottleneck at the transition to freshwater habitats.
Fig. 1.

Highly divergent Epithemia clementina and E. pelagica genomes share many unique gene families. (A) Genome size and total gene number for published diatom genomes compared with Epithemia species (dark blue). (See also SI Appendix, Fig. S1 and Table S1.) (B) Comparison of repeat content in diatom genomes showing size of the whole genome (gray dots) or the genome excluding repeat elements (orange dots). X-axis is the same as 1C. (C) Breakdown of repeat types in diatom genomes showing amount in Mbp of the genome occupied by repeat elements of specific class, indicated by color. (See also SI Appendix, Table S2.) (D) Cumulative distribution of amino acid identity between pairwise orthologs from reference species. Estimated divergence time of species pair is indicated (Right bar graph). (See also SI Appendix, Fig. S2.) (E) UpSet plot depicting the number of uniquely shared orthogroups between all diatom species (first column) or subsets of 2 to 4 species. Orthogroups shared by E. pelagica and E. clementina are highlighted in brown. Columns are ranked by the number of uniquely shared orthogroups. (See also SI Appendix, Fig. S2.)
Highly Divergent E. clementina and E. pelagica Genomes Share Many Unique Gene Families.
As a measure of divergence at a functional level, we compared the amino acid identity between orthologs across proteomes of several pairs of representative diatom and metazoan species (Fig. 1D and SI Appendix, Table S1). Despite their estimated 35 Mya of speciation, E. pelagica and E. clementina showed a similar distribution of identity across protein orthologs as humans and pufferfish (Homo sapiens and Takifugu rubripes), which are estimated to have shared a common ancestor 429 Mya (46). This rapid divergence, relative to age, is also observed in other diatom species, for example, Thalassiosira pseudonana/Thalassiosira oceanica (70 Mya) and Pseudo-nitzschia multistriata/Pseudo-nitzschia multiseries (6.3 Mya) (46) (Fig. 1D and SI Appendix, Fig. S2A). The loss of synteny, substantial differential repeat expansion, and low protein ortholog identity suggest that E. pelagica and E. clementina have diverged substantially during speciation, reflecting the rapid evolution rates of diatoms (47, 48) (Fig. 1D and SI Appendix, Fig. S2A).
Because rapid divergence is common across diatoms, we evaluated the gene content of E. pelagica and E. clementina in comparison with other diatom species with complete genomes available (SI Appendix, Table S1). Gene families, defined by orthogroups, were identified for each species. We then quantified the number of uniquely shared gene families between subsets of diatom species, i.e. gene families shared between that group of species and not found in any other diatoms. Of 10,740 and 10,612 gene families identified in E. clementina and E. pelagica respectively, they share 8,942 gene families, a greater overlap than is observed between any other pair of diatom species (SI Appendix, Fig. S2A). Of these, 1,109 gene families are uniquely shared between E. clementina and E. pelagica, more than any other species grouping including the more recently speciated Pseudo-nitzschia species (Fig. 1E). This Epithemia-specific gene set is significantly enriched for functions relating to carbohydrate transport and membrane biogenesis, which may have been important for adaptation to the endosymbiont (SI Appendix, Fig. S2B).
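The notion of "uniquely shared" gene families used above can be made concrete with a short Python sketch. It assumes orthogroup membership has already been reduced to a mapping from orthogroup identifier to the set of species with at least one member gene; the identifiers, species labels, and function names below are hypothetical toy values, not output of the orthology analysis in this study.

```python
from collections import Counter


def uniquely_shared_counts(orthogroups):
    """Count orthogroups by the exact set of species that contain them.

    An orthogroup is 'uniquely shared' by a group of species when it is
    present in every species of that group and in no other species.
    """
    return Counter(frozenset(species) for species in orthogroups.values() if species)


def shared_between(orthogroups, species_a, species_b):
    """Count orthogroups present in both species, regardless of other species."""
    return sum(1 for species in orthogroups.values()
               if species_a in species and species_b in species)


# Toy presence/absence data (hypothetical orthogroup IDs and species labels).
toy_orthogroups = {
    "OG_1": {"E_clementina", "E_pelagica"},
    "OG_2": {"E_clementina", "E_pelagica"},
    "OG_3": {"E_clementina", "E_pelagica", "Diatom_X"},
    "OG_4": {"Diatom_X", "Diatom_Y"},
    "OG_5": {"E_clementina"},
}

toy_counts = uniquely_shared_counts(toy_orthogroups)
epithemia_only = toy_counts[frozenset({"E_clementina", "E_pelagica"})]
epithemia_shared = shared_between(toy_orthogroups, "E_clementina", "E_pelagica")
```

In the toy data, three orthogroups are shared by the two Epithemia species, but only two are uniquely shared, because one is also present in another diatom. Each column of an UpSet plot corresponds to one key of the counter.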
Because HGT is known to be a source of genes for endosymbiont functions expressed by the host (9) and 3 to 5% of diatom proteomes have been attributed to bacterial HGT (49), a significantly greater proportion than detected in other eukaryotic proteomes (50), we identified candidate HGTs in the Epithemia genomes. A total of 118 and 97 candidate HGTs were identified in E. clementina and E. pelagica respectively, of which 51 are uniquely shared within Epithemia (SI Appendix, Fig. S2 C, D and F and Table S3). This is a greater overlap than would be expected by gene family overlap alone and a greater number than shared by other diatom species pairs. Notably, HGTs identified in E. clementina and E. pelagica were enriched for genes of cyanobacterial origin compared to other diatoms (11% and 20%, compared to a diatom-wide mean of 4.5%) (SI Appendix, Fig. S2E). Overall, the uniquely shared features of the divergent nuclear genomes of the Epithemia genus are valuable for identifying potential signatures of endosymbiotic evolution.
Diazoplast-To-Nucleus Transfer of DNA Is Actively Occurring in E. clementina.
Having broadly compared the Epithemia genomes for shared features, we turned to specifically interrogate genetic integration between Epithemia and their diazoplasts. We distinguished EGT, which we defined as the transfer of functional genes from the endosymbiont to the host nucleus, from endosymbiont-to-nucleus transfers of DNA, which are believed to be more frequent and often nonfunctional. Indeed, it has been shown that nuclear integrations of organellar DNA originating from mitochondria (designated NUMT) and plastids (NUPT) still occur (51, 52). Given the significantly younger age of the diazoplast, it is not clear whether nuclear integrations of diazoplast DNA (which we will refer to as NUDT) and/or functional transfers of genes (EGT) have occurred. To identify transfers of endosymbiont DNA to the host nucleus, we performed homology searches against the nuclear assemblies of E. clementina and E. pelagica. As queries, we used the diazoplast genomes of 4 Epithemia species (including E. clementina and E. pelagica) and the 5 most closely related free-living cyanobacteria species for which whole genomes were available (SI Appendix, Table S1). To prevent spurious identifications, alignments were excluded if they were <500 contiguous base pairs in length. In E. clementina, we identified seven segments, ranging from 1700 to 6400 bp, with homology to the E. clementina diazoplast (Fig. 2 A and B and SI Appendix, Fig. S4 A–G). No homology to free-living cyanobacteria genomes was detected. The E. coli genome and a reversed sequence of the E. pelagica diazoplast were used as negative control queries and yielded no alignments. Finally, no regions of homology to any of the queries were detected in the nuclear genome of E. pelagica.
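The length cutoff described above can be sketched in Python as a filter over alignment records. This is an illustration under stated assumptions, not the pipeline used in this study: it assumes hits are stored in the default 12-column BLAST+ tabular layout (`-outfmt 6`), it approximates "contiguous base pairs" by the alignment length column (which also counts gap positions), and the contig names and values in the example are hypothetical.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(tsv_text, min_length=500):
    """Keep only alignments whose length is at least min_length bp."""
    reader = csv.DictReader(io.StringIO(tsv_text), fieldnames=FIELDS, delimiter="\t")
    return [row for row in reader if int(row["length"]) >= min_length]


# Two hypothetical hits of a diazoplast query against nuclear contigs.
example_hits = "\n".join([
    "\t".join(["diazoplast", "ctg_A", "98.4", "1700", "27", "0",
               "1", "1700", "5001", "6700", "0.0", "2990"]),
    "\t".join(["diazoplast", "ctg_B", "91.0", "320", "29", "0",
               "10", "329", "100", "419", "1e-120", "430"]),
])

kept_hits = filter_hits(example_hits)
kept_contigs = [hit["sseqid"] for hit in kept_hits]  # only the 1700 bp hit passes
```

Short alignments such as the 320 bp hit are discarded because brief stretches of similarity can arise by chance or from conserved motifs rather than from a transfer event.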
Fig. 2.

Detection of nuclear integrations of diazoplast DNA (NUDTs). (A) A representative, NUDT containing E. clementina nuclear genome locus on contig ctg002090. Tracks shown from Top to Bottom: nuclear subregion being viewed (red box) within the contig (black rectangle); length of the subregion, with ticks every 500 bp; nanopore sequencing read pileup, showing long read support across the NUDT; location of repeat masked regions (dark gray bars); locations of homology to E. clementina diazoplast identified by BLAST, demarcating the NUDT (blue shade); regions of homology to the E. clementina diazoplast identified by minimap2 alignment, colors represent SNVs between the diazoplast and nuclear sequence. (B) Same as A, for the NUDT on contig ctg003780. (C) Circlize plot depicting the fragmentation and rearrangement of NUDTs. The diazoplast genome (blue) and the NUDT on contig ctg002090 (brown) with chords connecting source diazoplast regions to their corresponding nuclear region, inversions in red. The length of the NUDT is depicted at 100x true relative length for ease of visualization. (D) Same as C, for the NUDT on contig ctg003780. (E) Ratio of long read depth of NUDT compared to average read depth for the containing contig. Heterozygous insertions (light gray bars) show approximately 0.5× depth; homozygous insertions (black bars) show approximately 1.0× depth. (F) GC content of NUDTs, compared to mean GC content for 5 kb sliding windows of the diazoplast genome (blue dashed line) and the nuclear genome (brown dashed line). Shaded regions represent mean ± 1 SD.
NUDTs showed features suggesting they were distinct from diazoplast genomic sequences and unlikely to be assembly errors. First, 4 of the 7 NUDTs were supported by long reads equivalent to 1x coverage of the genome, indicating the insertions were homozygous. Three NUDTs, contained on ctg003410, ctg001640, and ctg005680, showed the equivalent of 0.5x genome coverage, consistent with a heterozygous insertion in the diploid eukaryotic genome (Fig. 2E and SI Appendix, Fig. S4 A, B, and D). Second, NUDTs had low GC content similar to that of the diazoplast but contained many single nucleotide variants (SNVs), with mean identity of 98.4% to their source sequences, indicative of either neutral or relaxed selection (Figs. 2F and 3B). Finally, each NUDT was composed of multiple fragments corresponding to distal regions in the endosymbiont genome, ranging from as few as 8 distal fragments composing the NUDT on ctg002090 to as many as 42 on ctg003780 (Fig. 2 C and D and SI Appendix, Fig. S3 A–E). This composition of NUDTs indicates either that fragmentation and rearrangement of the diazoplast genome occurred prior to insertion into the eukaryotic genome or that NUDTs were initially large insertions that then underwent deletion and recombination. Overall, the detection of NUDTs shows that diazoplast-to-nucleus DNA transfer is occurring in this very young endosymbiosis.
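The read-depth reasoning above can be expressed as a small Python sketch: an insertion present on both haplotypes should be covered at about the same depth as its contig, whereas an insertion present on one haplotype should be covered at about half that depth. The tolerance of 0.2 and the example depths are assumptions chosen for illustration and are not thresholds or measurements reported in this study.

```python
def depth_ratio(nudt_depth, contig_depth):
    """Ratio of mean long-read depth over the insertion to mean depth of its contig."""
    if contig_depth <= 0:
        raise ValueError("contig depth must be positive")
    return nudt_depth / contig_depth


def classify_zygosity(ratio, tolerance=0.2):
    """Classify an insertion from its depth ratio (tolerance is an assumed value)."""
    if abs(ratio - 1.0) <= tolerance:
        return "homozygous"
    if abs(ratio - 0.5) <= tolerance:
        return "heterozygous"
    return "ambiguous"


# Hypothetical mean depths (insertion, containing contig).
call_a = classify_zygosity(depth_ratio(30.0, 30.0))  # ratio 1.0
call_b = classify_zygosity(depth_ratio(16.0, 31.0))  # ratio about 0.52
call_c = classify_zygosity(0.75)                     # between the two expectations
```

Ratios far from both 0.5 and 1.0 are left unclassified here, since collapsed repeats or mapping artifacts can distort local depth.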
Fig. 3.

Most NUDTs are decaying and nonfunctional. (A) Truncation of diazoplast genes contained within each NUDT relative to the full-length diazoplast gene. (B) Nucleotide identity of diazoplast genes that are <30% truncated (points) contained within each NUDT compared to identity of the full containing NUDT sequence (bars). (C) Normalized expression across each NUDT (blue highlight) ± 1 kb of the genomic region surrounding the NUDT. For each NUDT, a pair of tracks shows RNA-seq reads after polyA enrichment of whole RNA plotted within background signal range, from 0 to 2 BPM (Top, gray) and RNA-seq reads after rRNA depletion of whole RNA, plotted from 0 to 7 BPM (Bottom, black). The region corresponding to the tusA gene in ctg005680 is highlighted in dark blue.
Most NUDTs are Decaying and Nonfunctional.
To determine whether any of the identified NUDTs have resulted in EGT, we identified diazoplast genes present in NUDTs and evaluated their potential for function. A total of 124 diazoplast genes and gene fragments were carried over into the NUDTs (Fig. 3A). [A few of these diazoplast genes have conserved eukaryotic homologs and were also predicted as eukaryotic genes in the E. clementina genome annotation (SI Appendix, Fig. S4 A–G).] Of these, 121 diazoplast genes detected in NUDTs are truncated >30% compared to the full-length diazoplast gene (Fig. 3A). Of the three remaining, two genes contained on ctg002090 showed accumulation of SNVs that resulted in a premature stop codon and a nonstop mutation (Fig. 3B and SI Appendix, Fig. S3 F and G). We performed transcriptomics to assess the expression from NUDTs. Neither of the two genes on ctg002090 showed appreciable expression. All except one NUDT showed <2 bins per million mapped reads (BPM), equivalent to background transcription levels within the region (Fig. 3C and SI Appendix, Fig. S4 A–G). The truncation, mutation accumulation, and lack of appreciable expression of diazoplast genes encoded in NUDTs suggest that most are nonfunctional.
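The truncation screen described above amounts to comparing the length of each gene copy retained in a NUDT with the length of the full diazoplast gene. A minimal Python sketch follows; the gene names and lengths are hypothetical, and treating a copy with exactly 30% truncation as near full length is an assumption about the boundary case.

```python
def truncation_fraction(full_length, retained_length):
    """Fraction of the full-length gene that is missing from the transferred copy."""
    if full_length <= 0:
        raise ValueError("full_length must be positive")
    retained = min(retained_length, full_length)
    return 1.0 - retained / full_length


def near_full_length(genes, max_truncation=0.30):
    """Names of gene copies truncated by no more than max_truncation."""
    return [name for name, (full, retained) in genes.items()
            if truncation_fraction(full, retained) <= max_truncation]


# Hypothetical gene copies: name -> (full-length bp, bp retained in the NUDT).
toy_genes = {
    "geneA": (900, 900),    # intact
    "geneB": (1200, 400),   # about 67% truncated
    "geneC": (600, 450),    # 25% truncated
}

toy_candidates = near_full_length(toy_genes)
```

Copies that pass this length screen still need to be checked for disrupting mutations and for expression, as was done for the three remaining genes in the text.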
Only a single EGT candidate was detected, contained on ctg005680: an intact sulfurtransferase (tusA) gene that is 100% identical to the diazoplast-encoded gene (Fig. 3 A and B and SI Appendix, Fig. S4A). The NUDT that contains this candidate appears to be very recent as it is heterozygous and shows 99.7% identity to the source diazoplast sequence (Fig. 3B). Interestingly, tusA is implicated in Fe-S cluster regulation that could be relevant for nitrogenase function. Due to the high sequence identity, it is not possible to distinguish transferred tusA from tusA encoded in the diazoplast genome by sequence alone. However, transcript abundance above background levels was only detected in rRNA-depleted samples that contain diazoplast transcripts and not in polyA-selected samples that remove diazoplast transcripts, indicating that the observed expression is largely attributable to diazoplast-encoded tusA (Fig. 3C). Moreover, host proteins imported to endosymbiotic compartments often use N-terminal (occasionally C-terminal) targeting sequences (10, 53, 54). We were unable to identify any added sequences in the transferred tusA indicative of a targeting sequence; the immediately surrounding sequence consisted only of native diazoplast sequence carried over with the larger fragment (SI Appendix, Fig. S4A). Though there is no evidence for gene function, the transfer of this intact gene indicates that the conditions for EGT are present in E. clementina (4).
Few Host-Encoded Proteins Are Detected in the Diazoplast Proteome.
The critical step in achieving genetic integration is evolution of pathways for importing host proteins into the endosymbiont. While EGT and HGT from other bacteria can expand the host’s genetic repertoire, neither transferred genes nor native eukaryotic genes can substitute for or regulate endosymbiont functions unless the gene products are targeted to the endosymbiotic compartment. Abundant host-encoded proteins were detected in the proteomes of recently acquired endosymbionts that have been designated organelles: 450 in the chromatophore of Paulinella (9) and 368 in UCYN-A (10). In both the chromatophore and UCYN-A, several host-encoded proteins detected in the endosymbiont fulfill missing functions that complete endosymbiont metabolic pathways, providing further support for the import of host-encoded proteins.
To determine whether host protein import is occurring in the diazoplast, we identified the proteome of the E. clementina diazoplast. We were unable to maintain long-term E. pelagica cultures to perform proteomics for comparison. Diazoplasts were isolated from E. clementina cells by density gradient centrifugation (Fig. 4A). The purity of isolated diazoplasts was evaluated by light microscopy. The protein contents of isolated diazoplasts and of whole E. clementina cells containing diazoplasts were determined by LC–MS/MS. A total of 2481 proteins were identified with ≥2 unique peptides: 754 proteins were encoded by the diazoplast genome (detected/total protein coding = 43% coverage) and 1727 proteins were encoded by the nuclear genome (6.5% coverage) (Fig. 4B and SI Appendix, Table S4). Of note, TusA, the only EGT candidate identified, was not detected in either proteome. To identify proteins enriched in either the diazoplast or host compartments, we compared protein abundance in isolated diazoplasts and whole cell samples across 3 biological replicates (Fig. 4C). A total of 492 diazoplast-encoded proteins were significantly enriched in the diazoplast and none were enriched in whole cell samples, supporting the purity of the isolated diazoplast sample. Similarly, most host-encoded proteins (1281) were significantly enriched in whole cell samples, indicating localization in host compartments. Six unique host-encoded proteins were significantly enriched in diazoplast samples, suggesting possible localization to the diazoplast. Five were encoded by Ec_g00815, Ec_g12982, Ec_g13000, Ec_g13118, and Ec_g25610. The sixth protein was encoded by two identical genes, Ec_g24166 and Ec_g03819, which result from an apparent short duplication of this gene and two neighboring genes. Because the duplication makes Ec_g24166 and Ec_g03819 indistinguishable by amino acid sequence, we considered them one import candidate.
Of the 6 host protein import candidates, Ec_g25610 and Ec_g13000 were detected only in the diazoplast sample, while the rest were identified in both diazoplast and whole cell samples. We are unable to rule out the possibility of nonspecific enrichment since neither genetics nor immunofluorescence is available in E. clementina to further validate their protein localization.
Fig. 4.

Few host-encoded proteins are detected in the diazoplast proteome. (A) Electron micrographs of (Top) E. clementina cells with diazoplast (D), chloroplast lobes (C), and lipid bodies (L) indicated and (Bottom) diazoplasts following purification with thylakoids (yellow arrow) indicated. (B) Number of diazoplast-encoded (Left) and host-encoded (Right) proteins identified by LC–MS/MS. Total number of proteins identified from each respective proteome is shown above each stacked bar. Colored bars and numbers indicate proteins identified in purified diazoplasts only, whole cells only, or both. (C) Volcano plot showing the enrichment of diazoplast-encoded (blue) and host-encoded (brown) proteins in whole cells or purified diazoplasts, represented by the difference between log2-transformed iBAQ values. Proteins enriched in the diazoplast are on the Left side of the graph while those enriched in the host are on the Right; the darker shade of each color represents significantly enriched hits. Host-encoded proteins significantly enriched in the diazoplast are shown with larger brown markers.
Since the sensitivity of proteomics is highly dependent on biomass, we estimated the coverage of the diazoplast proteome based on the ratio of diazoplast-encoded proteins detected (754) compared to the total diazoplast protein-coding genes (1585). The coverage of the diazoplast proteome (48%) was comparable to the coverage of the published chromatophore proteome (422/867 = 49%) and that of UCYN-A (609/1186 = 51%). Proteome coverage therefore does not account for differences in the number of import candidates (9, 10). Rather, the detection of ~100-fold fewer import candidates in the diazoplast indicates that host protein import, if occurring, is far less extensive than in the chromatophore or UCYN-A.
Import Candidates Do Not Have Clear Functions in Diazoplast Pathways.
We sought additional evidence to support the import of these host proteins by evaluating their potential functions (55). No domains, GO terms, or BLAST hits (other than to hypothetical proteins found in other diatoms) were identified for any of the candidates except for Ec_g13118 which is annotated as an E3 ubiquitin ligase. In contrast, several host proteins detected in the chromatophore and UCYN-A proteome were assigned to conserved cyanobacterial growth, division, or metabolic pathways in these organelles. Moreover, none of the candidates for import into the diazoplast have homology to proteins encoded in diazoplast or free-living Crocosphaera genomes to suggest they might fulfill unidentified cyanobacterial functions. Instead, all candidates are diatom-specific proteins: 3 candidates (Ec_g24166/Ec_g03819, Ec_g12982, and Ec_g13000) belonged to orthogroup OG0000250 which is uniquely shared with E. pelagica but no other diatoms (Fig. 1E). The remaining 3 belonged to separate orthogroups (OG0001966, OG0004498, and OG0009247) which are shared broadly among diatoms including E. pelagica. Alignments of these candidates with their orthologs did not show N- or C-terminal extensions consistent with possible targeting sequences for endosymbiont localization. Even if imported into the diazoplast, we hypothesize that exclusively eukaryotic proteins, especially diatom-specific proteins, are less likely to substitute for endosymbiont functions than proteins conserved in bacteria. Our functional analysis does not provide support for the import of these host proteins or their function in cyanobacterial pathways.
We also took a targeted approach to look for host proteins that might fulfill missing endosymbiont functions. Among host proteins identified in the UCYN-A proteome, several have functional annotations that indicate they substitute for genes missing from the UCYN-A genome. We specifically identified the presence of these genes in diazoplast genomes of E. clementina, E. pelagica, E. gibberula, and E. turgida. In contrast to their absence in the UCYN-A genome, IspD, ThrC, PGLS, and PyrE are retained in diazoplast genomes, obviating the need for host proteins to fulfill their functions (SI Appendix, Fig. S5). One gene missing from UCYN-A and all diazoplast genomes is phosphofructokinase (PFK), a gatekeeper enzyme in glycolysis (SI Appendix, Fig. S5). In both UCYN-A and diazoplast genomes, all other enzymes of the glycolysis pathway are present. A host protein encoding PFK was detected in the UCYN-A proteome, indicating it likely substitutes for cyanobacterial PFK to complete glycolysis. The E. clementina genome contains 3 annotated PFK genes: Ec_g17255, Ec_g24955, and Ec_g25659. We specifically looked for these EcPFKs in diazoplast versus whole cell samples. Ec_g17255 was detected in all three whole cell replicates and one diazoplast replicate. Ec_g25659 was identified in 2 of 3 whole cell samples but not in any diazoplast samples. Finally, Ec_g24955 was not uniquely identified, as only a single peptide shared with Ec_g25659 was detected in whole cell samples. These results do not support EcPFKs being imported into diazoplasts and are consistent with observations that diazoplasts rely on the oxidative pentose phosphate pathway for carbohydrate catabolism, similar to nitrogen-fixing heterocystous bacteria (29, 56, 57).
Discussion
EGT and host protein import have been held as necessary to achieve the “permanent” maintenance of endosymbionts as integral cellular compartments, i.e. organelles (2, 5). But this hypothesis for organelle evolution has been challenged by findings in young endosymbionts from diverse organisms (13, 14, 36, 58). We report analysis of two genomes of Epithemia diatoms and evaluate the extent of their genetic integration with their nitrogen-fixing endosymbionts (diazoplasts), thereby adding this very young endosymbiosis to existing model systems that can elucidate the integration of two cells into one.
A window into the early dynamics of nuclear gene transfers. Our first significant finding was the detection of active diazoplast-to-nucleus DNA transfers but, as yet, no functional EGT in Epithemia. Our observations support findings in the chromatophore and UCYN-A that EGT is not necessary for genetic integration (10, 16, 59). Given that EGT does not necessarily precede evolution of host protein import pathways, it may be a suboptimal solution for the inevitable genome decay in small asexual endosymbiont populations as a consequence of Muller’s ratchet (60, 61). Instead, the decayed nature of the NUDTs we detected in E. clementina is consistent with stochastic, transient, ongoing DNA transfer. Nonfunctional DNA transfers were previously described from highly reduced mitochondria and plastids. The status of nuclear transfers from more recently acquired organelles is unknown, as only protein-coding regions were used as queries to identify chromatophore transfers in Paulinella and only a transcriptome is available for the UCYN-A host, B. bigelowii (10, 16, 62). Transfers from bacterial endosymbionts to animal hosts have also been described but likely have different dynamics, as the endosymbionts infect germline cells undergoing meiosis (63, 64). NUDTs in Epithemia genomes therefore provide a rare window into the early dynamics of DNA transfer. For example, using the same homology criteria, we identified 5 NUMTs but no NUPTs in E. clementina. The NUMTs were significantly shorter than NUDTs and did not show rearrangement, which may suggest different mechanisms of transfer for NUDTs, NUMTs, and NUPTs in the same host nucleus. In addition, between-species differences may identify factors that affect transfer rates. The lack of observed NUDTs in E. pelagica suggests constraints on diazoplast-to-nucleus transfers.
Previous observations in plant chloroplasts supported the limited transfer window hypothesis, which proposes that the mechanism of gene transfer requires endosymbiont lysis and therefore that the frequency of gene transfers correlates with the number of endosymbionts per cell (65–67). However, since E. pelagica and E. clementina contain similar numbers of diazoplasts per cell (1 and 2) (29, 31), the limited transfer window hypothesis does not explain the observed differences. Instead, there may be additional constraints imposed such as lower tolerance to DNA insertions in the comparatively smaller, nonrepetitive genome of E. pelagica. Finally, the lack of NUDT gene expression, even with transfer of a full-length unmutated tusA gene, points to barriers to achieving eukaryotic expression from bacterial gene sequence. Epithemia is an apt model system to interrogate how horizontal gene transfer impacts eukaryotic genome evolution, with at least 20 species easily obtained from freshwater globally and consistently adaptable to laboratory cultures (29–31, 34, 37).
Epithemia diazoplasts as a counterpoint to existing models of organellogenesis. A second unexpected finding was the detection of only 6 host protein candidates in the diazoplast proteome, far fewer and with less clear functional significance than in comparable endosymbionts that have been designated organelles. Methods for validating the localization of these import candidates are unavailable in Epithemia. Even if confirmed to be targeted to the diazoplast, the candidates lack conserved domains or homology with cyanobacterial proteins to indicate they replace or supplement diazoplast metabolic function, growth, or division. Our findings are not explained by current models of organellogenesis that propose import of host proteins as a necessary step to establish an integral endosymbiotic compartment. In the traditional model described in the introduction, host protein targeting is a “late” bottleneck step required for the regulation of the endosymbiont growth and division. More recently, “targeting-early” has been proposed to account for establishment of protein import pathways prior to cellular integration as observed in kleptoplasts (20, 21). In this model, protein import is selected over successive transient endosymbioses, possibly driven by the host’s need to export metabolites from the endosymbiont via transporters (68). The establishment of protein import pathways then facilitates endosymbiont gene loss with metabolic functions fulfilled by host proteins leading to endosymbiont fixation. Contradicting both models, we observed minimal evidence for genetic integration despite millions of years of coevolution resulting in diverse Epithemia species retaining diazoplasts, indicating that genetic integration is not necessary for their stable maintenance.
At a minimum, the unclear functions of the few host proteins identified in the diazoplast proteome, if imported, suggest that the genesis of host protein import in Epithemia is very different than would be predicted by current models.
Diazotroph endosymbioses are fundamentally different from photosynthetic endosymbioses that are the basis for current organellogenesis models. First, the diazoplast is derived from a cyanobacterium that became heterotrophic by way of losing its photosynthetic apparatus. Regulation of endosymbiont growth and division by the availability of host sugars (without requiring an additional layer of regulation via import of host metabolic enzymes) may be more facile with heterotrophic endosymbionts maintained for a nonphotosynthetic function compared to autotrophic endosymbionts. It will be interesting to see how integration of the diazoplast differs from the endosymbiont of Climacodium frauenfeldianum, another diazotrophic endosymbiont descended from Crocosphaera that likely retains photosynthesis (69, 70). Second, ammonia, the host-beneficial metabolite in diazotroph endosymbioses, can diffuse through membranes in its neutral form and does not require host transporters for efficient trafficking (71). Previously, we observed efficient distribution of fixed nitrogen from diazoplasts into host compartments following 15N2 labeling (29). Ammonia diffusion may have reduced early selection pressure for host protein import as posited by targeting-early models (21, 22). Finally, the eukaryotic hosts in most diazotroph endosymbioses are already photosynthetic, in contrast to largely heterotrophic hosts that acquired photosynthesis by endosymbiosis. For instance, cellular processes that enabled intracellular bacteria to take up residence in the ancestor of Epithemia spp. were likely different than those of the bacterivore amoeba ancestor of Paulinella chromatophora. Autotrophy and lack of digestive pathways would reduce the frequency by which bacteria might gain access to the host cell, such that the selection of host protein import pathways over successive transient interactions would be less effective. 
Overall, a universal model of organellogenesis is premature given the limited types of interaction that have been investigated in depth, highlighting instead the importance of increasing the diversity of systems studied.
Are diazoplasts “organelles”? As detailed in the introduction, diazoplasts show metabolic and cellular integration with their host alga comparable to that of UCYN-A, the first designated nitrogen-fixing organelle (10, 18). However, while hundreds of host proteins were detected in the UCYN-A proteome, including many likely to fill gaps in its metabolic pathways, a handful of host proteins with unknown function were detected in the diazoplast proteome. Based on Cavalier-Smith & Lee’s original hypothesis, which specifies genetic integration as the dividing line between endosymbionts and organelles, diazoplasts would not qualify (2, 5). However, over a decade ago, Keeling and Archibald (36) suggested that “if we use genetic integration as the defining feature of an organelle, we will never be able to compare different routes to organellogenesis because we have artificially predefined a single route.” They further hypothesized that if an endosymbiont became fixed in its host absent genetic integration, “it might prove to be even more interesting… by focusing on how it did integrate, perhaps we will find a truly parallel pathway for the integration of two cells.” The diazoplast appears to be such a parallel case in which nongenetic interactions were sufficient to integrate two cells.
If not gene transfer and host protein import, then what is the “glue” that holds this endosymbiosis together? The loss of cyanobacterial photosystems and dependence on host photosynthesis might allow diazoplast growth to be regulated by nutrient availability from the host (29). Though cyanobacteria are autotrophs, some express genes for sugar transport and may engage in mixotrophy (41, 72–75). Therefore, acquisition of sugar transporters via horizontal gene transfer from another bacterium to the diazoplast ancestor, prior to the endosymbiosis, rather than targeting of a eukaryotic transporter postendosymbiosis, may be more likely. Consistent with this hypothesis, potential transporters were not detected among host protein import candidates detected in UCYN-A (10). Similarly, cytosolic host proteins may coordinate diazoplast division from outside the endosymbiotic compartment (11, 43). Eukaryotic dynamins required for mitochondria and chloroplast fission localize to the surface of the organellar outer membrane, acting coordinately with bacterial fission factors located in the organelle (42, 76). Diazoplasts appear to be surrounded by a host-derived membrane (27, 77) (which may be lost during diazoplast purification) (Fig. 4A). In analogy to dynamins, host proteins localized to this outer membrane may mediate diazoplast fission without requiring protein import pathways. Finally, cell density (78) and mechanical confinement (79) have been demonstrated to limit the growth of cyanobacteria, suggesting that host regulation of the volume of the endosymbiotic compartment could also be an effective mechanism. The mechanisms for the robust metabolic exchange and coordinated division observed in diazoplasts will be the focus of future studies.
Application of cell evolution models to bioengineering. Diazoplasts provide another example that the current organelle definition does not account for observations in many biological systems and may be overdue for revision to reflect biological significance in the spectrum of endosymbiotic interactions. At a minimum, it is time to disentangle the current definition of an organelle from models that elucidate the formation of integral cellular compartments. Identifying mechanisms to integrate cells is more than an academic exercise. The ability to engineer bacteria as membrane compartments to introduce new metabolic functions in eukaryotes would be transformative (80, 81). For example, developing nitrogen-fixing crop plants that could replace nitrogen fertilizers is a major goal for sustainable agriculture. Efforts to transfer the genes for nitrogen fixation to plant cells have been slow, hampered by the many genes required as well as the complex assembly, high energy requirements, and oxygen sensitivity of the reaction. We previously proposed an alternative strategy inspired by diazotroph endosymbioses: introducing nitrogen-fixing bacteria into plant cells as an integral organelle-like compartment (29). This approach has the advantage that diazotrophs express all required genes with intact regulation, coupled to respiration, and in a protected compartment. Diazoplasts, which achieve stable integration without significant genetic integration, are an important alternative to UCYN-A and other organelles, which are defined by their genetic integration, to inform this strategy. Identifying the nongenetic interactions that facilitated diazoplast integration with Epithemia will be critical for guiding bioengineering.
Ongoing genome reduction may drive genetic integration in diazotroph endosymbioses. The smaller number of host protein candidates and their lack of clear function in diazoplasts versus UCYN-A are not associated with differences in their function as nitrogen-fixing cellular compartments. Rather, an alternative explanation lies in differences in the extent of genome reduction: diazoplasts encode 1585-1848 protein-coding genes (31% of free-living Crocosphaera genes), compared to 1200-1246 in UCYN-A (22%) (82). As noted above, several missing UCYN-A genes whose functions are replaced by imported host proteins (IspD, ThrC, PGLS, and PyrE) (10) are retained in diazoplast genomes, obviating the need for host proteins to fulfill their functions in diazoplasts (SI Appendix, Fig. S5). Host protein import observed in kleptoplasts, despite absence of vertical inheritance, may also be the result of their highly reduced genomes. Moreover, some dinotom dinoflagellates have acquired diatoms as permanent tertiary plastids. The diatom plastids show no genome reduction nor evidence of genome integration and, similar to diazoplasts, are likely maintained by nongenetic mechanisms (58, 83).
Consistent with diazoplasts and UCYN-A being at different stages of genome reduction, diazoplast genomes contain >150 pseudogenes compared to 57 detected in the UCYN-A genome, suggesting diazoplasts are in a more active stage of genome reduction (28, 34, 37). Interestingly, even genes retained in UCYN-A, namely PyrC and HemE, have imported host-encoded counterparts (10). The endosymbiont copies may have acquired mutations resulting in reduced function, necessitating import of host proteins to compensate. Alternatively, once efficient host protein import pathways were established, import of redundant host proteins may render endosymbiont genes obsolete, further accelerating genome reduction. Genetic integration may in fact be destabilizing for an otherwise stably integrated endosymbiont, at least initially, as it substitutes essential endosymbiont genes with host-encoded proteins that may not be functionally equivalent and require energy-dependent import pathways. Comparing these related but independent diazotroph endosymbioses yields valuable insight, which otherwise would not be apparent. Diazoplasts may represent an earlier stage of the same evolutionary path as UCYN-A, in which continued genome reduction will eventually select for protein import pathways. Alternatively, diazoplasts may have evolved unique solutions to combat destabilizing genome decay, for example through the early loss of mobile elements (28, 34, 84). Whether they represent an early intermediate destined for genetic integration or an alternative path, diazoplasts provide a valuable new perspective on endosymbiotic evolution.
Materials and Methods
High Molecular Weight DNA Extraction.
E. clementina were grown to a density of ~400,000 cells/mL and 20 to 30 million cells were used for each DNA extraction (see SI Appendix for culture details). Xenic cultures were centrifuged through a discontinuous Percoll gradient to deplete excess bacteria. Centrifugation steps were performed at 23 °C at 1,000×g. HMW DNA was isolated using nuclei isolation (85). Cells were suspended in a minimal volume of nuclear isolation buffer (NIB), transferred to a mortar, flash frozen, and ground with a pestle until a paste formed. Freezing and grinding were repeated for a total of three times. Cell homogenate was transferred to a 15 mL falcon tube containing NIB and incubated at 4 °C for 15 min. No miracloth filtering was performed. The cell homogenate was spun at 4 °C and 2,900×g. The resulting nuclei pellet was rinsed with 15 mL NIB until the solution was clear of photosynthetic pigments. This nuclei/cellular compartment isolate was used as input for the Nanobind plant nuclei big DNA kit from PacBio. Steps were followed as in the kit protocol except reagent volumes were doubled and the Proteinase K digestion step was extended to 2 h. The Short Read Eliminator kit from PacBio was used to deplete DNA fragments <25 kb.
Nanopore library preparation and sequencing.
For all sequencing runs from axenic cultures of E. clementina, 1 to 2 µg of HMW DNA was used as input to the Oxford Nanopore Ligation Sequencing Kit (SQK-LSK112). The nanopore protocol (Version: GDE_9141_v112_revC_01Dec2021) was followed with the following minor modifications: end repair was lengthened to 30 min at 20 °C and the adapter ligation was lengthened to 60 min at room temperature. Libraries were loaded onto primed, high-accuracy MinION R10.4 flow cells (FLO-MIN112) at a target of 9 fmoles of ~10 kb DNA. All sequencing of xenic cultures was performed similarly, but with sequencing kit (SQK-LSK110) and the flow cell MinION R9.4.1 (FLO-MIN111). If pore occupancy dipped below 1/3 of starting occupancy during the sequencing run, the run was paused and the flow cell was washed with the Flow Cell Wash Kit (EXP-WSH004). The same library was then reloaded and sequencing resumed. Each run continued until pore occupancy was near zero, 3 to 5 d.
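As a rough guide to what a loading target of 9 fmol of ~10 kb DNA means in mass terms, the conversion can be sketched as follows. The average mass of 650 g/mol per base pair is a common approximation and an assumption of this sketch, as are the function and parameter names; the source does not state how the molar amount was determined.

```python
# Convert a molar amount of double-stranded DNA to mass.
# ASSUMPTION: ~650 g/mol per base pair (a widely used average, not from the source).
AVG_G_PER_MOL_PER_BP = 650.0


def fmol_to_ng(fmol: float, length_bp: float,
               g_per_mol_per_bp: float = AVG_G_PER_MOL_PER_BP) -> float:
    """Return the mass in ng of `fmol` femtomoles of dsDNA of length `length_bp`."""
    if fmol < 0 or length_bp <= 0:
        raise ValueError("fmol must be non-negative and length_bp positive")
    # fmol x (g/mol) gives femtograms; 1 fg = 1e-6 ng.
    femtograms = fmol * length_bp * g_per_mol_per_bp
    return femtograms * 1e-6


if __name__ == "__main__":
    print(f"{fmol_to_ng(9, 10_000):.1f} ng")
```

Under this approximation the target corresponds to roughly 60 ng of library; the real figure depends on the actual fragment length distribution.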
Isolation of Genomic DNA for Illumina Sequencing.
DNA was extracted from axenic E. clementina cultures following the QIAGEN DNeasy Plant Pro Kit (69206) protocol. To lyse, cell suspension was transferred to QIAGEN tissue-disrupting tubes with 100 mg 0.5 mm autoclaved glass beads and beat in a bead-beater for 1 min. 300 ng of this axenic E. clementina DNA was used as input to the NEBNext Ultra II FS DNA Library Kit for Illumina (E7805S). A fragmentation time of 16 min was used for a target insert size of 200 to 450 bp. Samples were indexed with NEBNext Multiplex Oligos for Illumina Dual Index Primers Set 1 (E7600S). DNA concentration of libraries was determined with a Qubit dsDNA Quantification Assay High Sensitivity kit (Q32851). Final libraries were checked using an Agilent Bioanalyzer High-Sensitivity DNA chip. The mean insert size was 440 bp. The library was sequenced on an Illumina NextSeq 2000 P3 for 2 x 150 bp reads. Raw reads were trimmed and paired with fastp (--qualified_quality_phred 20, --unqualified_percent_limit 20) for a final total of 402 million read pairs from axenic E. clementina (86).
RNA Isolation and Sequencing.
Axenic cultures of E. clementina were seeded in 175 cm2 sterile vented flasks at 1.2 million cells/flask. For conditions of nitrogen repletion, media in the flask contained 100 µM of ammonium. Cells were kept in −N or +NH4+ conditions for 72 h and harvested 2 h into the day period. Cells in nitrogen depleted conditions were additionally collected 2 h into the night period. All cells were scraped from the flask, pelleted, resuspended in trizol, and flash frozen. Each condition was collected in triplicate, and the whole experiment was performed twice. To lyse, the trizol suspended cells were held on ice and sonicated with a microtip at 50/50 on/off pulses for 1 min at intensity setting six on a Branson 250 Sonifier (B250S). RNA was isolated using the QIAGEN RNeasy Plus Universal Mini Kit following the included protocol. 500 ng of RNA per condition per replicate collected from the first experiment was used as input to the NEBNext poly(A) mRNA Magnetic Isolation module to enrich for mRNA, and the NEBNext Ultra II Directional RNA Library Prep Kit for Illumina and indexes from NEBNext Multiplex Oligos for Illumina Dual Index Primers Set 1 were used for library preparation. 350 ng of RNA per condition per replicate collected from the second experiment was used as input to the Zymo-Seq RiboFree Universal cDNA Kit and indexed with Zymo-Seq UDI Primer Set. For each experiment, libraries were pooled and sequenced on an Illumina NextSeq 2000 P3 for 2 x 150 bp reads. E. pelagica could not be maintained long-term in culture under lab conditions after shipment. RNA was therefore extracted upon receipt of overnight shipment from University of Hawaii at Manoa, HI. Otherwise, the same method of poly(A)-enrichment and Illumina sequencing was used as for E. clementina.
Genome Assembly of E. clementina.
Initial genome size, ploidy, and repeat content estimates were made by counting k-mers in the axenic Illumina reads with jellyfish v2.2.10 (-C -m -k 35 -s 5G) and plotting with GenomeScope (87, 88). Raw fast5 files were basecalled with guppy v1.1.alpha13-0-g1ec7786. Reads were filtered on minimum length 3 kb and quality 20 with Nanofilt v2.8.0 (89). Read statistics were calculated with NanoPlot v1.30.1. Quality checks were performed with fastqc v0.11.9 (90). 19.5 Gb of sequence data from axenic cultures of E. clementina and 30.2 Gb of sequence data from xenic cultures were used for a two-step assembly process. First, axenic reads were assembled with NextDenovo v2.5.0 (91). Then, xenic nanopore sequencing data were aligned to the axenic assembly using minimap2 (-ax map-ont) v2.24-r1122 to identify diatom reads in the xenic data (92). Finally, axenic and diatom-mapped xenic nanopore reads were combined and assembled with NextDenovo (defaults, except read_cutoff = 5 k, genome_size = 350 M). Axenic Illumina data were mapped to the assembly with BWA v0.7.17-r1188 (93). Contigs in the assembly were removed if <70% of the contig was covered by axenic Illumina reads or if reads mapped at <4% of mean depth of the other contigs. The axenic Illumina reads were used for 3 rounds of polishing with Racon v1.5.0 and one of Polca (MaSuRCA v4.0.5) (94–96). Further contamination analysis of the assembly and reads was performed with blobtools v1.1.1 (97). Genomes for the diazoplast, chloroplast, and mitochondria were assembled and annotated as previously reported (29). All contigs in the assembly were aligned to the organellar and diazoplast genomes to check for remaining contaminants. Any remaining organellar contigs in the nuclear assembly were removed if they aligned end-to-end to the organellar or endosymbiont genomes. Assembly statistics were extracted with QUAST v5.2.0 (98). 
Final assembly completeness and quality were assessed with the k-mer tool Merqury v1.3 (99) and BUSCO v5.3.2 (100).
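The contig filtering rule described above (removal if <70% of the contig is covered by axenic Illumina reads, or if depth is <4% of the mean depth of the other contigs) can be sketched as below. This is an illustration under stated assumptions, not the authors' script: the `Contig` record, its field names, and the reading of "mean depth of the other contigs" as the unweighted mean over all remaining contigs are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Contig:
    name: str
    covered_fraction: float  # fraction of bases covered by axenic Illumina reads
    mean_depth: float        # mean Illumina read depth over the contig


def filter_contigs(contigs, min_covered=0.70, min_rel_depth=0.04):
    """Keep contigs that pass both the coverage and the relative-depth threshold."""
    kept = []
    for contig in contigs:
        # ASSUMPTION: the reference depth is the unweighted mean of all other contigs.
        others = [c.mean_depth for c in contigs if c is not contig]
        reference = mean(others) if others else contig.mean_depth
        if contig.covered_fraction < min_covered:
            continue  # poorly covered by axenic reads: likely contaminant
        if contig.mean_depth < min_rel_depth * reference:
            continue  # depth far below the rest of the assembly
        kept.append(contig)
    return kept


EXAMPLE = [
    Contig("ctg_A", 0.99, 100.0),
    Contig("ctg_B", 0.95, 90.0),
    Contig("ctg_C", 0.50, 80.0),  # fails the 70% coverage threshold
    Contig("ctg_D", 0.90, 2.0),   # fails the 4% relative-depth threshold
]
KEPT_NAMES = [c.name for c in filter_contigs(EXAMPLE)]
```

In the toy input, only `ctg_A` and `ctg_B` are retained. A length-weighted mean depth would be an equally reasonable reading of the text.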
Gene Annotation of E. clementina and E. pelagica.
The final nuclear assembly of E. clementina and the publicly released (45) nuclear sequence of E. pelagica (GCA_946965045.2) were used as input to the RepeatModeler2 and RepeatMasker pipelines (see SI Appendix for details). For E. pelagica and E. clementina, the masked nuclear genome was annotated in two independent runs of BRAKER2 v2.1.6. First, the BRAKER2 pipeline was given extrinsic protein evidence from the orthoDBv10 protozoa database manually edited to include diatom proteins from recent annotations. Second, the BRAKER2 pipeline was given transcriptomic evidence (101–110). To produce the aligned RNA-seq evidence, the RNA-seq reads were quality filtered, trimmed, and paired with fastp (86) v0.22.0 (--qualified_quality_phred 20, --unqualified_percent_limit 20), and aligned to their source genome with hisat2 v2.1.0 (111) (--rna-strandness RF). For E. pelagica, a single 280 million read Illumina run from polyA-enriched RNA was used. For E. clementina, RNA from 30 samples and five different conditions using both polyA-enrichment and rRNA-depletion methods was used. Outputs of these independent annotations were merged using TSEBRA v1.0.3 (104). GTF files were edited with the fix_gtf_ids.py script included with TSEBRA. The output GTF files were converted to multi-isoform fasta files, removing pseudogenes or genes interrupted by stop codons using gffread v0.12.7 (112) (-J --no-pseudo -y). Completeness of the annotation was assessed with BUSCO v5.3.2. To inspect isoforms, the AGAT v1.0.0 agat_sp_keep_longest_isoform.pl tool was used (113). Functional annotation was performed with eggNOG-mapper (114, 115). For each species, genes in shared, Epithemia-specific orthogroups were used for enrichment tests against a gene universe of functionally annotated genes with p-value cutoff 0.1. clusterProfiler (116) was used for tests of significance. Proportion of each COG term assigned within each set of functionally annotated genes was used to calculate difference in proportion.
Orthologue Analysis.
Curated species proteomes and genomes were downloaded from NCBI or associated online repositories (45, 48, 117–125) (SI Appendix, Table S1). The agat package was used to remove short isoforms (agat_sp_keep_longest_isoform.pl). Where necessary, gene feature files were reformatted (113) (agat_sp_manage_attributes.pl -p gene -att transcript_id). Finally, longest isoform proteomes were produced from the gene feature files and the corresponding species genome with gffread, removing genes without a complete, valid coding sequence and removing pseudogenes (112) (gffread -J --no-pseudo -y). The resulting proteomes were used as input for orthologue analysis. Orthogroups were identified with orthofinder v2.5.4 (-M msa -T iqtree) and orthogroup overlaps between species were extracted from Orthogroups_SpeciesOverlaps (see SI Appendix for quantitation details) (126). To quantify sequence similarity, orthologue pairs were identified by reciprocal best BLAST between organism pairs and the full-length percent amino acid identity was calculated from the BLAST outputs, similar to the method used in (127).
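The reciprocal best BLAST step can be illustrated with a minimal sketch. It assumes the BLAST results have already been parsed into (query, subject, bit score) tuples and ranks hits by bit score; the function names, the input layout, and the tie-breaking rule (first hit encountered wins) are assumptions of this sketch rather than details given in the source.

```python
def best_hits(rows):
    """Map each query to its highest-scoring subject.

    rows: iterable of (query_id, subject_id, bitscore) tuples.
    Ties are resolved in favor of the first hit encountered (an assumption).
    """
    best = {}
    for query, subject, bitscore in rows:
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a_id, b_id) pairs that are each other's best hit."""
    ab = best_hits(a_vs_b)
    ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in ab.items() if ba.get(b) == a)


# Toy data with hypothetical identifiers.
A_VS_B = [("a1", "b1", 200.0), ("a1", "b2", 150.0), ("a2", "b2", 90.0)]
B_VS_A = [("b1", "a1", 210.0), ("b2", "a1", 140.0), ("b2", "a2", 80.0)]
PAIRS = reciprocal_best_hits(A_VS_B, B_VS_A)
```

Here `a2` is excluded because its best hit, `b2`, prefers `a1`. Percent identity for the retained pairs would then be taken from the corresponding alignments.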
NUDT Homology Search.
Whole genomes of free-living cyanobacteria which were previously shown to be phylogenetically close relatives of the diazoplast (31) and two other genomes of Crocosphaera available on NCBI were curated along with available whole endosymbiont genomes (SI Appendix, Table S1). These were used as queries for homology searches against the nuclear genomes of E. pelagica and E. clementina. Command line BLASTN with defaults, BLASTN using the custom settings previously validated for NUMT search (128, 129) (-reward 1 -penalty -1 -gapopen 7 -gapextend 2), minimap2 (-ax asm5 and -HK19), and nucmer v4.0.0rc1 were used to perform these searches (92, 130). As negative controls, the reversed sequence of the E. pelagica mitochondrial genome and the E. coli genome were used. BLASTN identified all homology regions identified by other programs. Contiguous regions of homology <500 bp in length were not considered; contiguous regions >500 bp were considered candidate NUDTs. To verify that these alignments were not misassemblies, relative depth of long nanopore reads from axenic cultures spanning each insertion border was calculated. Contig read depth was calculated with samtools v1.16.1 (only primary alignments) (131). Mapped RNA-seq data from polyA enrichment and rRNA depletion experiments were normalized with deeptools v3.3.1 bamCoverage (--normalizeUsing BPM -p max -bs 100) (132). Using bedtools v2.30.0 intersect, the source regions were overlapped with endosymbiont gene regions (133). Nuclear and diazoplast sequences corresponding to regions containing gene fragments were aligned and % identity calculated using EMBOSS Needle v6.6.0.0 (134). Truncation was calculated by dividing gene fragment length by total source gene length. For both nuclear and diazoplast genomes, GC content variation was analyzed in sliding windows of 5000 bp and step size 1000 bp using bedtools makewindows and bedtools nuc.
Alignments were visualized on Integrative Genomics Viewer (135) (IGV) and plotted with circlize (136).
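Three of the calculations in this section (the 500 bp length threshold, the truncation ratio, and GC content in 5000 bp windows with a 1000 bp step) can be sketched in a few lines. The analysis itself used bedtools; the Python below is only an illustration of the arithmetic, and the function names, the 0-based half-open coordinates, and the clipping of windows at the sequence end are assumptions of this sketch.

```python
def candidate_nudts(hits, min_len=500):
    """Keep homology regions longer than min_len bp.

    hits: (contig, start, end) tuples with 0-based, half-open coordinates.
    """
    return [h for h in hits if (h[2] - h[1]) > min_len]


def truncation(fragment_len, source_gene_len):
    """Gene fragment length divided by the full-length source gene length."""
    if source_gene_len <= 0:
        raise ValueError("source_gene_len must be positive")
    return fragment_len / source_gene_len


def sliding_windows(seq_len, window=5000, step=1000):
    """Yield (start, end) windows; windows are clipped at the sequence end."""
    for start in range(0, seq_len, step):
        yield start, min(start + window, seq_len)


def gc_fraction(seq):
    """Fraction of G and C bases in seq (case-insensitive); 0.0 if empty."""
    if not seq:
        return 0.0
    upper = seq.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def windowed_gc(seq, window=5000, step=1000):
    """Return (start, end, gc_fraction) for each sliding window over seq."""
    return [(s, e, gc_fraction(seq[s:e]))
            for s, e in sliding_windows(len(seq), window, step)]
```

For example, `truncation(300, 1200)` gives 0.25, meaning the nuclear fragment spans a quarter of the source gene. How a region of exactly 500 bp should be treated is not specified in the text; the sketch excludes it.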
Diazoplast Isolation.
E. clementina cells were harvested by scraping, then washed twice in CSI-N growth medium by centrifugation at 2,000×g, and resuspended in isolation buffer (IB; 50 mM HEPES pH 8.0, 330 mM D-sorbitol, 2 mM EDTA NaOH pH 8.0, 1 mM MgCl2). Cells were placed in a bath sonicator for 10 min followed by 3 low pressure cycles (500 psi) and by 5 high pressure cycles (2,000 psi) in an EmulsiFlex-C5 Homogenizer or until most cells appeared lysed. After a 1-min spin at 100×g to pellet the unbroken cells and broken frustules, supernatant was collected and centrifuged at 3,000×g for 5 min to concentrate the diazoplasts and other organelles to a volume of 3 to 4 mL. This fraction was split equally, and each half was laid on a discontinuous Percoll gradient. A stock of 89% Percoll, 10% 10× PBS, and 1% 1 M HEPES pH 8.0 was diluted with IB to generate the gradient, which consisted of 2 mL 90%, 3 mL 70%, 3 mL 60%, 3 mL 50%. The gradient was centrifuged for 20 min at 12,000×g, 4 °C.
The boundaries between the 60 and 70% layers and the 70 and 90% were collected, counted, and checked for purity via light microscopy. They were then diluted 1:6 in IB and centrifuged at 2,000×g for 2 to 3 min to collect diazoplasts, which were resuspended in 200 µL Extraction Buffer [100 mM Tris-HCl, pH8.0, 2% (wt/vol) SDS, 5 mM EGTA, 10 mM EDTA, 1 mM PMSF, 2× protease inhibitor (1 tablet each of cOmplete™ Protease Inhibitor Cocktail, and Pierce™ Protease Inhibitor, EDTA free)]. During optimization, enrichment was assessed by Western blot for NifDK on both the diazoplast and whole cell extracts.
Protein Extraction, Preparation, and LC–MS/MS.
We generated whole-cell lysate by homogenizing with a bead beater at 3,000 strokes per minute for 3 min with 1 mm glass beads, or until most cells appeared lysed under a microscope. Diazoplasts were lysed similarly using 0.5 mm beads. Beads were pelleted at 100×g for 1 min and the supernatant was removed; the beads were washed twice with 50 µL extraction buffer each by vortexing and centrifugation. These fractions were added to the supernatant for a total of 300 µL, followed by an equal volume of cold Tris-buffered phenol (pH 7.5 to 7.9). This solution was vortexed for 1 min and centrifuged at 18,000×g for 15 min at 4 °C. The upper phase was discarded, and the phenol phase was then extracted with an equal volume of cold 50 mM Tris-HCl, pH 8.0. The phenol phase was extracted with Tris-HCl three times, followed by addition of 0.1 M ammonium acetate in methanol and overnight incubation at −80 °C. Samples were transferred to new tubes and centrifuged at 18,000×g for 20 min at 4 °C. The supernatant was discarded, and the pellet was washed once in 0.1 M ammonium acetate in methanol and twice in 1 mL cold methanol by centrifugation for 5 min at 18,000×g at 4 °C, followed by a short spin to remove trace methanol. The pellet was resuspended in 150 µL resuspension buffer (6 M Guanidine-HCl in 25 mM NH4HCO3 pH 8.0). Each sample was then reduced with TCEP at a concentration of 2 µM for 1 h at 56 °C, alkylated with iodoacetamide at a concentration of 10 mM for 1 h at ambient temperature, and then diluted with 3 volumes of 25 mM NH4HCO3. Sequencing-grade modified trypsin was added at a ratio of 1:50, followed by overnight incubation at 37 °C; trypsin addition was repeated the next morning, and the reaction was then quenched by adding formic acid to a concentration of 1%. Each sample was then loaded onto a C18 cartridge activated with 80% acetonitrile and 0.1% formic acid. The flow-through was loaded three times, followed by five washes with 1 mL 0.1% formic acid.
The samples were eluted with 200 µL of 80% acetonitrile 1% formic acid and the flow-through loaded three times.
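As a worked example of the dilution step before trypsin digestion, the short sketch below computes the guanidine-HCl concentration after the 150 µL sample is diluted with 3 volumes of buffer. It assumes the small volumes of TCEP and iodoacetamide stock added beforehand are negligible, which the text does not state; the function name `dilute` is hypothetical.

```python
from typing import Tuple


def dilute(conc: float, volume_ul: float, diluent_volumes: float) -> Tuple[float, float]:
    """Return (final concentration, final volume in µL) after dilution.

    diluent_volumes is the amount of diluent added, expressed as a
    multiple of the starting volume (3.0 means "3 volumes").
    """
    final_volume = volume_ul * (1.0 + diluent_volumes)
    return conc * volume_ul / final_volume, final_volume


# 6 M guanidine-HCl in 150 µL, diluted with 3 volumes of 25 mM NH4HCO3
final_molar, final_ul = dilute(6.0, 150.0, 3.0)
print(f"{final_molar} M guanidine-HCl in {final_ul} µL")
```

Under these assumptions the sample reaches 1.5 M guanidine-HCl in 600 µL before trypsin is added.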
Peptide concentration was determined using Pierce™ Quantitative Colorimetric Peptide Assay. LC–MS/MS of the peptides was performed as in (137) with minor adjustments (see SI Appendix for details). MaxQuant v2.5.0 and Perseus v2.0.10.0 were used for proteomics analysis (see SI Appendix for details).
Supplementary Material
Appendix 01 (PDF)
Dataset S01 (XLSX)
Dataset S02 (XLSX)
Dataset S03 (XLSX)
Dataset S04 (XLSX)
Dataset S05 (XLSX)
Acknowledgments
We thank Chris Schvarcz and Kelsey McBeain for generous sharing of E. pelagica cultures. We thank Dr. Devaki Bhaya, Dr. Arthur Grossman, and their lab members for support. We are grateful to Scott Miller and Heidi Abresch for their Epithemia expertise. Our thanks to Andres Reyes for mass spectrometry expertise. We thank Daniel S. Rokhsar, Jonathan Zehr, Kendra Turk-Kubo, Andy Alverson, Elizabeth Ruck, Paolo Carnevali, Dmitri Petrov, and Cedric Feschotte for advice. Anti-NifDK polyclonal antibodies were kindly provided by Dennis Dean. E.Y. is a Chan-Zuckerberg Biohub – San Francisco Investigator and is supported by the Burroughs Wellcome Fund. S.F. was partially funded by an NIH training grant (T32GM007276). M.S.O. was partially funded by an NIH training grant (5T32AI007328-32). The contents of this manuscript are solely the responsibility of the authors and do not represent the official views of the NCRR or the NIH. S.-L.X. is funded by NIH Grant S10OD030441 and the Carnegie Endowment Fund to the Carnegie Mass Spectrometry Facility.
Author contributions
S.F., M.S.-O., J.D., S.L.Y.M., T.B., and E.Y. designed research; S.F., M.S.-O., J.D., S.L.Y.M., T.B., and S.X. performed research; S.F. and S.L.Y.M. contributed new reagents/analytic tools; S.F., M.S.-O., J.D., and S.X. analyzed data; and S.F., M.S.-O., and E.Y. wrote the paper.
Competing interests
The authors declare no competing interest.
Footnotes
This article is a PNAS Direct Submission.
Data, Materials, and Software Availability
Supplemental raw data, sequencing data files, and genome assembly and annotation data have been deposited in Mendeley Data, NCBI, and GitHub (https://doi.org/10.17632/rr9t3ccbc5.1; PRJNA1147773; https://github.com/doenjon/Epithemia_assembly) (138–140). All study data are included in the article and/or supporting information.
References
1. Archibald J. M., Endosymbiosis and eukaryotic cell evolution. Curr. Biol. 25, R911–R921 (2015).
2. Cavalier-Smith T., Lee J. J., Protozoa as hosts for endosymbioses and the conversion of symbionts into organelles. J. Protozool. 32, 376–379 (1985).
3. Cavalier-Smith T., The origins of plastids. Biol. J. Linn. Soc. 17, 289–306 (1982).
4. Martin W., Herrmann R. G., Gene transfer from organelles to the nucleus: How much, what happens, and why? Plant Physiol. 118, 9–17 (1998).
5. Theissen U., Martin W., The difference between organelles and endosymbionts. Curr. Biol. 16, R1016–R1017 (2006).
6. Marin B., Nowack E. C. M., Melkonian M., A plastid in the making: Evidence for a second primary endosymbiosis. Protist 156, 425–432 (2005).
7. Nakayama T., Ishida K., Another acquisition of a primary photosynthetic organelle is underway in Paulinella chromatophora. Curr. Biol. 19, R284–R285 (2009).
8. Nowack E. C. M., Grossman A. R., Trafficking of protein into the recently established photosynthetic organelles of Paulinella chromatophora. Proc. Natl. Acad. Sci. 109, 5340–5345 (2012).
9. Singer A., et al., Massive protein import into the early-evolutionary-stage photosynthetic organelle of the amoeba Paulinella chromatophora. Curr. Biol. 27, 2763–2773.e5 (2017).
10. Coale T. H., et al., Nitrogen-fixing organelle in a marine alga. Science 384, 217–222 (2024).
11. Morales J., et al., Host-symbiont interactions in Angomonas deanei include the evolution of a host-derived dynamin ring around the endosymbiont division site. Curr. Biol. 33, 28–40.e7 (2023).
12. McCutcheon J. P., Boyd B. M., Dale C., The life of an insect endosymbiont from the cradle to the grave. Curr. Biol. 29, R485–R495 (2019).
13. Nowack E. C. M., Paulinella chromatophora: Rethinking the transition from endosymbiont to organelle. Acta Soc. Bot. Pol. 83, 387–397 (2014).
14. Keeling P. J., McCutcheon J. P., Doolittle W. F., Symbiosis becoming permanent: Survival of the luckiest. Proc. Natl. Acad. Sci. 112, 10101–10103 (2015).
15. Yamada N., Lewis W. H., Horiguchi T., Waller R. F., “Dinotoms illuminate early pathways to the stable acquisition of photosynthetic endosymbionts” in Endosymbiotic Organelle Acquisition: Solutions to the Problem of Protein Localization and Membrane Passage, Schwartzbach S. D., Kroth P. G., Oborník M., Eds. (Springer International Publishing, 2024), pp. 183–201.
16. Nowack E. C. M., et al., Gene transfers from diverse bacteria compensate for reductive genome evolution in the chromatophore of Paulinella chromatophora. Proc. Natl. Acad. Sci. 113, 12214–12219 (2016).
17. Ponce-Toledo R. I., López-García P., Moreira D., Horizontal and endosymbiotic gene transfer in early plastid evolution. New Phytol. 224, 618–624 (2019).
18. Suzuki S., et al., Unstable relationship between Braarudosphaera bigelowii (=Chrysochromulina parkeae) and its nitrogen-fixing endosymbiont. Front. Plant Sci. 12, 749895 (2021).
19. Sørensen M. E. S., et al., A novel kleptoplastidic symbiosis revealed in the marine centrohelid Meringosphaera with evidence of genetic integration. Curr. Biol. 33, 3571–3584.e6 (2023).
20. Hehenberger E., Gast R. J., Keeling P. J., A kleptoplastidic dinoflagellate and the tipping point between transient and fully integrated plastid endosymbiosis. Proc. Natl. Acad. Sci. 116, 17934–17942 (2019).
21. Larkum A. W. D., Lockhart P. J., Howe C. J., Shopping for plastids. Trends Plant Sci. 12, 189–195 (2007).
22. Keeling P. J., The number, speed, and impact of plastid endosymbioses in eukaryotic evolution. Annu. Rev. Plant Biol. 64, 583–607 (2013).
23. Sibbald S. J., Archibald J. M., Genomic insights into plastid evolution. Genome Biol. Evol. 12, 978–990 (2020).
24. Pfitzer E., Untersuchungen über Bau und Entwicklung der Bacillariaceen (Diatomaceen) (A. Marcus, 1871).
25. Drum R. W., Pankratz S., Fine structure of an unusual cytoplasmic inclusion in the diatom genus Rhopalodia. Protoplasma 60, 141–149 (1965).
26. DeYoe H. R., Lowe R. L., Marks J. C., Effects of nitrogen and phosphorus on the endosymbiont load of Rhopalodia gibba and Epithemia turgida (Bacillariophyceae). J. Phycol. 28, 773–777 (1992).
27. Prechtl J., Kneip C., Lockhart P., Wenderoth K., Maier U.-G., Intracellular spheroid bodies of Rhopalodia gibba have nitrogen-fixing apparatus of cyanobacterial origin. Mol. Biol. Evol. 21, 1477–1481 (2004).
28. Nakayama T., et al., Complete genome of a nonphotosynthetic cyanobacterium in a diatom reveals recent adaptations to an intracellular lifestyle. Proc. Natl. Acad. Sci. 111, 11407–11412 (2014).
29. Moulin S. L. Y., et al., The endosymbiont of Epithemia clementina is specialized for nitrogen fixation within a photosynthetic eukaryote. ISME Commun. 4, ycae055 (2024).
30. Ruck E. C., Nakov T., Alverson A. J., Theriot E. C., Phylogeny, ecology, morphological evolution, and reclassification of the diatom orders Surirellales and Rhopalodiales. Mol. Phylogenet. Evol. 103, 155–171 (2016).
31. Schvarcz C. R., et al., Overlooked and widespread pennate diatom-diazotroph symbioses in the sea. Nat. Commun. 13, 799 (2022).
32. Foster R. A., et al., Influence of the Amazon River plume on distributions of free-living and symbiotic cyanobacteria in the western tropical north Atlantic Ocean. Limnol. Oceanogr. 52, 517–532 (2007).
33. Benson M. E., Kociolek P. J., Spaulding S. A., Smith D. M., Pre-Neogene non-marine diatom biochronology with new data from the late Eocene Florissant Formation of Colorado, USA. Stratigraphy 9, 121–152 (2012).
34. Abresch H., Bell T., Miller S. R., Diurnal transcriptional variation is reduced in a nitrogen-fixing diatom endosymbiont. ISME J. 18, wrae064 (2024).
35. Cornejo-Castillo F. M., et al., Cyanobacterial symbionts diverged in the late Cretaceous towards lineage-specific nitrogen fixation factories in single-celled phytoplankton. Nat. Commun. 7, 11071 (2016).
36. Keeling P. J., Archibald J. M., Organelle evolution: What’s in a name? Curr. Biol. 18, R345–R347 (2008).
37. Nakayama T., Inagaki Y., Genomic divergence within non-photosynthetic cyanobacterial endosymbionts in rhopalodiacean diatoms. Sci. Rep. 7, 13075 (2017).
38. Muñoz-Marín M. del C., et al., The transcriptional cycle is suited to daytime N2 fixation in the unicellular cyanobacterium “Candidatus Atelocyanobacterium thalassa” (UCYN-A). mBio 10, e02495-18 (2019). 10.1128/mbio.02495-18.
39. Landa M., Turk-Kubo K. A., Cornejo-Castillo F. M., Henke B. A., Zehr J. P., Critical role of light in the growth and activity of the marine N2-fixing UCYN-A symbiosis. Front. Microbiol. 12, 666739 (2021).
40. Kamakura S., Mann D. G., Nakamura N., Sato S., Inheritance of spheroid body and plastid in the raphid diatom Epithemia (Bacillariophyta) during sexual reproduction. Phycologia 60, 265–273 (2021).
41. Nieves-Morión M., et al., Heterologous expression of genes from a cyanobacterial endosymbiont highlights substrate exchanges with its diatom host. PNAS Nexus 2, pgad194 (2023).
42. Gao H., Kadirjan-Kalbach D., Froehlich J. E., Osteryoung K. W., ARC5, a cytosolic dynamin-like protein from plants, is part of the chloroplast division machinery. Proc. Natl. Acad. Sci. 100, 4328–4333 (2003).
43. Zakharova A., et al., A neo-functionalized homolog of host transmembrane protein controls localization of bacterial endosymbionts in the trypanosomatid Novymonas esmeraldas. Curr. Biol. 33, 2690–2701.e5 (2023).
44. Maurya A. K., et al., A nucleus-encoded dynamin-like protein controls endosymbiont division in the trypanosomatid Angomonas deanei. Sci. Adv. 11, eadp8518 (2025).
45. Schvarcz C. R., et al., The genome sequences of the marine diatom Epithemia pelagica strain UHM3201 (Schvarcz, Stancheva & Steward, 2022) and its nitrogen-fixing, endosymbiotic cyanobacterium. Wellcome Open Res. 9, 232 (2024).
46. Kumar S., et al., TimeTree 5: An expanded resource for species divergence times. Mol. Biol. Evol. 39, msac174 (2022).
47. Kooistra W. H. C. F., Gersonde R., Medlin L. K., Mann D. G., “The origin and evolution of the diatoms: Their adaptation to a planktonic existence” in Evolution of Primary Producers in the Sea, Falkowski P. G., Knoll A. H., Eds. (Academic Press, 2007), pp. 207–249.
48. Bowler C., et al., The Phaeodactylum genome reveals the evolutionary history of diatom genomes. Nature 456, 239–244 (2008).
49. Vancaester E., Depuydt T., Osuna-Cruz C. M., Vandepoele K., Comprehensive and functional analysis of horizontal gene transfer events in diatoms. Mol. Biol. Evol. 37, 3243–3257 (2020).
50. Van Etten J., Bhattacharya D., Horizontal gene transfer in eukaryotes: Not if, but how much? Trends Genet. 36, 915–925 (2020).
51. Lopez J. V., Yuhki N., Masuda R., Modi W., O’Brien S. J., Numt, a recent transfer and tandem amplification of mitochondrial DNA to the nuclear genome of the domestic cat. J. Mol. Evol. 39, 174–190 (1994).
52. Timmis J. N., Ayliffe M. A., Huang C. Y., Martin W., Endosymbiotic gene transfer: Organelle genomes forge eukaryotic chromosomes. Nat. Rev. Genet. 5, 123–135 (2004).
53. Patron N. J., Waller R. F., Transit peptide diversity and divergence: A global analysis of plastid targeting signals. Bioessays 29, 1048–1058 (2007).
54. Oberleitner L., Perrar A., Macorano L., Huesgen P. F., Nowack E. C. M., A bipartite chromatophore transit peptide and N-terminal protein processing in the Paulinella chromatophore. Plant Physiol. 189, 152–164 (2022).
55. Paysan-Lafosse T., et al., InterPro in 2022. Nucleic Acids Res. 51, D418–D427 (2023).
56. Winkenbach F., Wolk C. P., Activities of enzymes of the oxidative and the reductive pentose phosphate pathways in heterocysts of a blue-green alga. Plant Physiol. 52, 480–483 (1973).
57. Summers M. L., Wallis J. G., Campbell E. L., Meeks J. C., Genetic evidence of a major role for glucose-6-phosphate dehydrogenase in nitrogen fixation and dark growth of the cyanobacterium Nostoc sp. strain ATCC 29133. J. Bacteriol. 177, 6184–6194 (1995).
58. Hehenberger E., Burki F., Kolisko M., Keeling P. J., Functional relationship between a dinoflagellate host and its diatom endosymbiont. Mol. Biol. Evol. 33, 2376–2390 (2016).
59. Keeling P. J., Horizontal gene transfer in eukaryotes: Aligning theory with data. Nat. Rev. Genet. 25, 416–430 (2024).
60. Muller H. J., The relation of recombination to mutational advance. Mutat. Res. 106, 2–9 (1964).
61. Allen J. M., Light J. E., Perotti M. A., Braig H. R., Reed D. L., Mutational meltdown in primary endosymbionts: Selection limits Muller’s ratchet. PLoS One 4, e4969 (2009).
62. Lhee D., et al., Amoeba genome reveals dominant host contribution to plastid endosymbiosis. Mol. Biol. Evol. 38, 344–357 (2020).
63. Kondo N., Nikoh N., Ijichi N., Shimada M., Fukatsu T., Genome fragment of Wolbachia endosymbiont transferred to X chromosome of host insect. Proc. Natl. Acad. Sci. 99, 14280–14285 (2002).
64. Dunning Hotopp J. C., Horizontal gene transfer between bacteria and animals. Trends Genet. 27, 157–163 (2011).
65. Smith D. R., Crosby K., Lee R. W., Correlation between nuclear plastid DNA abundance and plastid number supports the limited transfer window hypothesis. Genome Biol. Evol. 3, 365–371 (2011).
66. Barbrook A. C., Howe C. J., Purton S., Why are plastid genomes retained in non-photosynthetic organisms? Trends Plant Sci. 11, 101–108 (2006).
67. Lister D. L., Bateman J. M., Purton S., Howe C. J., DNA transfer from chloroplast to nucleus is much rarer in Chlamydomonas than in tobacco. Gene 316, 33–38 (2003).
68. Tyra H. M., Linka M., Weber A. P., Bhattacharya D., Host origin of plastid solute transporters in the first photosynthetic eukaryotes. Genome Biol. 8, R212 (2007).
69. Foster R. A., et al., Nitrogen fixation and transfer in open ocean diatom–cyanobacterial symbioses. ISME J. 5, 1484–1493 (2011).
70. Carpenter E. J., Janson S., Intracellular cyanobacterial symbionts in the marine diatom Climacodium frauenfeldianum (Bacillariophyceae). J. Phycol. 36, 540–544 (2000).
71. Ritchie R. J., The ammonia transport, retention and futile cycling problem in cyanobacteria. Microb. Ecol. 65, 180–196 (2013).
72. Nieves-Morión M., Flores E., Multiple ABC glucoside transporters mediate sugar-stimulated growth in the heterocyst-forming cyanobacterium Anabaena sp. strain PCC 7120. Environ. Microbiol. Rep. 10, 40–48 (2018).
73. Muñoz-Marín M. del C., et al., Prochlorococcus can use the Pro1404 transporter to take up glucose at nanomolar concentrations in the Atlantic Ocean. Proc. Natl. Acad. Sci. 110, 8597–8602 (2013).
74. Muñoz-Marín M. del C., López-Lozano A., Moreno-Cabezuelo J. Á., Díez J., García-Fernández J. M., Mixotrophy in cyanobacteria. Curr. Opin. Microbiol. 78, 102432 (2024).
75. Schneegurt M. A., Tucker D. L., Ondr J. K., Sherman D. M., Sherman L. A., Metabolic rhythms of a diazotrophic cyanobacterium, Cyanothece sp. strain ATCC 51142, heterotrophically grown in continuous dark. J. Phycol. 36, 107–117 (2000).
76. Leger M. M., et al., An ancestral bacterial division system is widespread in eukaryotic mitochondria. Proc. Natl. Acad. Sci. 112, 10239–10246 (2015).
77. Gast R. J., Sanders R. W., Caron D. A., Ecological strategies of protists and their symbiotic relationships with prokaryotic microbes. Trends Microbiol. 17, 563–569 (2009).
78. Esteves-Ferreira A. A., et al., A novel mechanism, linked to cell density, largely controls cell division in Synechocystis. Plant Physiol. 174, 2166–2182 (2017).
79. Moore K. A., et al., Mechanical regulation of photosynthesis in cyanobacteria. Nat. Microbiol. 5, 757–767 (2020).
80. Liu F., Fernie A. R., Zhang Y., Can a nitrogen-fixing organelle be engineered within plants? Trends Plant Sci. 29, 1168–1171 (2024). 10.1016/j.tplants.2024.07.001.
81. Elhai J., Engineering of crop plants to facilitate bottom-up innovation: A possible role for broad host-range nitroplasts and neoplasts. OSF [Preprint] (2023), https://osf.io/ny2rc [Accessed 2 August 2024].
82. Bombar D., Heller P., Sanchez-Baracaldo P., Carter B. J., Zehr J. P., Comparative genomics reveals surprising divergence of two closely related strains of uncultivated UCYN-A cyanobacteria. ISME J. 8, 2530–2542 (2014).
83. Yamada N., et al., Discovery of a kleptoplastic “dinotom” dinoflagellate and the unique nuclear dynamics of converting kleptoplastids to permanent plastids. Sci. Rep. 9, 10474 (2019).
84. McCutcheon J. P., Moran N. A., Extreme genome reduction in symbiotic bacteria. Nat. Rev. Microbiol. 10, 13–26 (2012).
85. Workman R., et al., High molecular weight DNA extraction from recalcitrant plant species for third generation sequencing. Protoc. Exch. (2018). 10.1038/protex.2018.059.
86. Chen S., Zhou Y., Chen Y., Gu J., Fastp: An ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018).
87. Ranallo-Benavidez T. R., Jaron K. S., Schatz M. C., GenomeScope 2.0 and Smudgeplot for reference-free profiling of polyploid genomes. Nat. Commun. 11, 1432 (2020).
88. Marçais G., Kingsford C., A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27, 764–770 (2011).
89. De Coster W., D’Hert S., Schultz D. T., Cruts M., Van Broeckhoven C., NanoPack: Visualizing and processing long-read sequencing data. Bioinformatics 34, 2666–2669 (2018).
90. Andrews S., Babraham Bioinformatics: FastQC, a quality control tool for high throughput sequence data. (2019), https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ [Accessed 2 August 2024].
91. Hu J., et al., Nextdenovo: An efficient error correction and accurate assembly tool for noisy long reads. Genome Biol. 25, 107 (2024).
92. Li H., Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
93. Li H., Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv [Preprint] (2013), http://arxiv.org/abs/1303.3997 [Accessed 2 August 2024].
94. Vaser R., Sović I., Nagarajan N., Šikić M., Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 27, 737–746 (2017).
95. Zimin A. V., et al., Hybrid assembly of the large and highly repetitive genome of Aegilops tauschii, a progenitor of bread wheat, with the MaSuRCA mega-reads algorithm. Genome Res. 27, 787–792 (2017).
96. Zimin A. V., Salzberg S. L., The genome polishing tool POLCA makes fast and accurate corrections in genome assemblies. PLoS Comput. Biol. 16, e1007981 (2020).
97. Laetsch D. R., Blaxter M. L., BlobTools: Interrogation of genome assemblies. F1000Research [Preprint] (2017), https://f1000research.com/articles/6-1287 [Accessed 2 August 2024].
98. Mikheenko A., Prjibelski A., Saveliev V., Antipov D., Gurevich A., Versatile genome assembly evaluation with QUAST-LG. Bioinformatics 34, i142–i150 (2018).
99. Rhie A., Walenz B. P., Koren S., Phillippy A. M., Merqury: Reference-free quality, completeness, and phasing assessment for genome assemblies. Genome Biol. 21, 245 (2020).
100. Manni M., Berkeley M. R., Seppey M., Simão F. A., Zdobnov E. M., BUSCO update: Novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes. Mol. Biol. Evol. 38, 4647–4654 (2021).
101. Kriventseva E. V., et al., OrthoDB v10: Sampling the diversity of animal, plant, fungal, protist, bacterial and viral genomes for evolutionary and functional annotations of orthologs. Nucleic Acids Res. 47, D807–D811 (2019).
102. Stanke M., Diekhans M., Baertsch R., Haussler D., Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 24, 637–644 (2008).
103. Stanke M., Schöffmann O., Morgenstern B., Waack S., Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics 7, 62 (2006).
104. Gabriel L., Hoff K. J., Brůna T., Borodovsky M., Stanke M., TSEBRA: Transcript selector for BRAKER. BMC Bioinformatics 22, 566 (2021).
105. Brůna T., Hoff K. J., Lomsadze A., Stanke M., Borodovsky M., BRAKER2: Automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database. NAR Genomics Bioinforma. 3, lqaa108 (2021).
106. Brůna T., Lomsadze A., Borodovsky M., GeneMark-EP+: Eukaryotic gene prediction with self-training in the space of genes and proteins. NAR Genomics Bioinforma. 2, lqaa026 (2020).
107. Lomsadze A., Ter-Hovhannisyan V., Chernoff Y. O., Borodovsky M., Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Res. 33, 6494–6506 (2005).
108. Iwata H., Gotoh O., Benchmarking spliced alignment programs including Spaln2, an extended version of Spaln that incorporates additional species-specific features. Nucleic Acids Res. 40, e161 (2012).
109. Gotoh O., Morita M., Nelson D. R., Assessment and refinement of eukaryotic gene structure prediction with gene-structure-aware multiple protein sequence alignment. BMC Bioinformatics 15, 189 (2014).
110. Brůna T., Lomsadze A., Borodovsky M., A new gene finding tool GeneMark-ETP significantly improves the accuracy of automatic annotation of large eukaryotic genomes. bioRxiv [Preprint] (2024), https://www.biorxiv.org/content/10.1101/2023.01.13.524024v5 [Accessed 13 December 2024].
111. Kim D., Paggi J. M., Park C., Bennett C., Salzberg S. L., Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat. Biotechnol. 37, 907–915 (2019).
112. Pertea G., Pertea M., GFF Utilities: GffRead and GffCompare. F1000Research [Preprint] (2020), https://f1000research.com/articles/9-304 [Accessed 2 August 2024].
113. Dainat J., AGAT: Another Gff analysis Toolkit to handle annotations in any GTF/GFF format. (2022), 10.5281/zenodo.11106497. Deposited 2022.
114. Cantalapiedra C. P., Hernández-Plaza A., Letunic I., Bork P., Huerta-Cepas J., eggNOG-mapper v2: Functional annotation, orthology assignments, and domain prediction at the metagenomic scale. Mol. Biol. Evol. 38, 5825–5829 (2021).
115. Huerta-Cepas J., et al., EggNOG 5.0: A hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Res. 47, D309–D314 (2019).
116. Yu G., Wang L.-G., Han Y., He Q.-Y., ClusterProfiler: An R package for comparing biological themes among gene clusters. OMICS J. Integr. Biol. 16, 284–287 (2012).
117. Tanaka T., et al., Oil accumulation by the oleaginous diatom Fistulifera solaris as revealed by the genome and transcriptome. Plant Cell 27, 162–176 (2015).
118. Hongo Y., et al., The genome of the diatom Chaetoceros tenuissimus carries an ancient integrated fragment of an extant virus. Sci. Rep. 11, 22877 (2021).
119. Osuna-Cruz C. M., et al., The Seminavis robusta genome provides insights into the evolutionary adaptations of benthic diatoms. Nat. Commun. 11, 3320 (2020).
120. Zepernick B. N., Truchon A. R., Gann E. R., Wilhelm S. W., Draft genome sequence of the freshwater diatom Fragilaria crotonensis SAG 28.96. Microbiol. Resour. Announc. 11, e00289-22 (2022).
121. Paajanen P., et al., Building a locally diploid genome and transcriptome of the diatom Fragilariopsis cylindrus. Sci. Data 4, 170149 (2017).
122. Armbrust E. V., et al., The genome of the diatom Thalassiosira pseudonana: Ecology, evolution, and metabolism. Science 306, 79–86 (2004).
123. Roberts W. R., Downey K. M., Ruck E. C., Traller J. C., Alverson A. J., Improved reference genome for Cyclotella cryptica CCMP332, a model for cell wall morphogenesis, salinity adaptation, and lipid production in diatoms (Bacillariophyta). G3 Genes Genomes Genetics 10, 2965–2974 (2020).
124. Ferrante M. I., Broccoli A., Montresor M., The pennate diatom Pseudo-nitzschia multistriata as a model for diatom life cycles, from the laboratory to the sea. J. Phycol. 59, 637–643 (2023).
125. Lommer M., et al., Genome and low-iron response of an oceanic diatom adapted to chronic iron limitation. Genome Biol. 13, R66 (2012).
126. Emms D. M., Kelly S., OrthoFinder: Phylogenetic orthology inference for comparative genomics. Genome Biol. 20, 238 (2019).
127. Rohwer R. R., Hamilton J. J., Newton R. J., McMahon K. D., Taxass: Leveraging a custom freshwater database achieves fine-scale taxonomic resolution. mSphere 3, e00327-18 (2018), 10.1128/msphere.00327-18.
128. Camacho C., et al., BLAST+: Architecture and applications. BMC Bioinformatics 10, 421 (2009).
129. Tsuji J., Frith M. C., Tomii K., Horton P., Mammalian NUMT insertion is non-random. Nucleic Acids Res. 40, 9073–9088 (2012).
130. Marçais G., et al., MUMmer4: A fast and versatile genome alignment system. PLoS Comput. Biol. 14, e1005944 (2018).
131. Danecek P., et al., Twelve years of SAMtools and BCFtools. Gigascience 10, giab008 (2021).
132. Ramírez F., Dündar F., Diehl S., Grüning B. A., Manke T., Deeptools: A flexible platform for exploring deep-sequencing data. Nucleic Acids Res. 42, W187–W191 (2014).
133. Quinlan A. R., Hall I. M., BEDtools: A flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
134. Rice P. M., Bleasby A. J., Ison J. C., EMBOSS User’s Guide: Practical Bioinformatics with EMBOSS (Cambridge University Press, 2011).
135. Robinson J. T., et al., Integrative genomics viewer. Nat. Biotechnol. 29, 24–26 (2011).
136. Gu Z., Gu L., Eils R., Schlesner M., Brors B., Circlize implements and enhances circular visualization in R. Bioinformatics 30, 2811–2812 (2014).
137. Zhang T., et al., The metabolite itaconate is a transcriptional and posttranslational modulator of plant metabolism, development, and stress response. Sci. Adv. 11, eadt7463 (2025).
- 138. Frail S., Data for “Genomes of nitrogen-fixing eukaryotes reveal an alternate path for organellogenesis”. Mendeley Data. https://doi.org/10.17632/rr9t3ccbc5.1. Deposited 21 July 2025.
- 139. Frail S., BioProject PRJNA1147773. National Center for Biotechnology Information. https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1147773. Deposited 18 July 2025.
- 140. Doenier J., Epithemia_assembly. GitHub. https://github.com/doenjon/Epithemia_assembly. Deposited 12 September 2024.
Associated Data
Supplementary Materials
Appendix 01 (PDF)
Dataset S01 (XLSX)
Dataset S02 (XLSX)
Dataset S03 (XLSX)
Dataset S04 (XLSX)
Dataset S05 (XLSX)
Data Availability Statement
Supplemental raw data, sequencing data files, and genome assembly and annotation data have been deposited in Mendeley Data (https://doi.org/10.17632/rr9t3ccbc5.1), NCBI (BioProject PRJNA1147773), and GitHub (https://github.com/doenjon/Epithemia_assembly) (138–140). All study data are included in the article and/or supporting information.
