Abstract
Genome sizes are known to vary within and among closely related species, but the knowledge about genomic factors contributing to the variation and their impacts on gene functions is limited to only a small number of species. This study identified a more than 2-fold heritable genome size variation among the unicellular Zygnematophycean alga, Closterium peracerosum–strigosum–littorale (C. psl.) complex, based on short-read sequencing analysis of 22 natural strains and F1 segregation analysis. Six de novo assembled genomes revealed that genome size variation is largely attributable to genome-wide copy number variation (CNV) among strains rather than mating type-linked genomic regions or specific repeat sequences such as rDNA. Notably, about 30% of genes showed CNV even between strains that can mate with each other. Transcriptome and gene ontology analysis demonstrated that CNV is distributed nonrandomly in terms of gene functions, such that CNV was more often observed in the gene set with stage-specific expression. Furthermore, in about 30% of these genes with CNV, the expression level does not increase proportionally with the gene copy number, suggesting presence of dosage compensation, which was overrepresented in genes involved in basic biological functions, such as translation. Nonrandom patterns in gene duplications and corresponding expression changes in terms of gene functions may contribute to maintaining the high level of CNV associated with extensive genome size variation in the C. psl. complex, despite its possible detrimental effects.
Keywords: gene duplication, dosage compensation, copy number variation, green algae
Significance.
Genome sizes are variable within and among closely related species, but their genetic causes and functional consequences remain controversial. Here, we report that a more than 2-fold heritable genome size variation among the Closterium peracerosum–strigosum–littorale complex is attributable to genome-wide copy number variation (CNV) between strains and that the expression level does not increase proportionally with the gene copy number in many genes with CNV, suggesting presence of dosage compensation, which was overrepresented in genes involved in basic biological functions, such as translation. Our study illustrates the generality of dosage compensation in a wide range of organisms and its possible contribution to the maintenance of the high level of CNV associated with extensive genome size variation.
Introduction
The genomes of eukaryotes vary tremendously in size, with more than a 200,000-fold variation among species (Gregory 2005). Genome size variation is mainly because of the different abundance of various repeat sequences, such as satellite DNA, transposable elements, and ribosomal genes (Kidwell 2002; Biémont 2008; Ågren and Wright 2011), whereas other factors, including heteromorphic sex chromosomes (Dolezel and Göhde 1995; Canon et al. 2000; Hjelmen et al. 2019), polyploidy, and aneuploidy (Rosato et al. 1998; Zhang et al. 2012; Poulíčková et al. 2014), have also been reported.
Substantial genome size variation within and among closely related species is relatively well documented in plants and animals (Šmarda and Bureš 2010; Long et al. 2013; Huang et al. 2014; Sun et al. 2018). In a few study systems, the genomic factors contributing to the variation, the inheritance pattern, and potential involvement of selection have been well characterized. In Arabidopsis thaliana, genome size ranges from 161 to 184 Mb among natural strains, showing an over 10% difference, which is largely explained by copy number variation (CNV) of 45S rDNA (Long et al. 2013). In the Drosophila melanogaster Genetic Reference Panel, genome size ranges from 169.7 to 192.8 Mb, and its skewed distribution toward the accumulation of large genomes suggests a greater constraint on genome reduction than expansion (Huang et al. 2014). Huang et al. (2014) also revealed that genome size is associated with polymorphic inversions. The rotifer Brachionus asplanchnoidis displays almost a 2-fold variation within a geographic population (Stelzer et al. 2019), and several large contiguous copy number variable regions, mainly composed of tandemly repeated satellite DNA, are associated with genome size (Stelzer et al. 2021). However, despite the prevalence of genome size variation within closely related species, the knowledge about the basic mechanisms and the inheritance pattern is still limited to only a few species.
In addition, how genome size variation would generally influence gene functions and phenotypes remains largely unknown. If genome size variation is partly due to CNV of genes, such variation altering gene dosage could potentially impact the expression, gene functions, and phenotypes, as gene dosage changes are often associated with detrimental effects such as disease, slow growth, and sensitivity to stress (Torres et al. 2007; Conrad et al. 2010; Cooper et al. 2011; Dephoure et al. 2014; Rice and McLysaght 2017). The prevalence of extensive genome size variation within species and among closely related species implies the presence of compensating mechanisms that buffer gene dosage changes and mitigate the imbalance. Recent studies detected dosage compensation in response to CNV among natural or artificial strains in yeast, plants, and humans (Huettel et al. 2008; Woodwark and Bateman 2011; Hose et al. 2015; Gasch et al. 2016; Zhang et al. 2017), but how generally dosage compensation is observed remains controversial (Torres et al. 2016), and so far, most studies have focused on a limited number of species, in particular model organisms. We note that whereas the widely recognized dosage compensation of sex chromosomes in animals is mainly regulated epigenetically at the chromosome scale (Park and Kuroda 2001; Disteche 2012), gene-wise modulations of expression are also observed, which are called dosage compensation in this study as well (Guo and Birchler 1994; Hose et al. 2015; Gasch et al. 2016).
Here, we investigated the genome size variation, its characterization at the genome sequence level, the pattern of inheritance, and the functional consequences of the variation in a group of unicellular isogamous cosmopolitan algae, the Closterium peracerosum–strigosum–littorale (C. psl.) complex. The C. psl. complex belongs to Zygnematophyceae, which is closely related to land plants (Wickett et al. 2014; fig. 1a). In Closterium, the haploid cells proliferate through vegetative reproduction, whereas under stressful conditions, two cells form a dormant zygospore to reproduce sexually (fig. 1b and c). In the C. psl. complex, two modes of zygospore formation have been reported: homothallism and heterothallism (Tsuchikane and Sekimoto 2019; Ohtaka and Sekimoto 2023). The homothallic strains can form zygospores within clonal cells (fig. 1b), whereas the heterothallic strains can form zygospores between cells of two different mating types: plus (mt+) and minus (mt−; fig. 1c). In heterothallic strains, mt+ and mt− recognize each other through the sex pheromones and then form zygotes (Sekimoto et al. 1994; Nojiri et al. 1995). Heterothallic strains have been further classified into reproductively isolated mating groups. Members of the different mating groups cannot mate with each other, except in the cases where reproductive isolation between groups II-A and II-B and between groups G and I-E is incomplete and asymmetric (Watanabe and Ichimura 1978; Tsuchikane et al. 2008; Tsuchikane et al. 2018a). Quite recently, genomes of both mt+ (NIES-67) and mt− (NIES-68) and the major genetic factor responsible for mating-type determination have been reported in group I-E (Sekimoto et al. 2023).
Fig. 1.
The life cycle of the C. psl. complex. (a) A photograph of vegetative cells of the NIES-65 strain in the C. psl. complex. (b, c) Schematic illustrations of the life cycle of homothallic (b) and heterothallic (c) strains in the C. psl. complex.
In this study, we identified 2-fold genome size variation among these natural strains using a short-read-based k-mer analysis. Flow cytometry of F1 hybrids, cells germinated from zygospores obtained by crossing, revealed that genome size variation is heritable and not significantly associated with mating types. De novo assembled genomes of six strains confirmed the genome size difference and revealed that CNV among strains was most strongly associated with genome size variation. Transcriptome analysis revealed that dosage compensation occurs in about 30% of genes with CNV. Thus, the expression level does not necessarily increase proportionally with the gene copy number in those genes. We found that specific gene categories, such as those involved in translation, are enriched in the dosage-compensated genes. The C. psl. complex serves as a new model system, paving the way for studying genome size evolution and how organisms functionally mitigate the copy number changes.
Results
Phylogenetic Relationship and Genome Size Variation in the C. psl. Complex
We first obtained short-read sequence data from 22 strains of the C. psl. complex to perform a phylogenetic analysis and estimate the genome size (supplementary table S1, Supplementary Material online). As a reference for the short-read mapping, we also generated a de novo assembled sequence of the NIES-4552 strain with a total length of 438 Mb and N50 of 952 kb (supplementary table S2, Supplementary Material online). We obtained 79,377 SNPs by mapping the short reads of 22 strains and constructed a phylogenetic tree using the maximum likelihood method. The tree showed that each of the five mating groups of heterothallic strains (II-A, II-B, II-C, I-E, and G) formed a distinct clade. In comparison, homothallic strains were separated into three groups, indicating multiple transitions of mating systems between heterothallism and homothallism (fig. 2).
Fig. 2.
Phylogenetic relationship and estimated genome size variation in the 22 strains of the C. psl. complex. The tree was constructed using the maximum likelihood method based on 79,377 SNPs obtained through mapping to the NIES-4552 strain as a reference. k-mer analyses estimated the genome size. II-A, II-B, II-C, I-E, and G indicate mating groups. Triangles and circles indicate homothallic and heterothallic (mt+ and mt−) strains, respectively. The bars show the genome size of the respective strains. Mating types (mt+ and mt−) are defined for each mating group and may not be comparable across II-A, II-B, II-C, I-E, and G, as different mating groups cannot reproduce with each other.
We further exploited short-read sequences to estimate the genome size of each strain using k-mer analysis. To estimate genome size, including high copy sequences, we calculated the total number of all k-mers without an upper limit and subtracted the peaks of organelle genomes from it (see Materials and Methods for details). The estimated genome size showed a more than 2-fold variation among strains, ranging from 454 to 1,188 Mb (fig. 2 and supplementary table S3, Supplementary Material online). Notably, genome size showed 16.2% and 24.3% differences even within the heterothallic mating groups II-A and II-B, respectively. Mating types (mt+ and mt−) and genome size (P = 0.775 for the group II-A; P = 0.723 for the group II-B; Student's t-test) were not associated significantly, suggesting that genome size variation is not solely explained by structural variation in sex-linked genomic regions. This finding is consistent with a report on a limited sex-linked region in group I-E (Sekimoto et al. 2023). After excluding high copy sequences, the association between genome size variation and mating types was still nonsignificant (P = 0.827 for the group II-A; P = 0.569 for the group II-B; Student's t-test; supplementary table S3 and fig. S1, Supplementary Material online; see Materials and Methods for details).
Inheritance of Genome Size Variation
Because we found extensive genome size variation even within the same mating groups, we next investigated whether and how genome size is inherited to the next generation. We established F1 hybrids between two heterothallic strains, NIES-64 and NIES-65, and estimated genome size of 18 F1 hybrids by flow cytometry of isolated nuclei stained with propidium iodide. Each strain was measured at least three times, and we found that genome size is heritable (51.9% of total genome size variance was due to the variance between F1 lines; P = 1.36 × 10−6; ANOVA) and that the median of genome size of all F1 lines was in between those of the two parental strains (fig. 3a). We also found that genome size variation was not significantly deviated from the unimodal distribution (P = 0.566; unimodality test; Ameijeiras-Alonso et al. 2021) and not significantly associated with mating types (P = 0.173; Student's t-test; fig. 3b). When F1 lines were narrowed down to those with the minimum number of subculturing (three times, corresponding to about 50–200 generations of vegetative reproduction) to minimize the effect of vegetative reproduction within each line (supplementary fig. S2, Supplementary Material online), we still observed about 130 Mb difference (∼90% of parental difference) between F1 lines, indicating that genome size variation is inherited during the sexual process at least partially. Together with the results of k-mer analysis, these findings overall suggest that the genome size variation between NIES-64 and NIES-65 is not due to the structural variation in specific genomic fragments such as sex-linked regions but rather due to structural variation in multiple genomic regions.
Fig. 3.
Genome size variation of F1 lines estimated by flow cytometry. (a) Genome size of each F1 line. Mating types (mt− and mt−) of each line are also indicated. (b) Genome size was not significantly different between mating types (P = 0.149; t-test). Each dot shows the average genome size of each F1 line. Bars and boxes represent the median and the interquartile range, respectively. Whiskers extend to 1.5 times the interquartile range.
Genome Size Variation Is Explained Largely by Gene Duplication
To determine the factors contributing to the genome size variation, we focused on the six strains (NIES-4552, NIES-4550, NIES-64, NIES-65, NIES-53, and NIES-54) that differ greatly in genome size. We performed long-read whole-genome sequencing and de novo assembly for these six strains (supplementary table S2, Supplementary Material online). As a result, the total length of the de novo assembled sequence varied from 438 (NIES-4552) to 975 Mb (NIES-4550), which was consistent with the estimates from the k-mer analyses (correlation coefficient = 0.960; P = 0.00238; supplementary fig. S3a, Supplementary Material online). We also obtained genome size estimates based on flow cytometry (Tsuchikane et al. 2023), which were significantly correlated with both the total length of the de novo assembled sequences and the estimates from the k-mer analyses (correlation coefficients: 0.974 [P = 0.00103] and 0.879 [P = 7.28 × 10−6], respectively; supplementary fig. S3bandc, Supplementary Material online). We also compared the nuclear size between the two strains, NIES-4550 (estimated genome size: 975 Mb) and NIES-4552 (estimated genome size: 438 Mb), finding that the nuclear size of the NIES-4550 was significantly larger than that of the NIES-4552 (P = 7.34 × 10−6; Student's t-test), consistent with the genome size estimates (supplementary fig. S4, Supplementary Material online).
We then performed gene annotation for each genome to compare gene contents among the strains. Overall, the proportion of detected proteins of Benchmarking Universal Single-Copy Orthologs (BUSCO) genes of Viridiplantae was >85% in all strains (supplementary table S2, Supplementary Material online). In comparison, “complete” was <70% in NIES-53 and NIES-54 and >83% in the other four strains, showing high levels of completion of assembly and gene annotation in the four strains and moderate draft levels in NIES-53 and NIES-54 (supplementary table S2, Supplementary Material online). We identified ortholog groups (OGs) with the SonicParanoid software (Cosentino and Iwasaki 2019). We then determined the copy number of each OG in the six strains. Focusing on the 5,400 OGs shared by all strains, we found that strains with larger genome sizes tended to have more duplicated genes (P = 0.0315; Pearson's product-moment correlation coefficient; fig. 4a and b). Notably, more than half of the OGs were duplicated in the strains NIES-65 and NIES-4550. Given that the analyzed samples were obtained during the haploid phase, this result suggests the presence of aneuploidy or segmental duplications as indicated by the bimodal frequency distributions in the k-mer analysis (supplementary fig. S5, Supplementary Material online). These results further suggest that CNV largely contributes to the genome size variation among strains.
Fig. 4.
Total genome size and CNV of the de novo assembled genomes. (a) The total length of the de novo assembled sequences. (b) Copy numbers of OGs shared by the six de novo assembled genomes. The numbers in the bar indicate the copy number. We defined the copy number as the number of genes assigned to the identical OG. (c) A schematic diagram illustrating how to calculate the average copy number of OGs per contig. (d) Distribution of the average copy number of the OGs per contig of four strains with high-quality genomes.
We next investigated if other genome characteristics contribute to genome size variation: specifically, the number of small copy number genes (less than five copies), the number of high copy number genes (more than nine copies), the average length of genes, introns and intergenic regions, the average number of introns, and the proportion of intergenic regions (table 1). Because of the relatively low contiguity of the NIES-53 and NIES-54 draft genomes (N50 = 51 and 40 kb, respectively), we used the other four high-quality genomes (NIES-4552, NIES-4550, NIES-64, and NIES-65) for these analyses. Among the characteristics analyzed, the number of small copy number genes, multiple-copy genes, and tandem repeat genes was significantly correlated with genome size variation (table 1). The distribution of the average copy number of the OGs per contig showed peaks from 1.5 to 3 in all strains, suggesting the presence of duplicated genes in most contigs (fig. 4c and d). The strains NIES-4552 and NIES-64 also had peaks at an average copy number of 1, indicating the presence of contigs without duplicated genes. The dot plots of self-alignments also supported the observation that there are duplicated genes distributed across the genome (supplementary fig. S6, Supplementary Material online). The number of tandem repeat sequences, defined by duplicated genes on the same contig, correlated with genome size; however, their proportions were less than 1% in any strain and their contributions to the genome size variation were considered minor. The proportion of repeat sequences varied among strains (47.7% for NIES-4552 and 64.4% for NIES-64), but the proportion did not correlate with genome size, indicating that repeated sequences were not the primary factor contributing to the genome size variation (tables 1 and S5, Supplementary Material online). We note that the proportion of repeat sequences also did not correlate with flow cytometry-based genome size estimates (P = 0.575). Overall, these results suggest that numerous gene duplications are the main factor contributing to genome size variation. Because the distribution of duplicated genes was found to be genome-wide, the variation would have been generated through whole-genome duplications followed by subsequent deletions or numbers of segmental duplications throughout the genomes.
Table 1.
Comparison of Key Features of Four High-Quality Genomes
NIES-4552 | NIES-64 | NIES-65 | NIES-4550 | P Values | Correlation Coefficient | |
---|---|---|---|---|---|---|
Total length | 438 Mb | 703 Mb | 876 Mb | 975 Mb | — | — |
Annotated genes | 27,822 | 34,905 | 40,587 | 52,529 | 0.061 | 0.939 |
Small copy number genes (<5) | 17,026 | 22,000 | 24,321 | 29,234 | 0.028 | 0.972 |
High copy number genes (>9) | 1,253 | 2,637 | 3,770 | 2,414 | 0.281 | 0.719 |
Single-copy genes (no. of OG) | 9,615 | 12,280 | 10,654 | 9,789 | 0.939 | 0.061 |
Multiple-copy genes (no. of OG) | 3,662 | 13,363 | 18,477 | 23,631 | 0.003 | 0.997 |
Tandem repeat genes (no. of OG) | 105 | 124 | 141 | 159 | 0.019 | 0.981 |
Proportion of tandem repeat genes (no. of OG) | 0.791% | 0.484% | 0.484% | 0.476% | 0.113 | −0.887 |
Mean length of transcripts | 2,014 bp | 2,000 bp | 2,173 bp | 2,114 bp | 0.251 | 0.749 |
Proportion of repeated elements | 47.2% | 64.4% | 62.6% | 53.7% | 0.528 | 0.472 |
Mean length of introns | 548.9 bp | 488.2 bp | 485.4 bp | 554.1 bp | 0.840 | −0.160 |
No. of introns per gene | 5.7 | 5.2 | 5.0 | 5.6 | 0.615 | −0.385 |
Mean length of intergenic region | 10,237 bp | 14,306 bp | 15,935 bp | 13,070 bp | 0.300 | 0.701 |
Proportion of intergenic region | 67.3% | 77.4% | 78.7% | 72.0% | 0.440 | 0.560 |
Note.—The P values and correlation coefficient indicate the result of Pearson's product-moment correlation between the genome size and each feature.
Independent Gene Duplications and Hybridizations
We next addressed the origin of duplicated genes. We first focused on the homothallic NIES-4550 strain due to its largest genome size and the highest gene copy number among the studied strains. Short-read genomic sequences of heterothallic strains of the mating groups II-A (NIES-58 and NIES-59) and II-B (NIES-64 and NIES-65) were mapped to the assembled genome of NIES-4550. We found that the coverage depth of reads from groups II-A and II-B showed a complementary pattern: genes with high coverage of group II-A reads tended to have low coverage of group II-B reads and vice versa. This result highlights the heterogeneous nature of the genome of the NIES-4550 strain (fig. 5a).
Fig. 5.
The origins of duplicated genes found in the group II-A, the group II-B, the NIES-4550 strain, and the NIES-4552 strain. (a) The mapping depth of short reads from groups II-A (NIES-58 and NIES-59) and II-B (NIES-64 and NIES-65) to the assembled genome of NIES-4550. (b) Clarifying the origins of duplicated genes found in the group II-A, the group II-B, and the NIES-4552 strain. If duplications occurred independently in each strain, gene copies from the same strains should form a clade. The origins of duplicated genes of the 246 OGs that have two to four copies in the group II-A, the group II-B, and the NIES-4552 strain. Each group formed a clade with high bootstrap support (≥70%) in 147–216 OGs. Note that this is a schematic phylogenetic tree, and three clades are shown in a single figure for simplicity. In the actual analysis, the monophyly of three groups was investigated separately.
The ortholog-based phylogenetic approach also supported the heterogeneous nature of the NIES-4550 strain. Of the 989 OGs that were duplicated only in the NIES-4550 strain, 351 OGs showed the pattern that a gene of NIES-4550 clustered with the mating group II-A and the other gene clustered with the mating group II-B (bootstrap ≥70; supplementary table S6, Supplementary Material online). This result suggests that the genome of the NIES-4550 strain contains genes closely related to the mating groups II-A and II-B. On the other hand, the NIES-4550 genome contained 4,250 genes that are distantly related from both the mating groups II-A and II-B by evolutionary distance analyses, suggesting that at least another ancestor contributed to the NIES-4550 genome in addition to the mating groups II-A and II-B (see supplementary note, Supplementary Material online for details). We found at least 103 contigs that include genes with different presumptive origins, suggesting that recombination between sequences of different origins occurred (supplementary fig. S7a, Supplementary Material online). The physical connection between genomic regions of different presumptive origins was confirmed experimentally by PCR amplification and Sanger sequencing of recombination borders (supplementary figs. S7b, S7c, and S8, Supplementary Material online; see supplementary note, Supplementary Material online for details). In summary, the heterogeneity of the NIES-4550 genome reflects its hybrid origin with subsequent rearrangements, including recombination between genes of different origins.
We constructed a phylogenetic tree for each OG to determine the origin of the duplicated genes found in the other five strains. Assuming that the duplicated genes originated from a gene duplication in a common ancestor, genes in a strain should form separate clades containing genes from both strain. If the duplication occurred in each strain after their separation, genes from a strain should form a clade without genes from the other strain, although concerted evolution by gene conversion may also result in a similar pattern (fig. 5b). We did not assume ancestral polymorphisms or crossbreeding between strains; thus, we did not distinguish between strains belonging to the same mating groups.
Focusing on the 246 OGs that have two to four copies in all five strains other than the NIES-4550 strain, we examined whether each strain or the mating group forms a clade. By classifying the tree topologies, we found that the group II-A, the group II-B, and the NIES-4552 strain formed a clade with high bootstrap support (≥70%) in 147, 216, and 200 OG trees, respectively (fig. 5b and supplementary table S7, Supplementary Material online). Most of the OG trees (239 OGs) was found to form clades in the group II-A, the group II-B, or the NIES-4552 strain with high bootstrap support (≥70%; fig. 5b and supplementary table S7, Supplementary Material online). These results indicate that at least a part of duplicated genes found in the group II-A, the group II-B, and the NIES-4552 strain originated from recent duplications within each group.
Correlation Between Expression Level and Gene Dosage
Given that the genomes of the C. psl. complex have high variability in gene copy numbers among strains, a compensating mechanism might be buffering gene expression, thereby tolerating the copy number changes. To investigate the relationship between copy number and gene expression, we performed RNA-seq analyses in the strains NIES-64 and NIES-65 at five different stages: during vegetative reproduction at morning (L), during vegetative reproduction at night (D), under a nitrogen-depleted condition (M), and in conditioned mating-inducing medium for 8 h (C8h) and 24 h (C24h; fig. 1 and supplementary table S8, Supplementary Material online; see Materials and Methods for details).
Between the two strains, 31.6% of OGs showed copy number difference based on an orthologous grouping using the SonicParanoid software (Cosentino and Iwasaki 2019; fig. 6a). The high fraction of genes showing copy number difference was also confirmed by the DNA depth data. The DNA depth log-ratio of each OG was defined as the difference in log2-transformed read counts normalized for the genome (NIES-65 − NIES-64). We changed the criteria of the DNA depth log-ratio for the definition of CNV depending on the total DNA depth, that is, the sum of normalized log2-transformed read counts of NIES-64 and NIES-65. This is because the values of nonzero peaks, indicating CNV, differed depending on the total DNA depth (figs. 6b and S9, Supplementary Material online; see Materials and Methods for detail). Based on the criteria, we found that 36.2% of OGs showed the high DNA depth log-ratio between NIES-64 and NIES-65 (fig. 6b).
Fig. 6.
Correlations between expression levels and gene dosages. (a) Copy number difference of OGs between the NIES-64 and NIES-65 strains. We defined the copy number as the number of genes assigned to the identical OG. OGs with more than four copies are not shown. (b) Frequency of DNA depth log-ratio for each OG (see supplementary fig. S9, Supplementary Material online for other ranges of a total DNA depth). (c) An MA plot of gene expression difference between the NIES-64 and NIES-65 strains. Expression data from the stage L were used (see supplementary fig. S10, Supplementary Material online for other stages). The number of genes and color in each category, DEGs or not, and with or without CNV, is shown in the upper right table. The genes with CNV were defined based on the DNA depth log-ratio (see Materials and Methods for details). (d) The expression level and gene dosage difference between the NIES-64 and NIES-65 strains. Expression data from the stage L were used (see supplementary fig. S11, Supplementary Material online for other stages). A solid line shows the expected relationship representing proportional increases in expression difference to dosage difference, whereas a dashed line represents the simple regression. Dosage-compensated genes are shown in red.
We investigated the relationship between the expression level and the copy number between the two strains, NIES-64 and NIES-65. We found that 31.8–39.5% OGs showed significantly different expressions between the two strains (false discovery rate [FDR] <0.05). For the following analysis, the definition of copy number was based on the DNA depth log-ratio. In every stage, the presence of CNV was associated with the difference in expression levels: genes with CNV were significantly enriched in the set of differentially expressed genes (DEGs; fold enrichment >1.24, P < 2.2 × 10−16; Fisher's exact test; figs. 6c and S10, Supplementary Material online). We also found that expression level differences and gene dosage differences were positively correlated (P < 0.01, correlation coefficient >0.344; figs. 6d and S11, Supplementary Material online), but the slopes (regression coefficient) were 0.654–0.683, significantly smaller than 1 (P < 0.001); the expression level overall responded to gene dosage, but less than proportionally. In addition, the gene-specific analysis revealed that 30.6–39.7% of the OGs with CNV showed significantly smaller expression level differences than expected from an assumption that expression level is proportional to the gene dosage (FDR < 0.05; figs. 6d and S11, Supplementary Material online). We found that 6.87–13.3% genes showed significantly higher expression level difference than proportional to gene dosage (FDR < 0.05), which was defined as genes with a positive dosage response. Although the frequency of genes with a positive dosage response was generally low, the frequency was relatively high in the sexual reproductive stage, suggesting that these genes reflect mating type-specific differences in expression (7.28% in L, 6.87% in D, 9.78% in M, 11.0% in C8h, and 13.3% in C24h). Furthermore, among the genes for which the expression level difference was smaller than proportional to the gene dosage, 18.4–29.9% of genes had a significantly higher expression in the low copy number strain than in the high copy number strain (FDR < 0.05; “inverse effect” by Guo and Birchler (1994)). We defined dosage-compensated genes as genes not showing the inverse effect (FDR > 0.1) but showing significantly smaller expression level differences than proportional to gene dosage (following Zhang et al. (2017)), identifying 24.9–29.4% of genes with CNV as dosage-compensated genes.
Gene Function and Expression Level Variation
Because we found different gene expression patterns associated with gene dosage, we investigated which biological functions might be related to dosage sensitivity of gene expression and CNV between NIES-64 and NIES-65. We performed gene ontology (GO) enrichment analyses for dosage-compensated genes. We found in total 27 significantly enriched GO terms through the analyses of two or more stages (fig. 7). In the set of dosage-compensated genes, we found the enrichment of “translation” (GO:0006412), “structural constituent of ribosome” (GO:0003735), “mRNA binding” (GO:0003729), “mRNA splicing, via spliceosome” (GO:0000398), “Golgi membrane” (GO:0000139), and “trans-Golgi network” (GO:0005802; fig. 7 and supplementary tables S9–S13, Supplementary Material online). The results of enrichments suggest that the expression levels of genes involved in translation and the trans-Golgi network tended to be robust to gene dosage changes. Interestingly, we also found the enrichment of “mRNA splicing, via spliceosome” (GO:0000398), “mRNA binding” (GO:0003729), “regulation of mRNA stability” (GO:0043488), “ribosome biogenesis” (GO:0042254), and “structural constituent of ribosome” (GO:0003735) in the gene set that does not show CNV between NIES-64 and NIES-65 (supplementary table S14, Supplementary Material online). These results overall suggest that genes involved in translation are less likely to have CNV, and that even when CNV is observed, expression level difference is less than proportional to copy number difference.
Fig. 7.
GO enrichment analysis of dosage-compensated genes. Among the GO terms including more than nine genes, those with enrichment of dosage-compensated genes in at least two stages are shown (P < 0.05; Fisher's exact test with the “weight01” algorithm in the R package topGO, Alexa et al. 2006). The dot size indicates the fold enrichment. Each column shows results during each stage: during vegetative reproduction in light (L), during vegetative reproduction in dark (D), under a nitrogen-depleted condition (M), and in the mating-induced conditioned medium for 8 h (C8h) and 24 h (C24h).
We also found that the genes without CNV were overrepresented in the gene set that showed constitutive expression along five different stages (P = 3.63 × 10−14, supplementary table S15, Supplementary Material online). Conversely, genes with CNV were overrepresented in genes that were not expressed in one to four stages (P = 0.0201, supplementary table S15, Supplementary Material online), indicating that genes with stage-specific expression are more prone to change gene dosage by duplication or deletion.
Discussion
Although genome size variation within or among closely related species is common, few studies so far have addressed the genomic factors contributing to the variation and its consequences for gene function (Šmarda and Bureš 2010; Long et al. 2013; Huang et al. 2014; Sun et al. 2018). Here, we revealed that the more than 2-fold heritable genome size variation among closely related strains in the C. psl. complex was largely attributable to genome-wide CNV rather than the number of tandem repeat sequences or the proportion of repeat sequences (table 1 and figs. 2 and 4). Notably, more than 30% of OGs showed CNV even between genetically close, sympatric strains that can mate with each other, namely NIES-64 and NIES-65 (fig. 6a and b). Here, we note a few caveats for these findings. First, the reliability of these analyses depends on the quality of de novo assembly, because the genome size, the number of repeat sequences, and the fraction of CNV are all strongly influenced by the quality of de novo assembly. Nonetheless, flow cytometry-based genome size estimates showed significant correlations with both the total length of the de novo assembled sequences and the estimates from the k-mer analyses. Because the flow cytometry-based estimation is independent from sequence artifacts, this result supports the reliability of our sequence-based genome size estimates, although the contribution of repeat sequences missed in assembly is not fully ruled out (supplementary fig. S3, Supplementary Material online). Second, although we discuss the inheritance of genome size variation, variation found among natural strains should include both mutations occurred during sexual reproduction and those accumulated during vegetative reproduction, which cannot be distinguished by our approach of comparing genome sequences. Yet, differences among F1 lines were still observed when narrowing down the samples to reduce the effect of vegetative reproduction (supplementary fig. S2, Supplementary Material online). Furthermore, the flow cytometry analysis of F1 hybrids revealed that 51.9% of total genome size variance was due to the variance among F1 lines and that the median of genome size of all F1 lines was in between those of the two parental strains. In summary, although we acknowledge these caveats and limitations, independent lines of evidence from de novo assembly, k-mer-based analysis, and flow cytometry analysis consistently suggest that genome size variation is largely attributable to genome-wide CNV and is inherited during the sexual process at least partially.
CNV has been reported in several species: copy number variable regions cover 12% of the human genome (Redon et al. 2006), and 8% of the regions of the D. melanogaster genome were estimated to contain CNVs (Dopman and Hartl 2007). The fraction of OGs showing CNV found in the C. psl. complex (about 30%) is strikingly high compared with these reports. We note that this would still be underestimated as the CNV within species because here we compared a single pair of genetically close, sympatric strains. Recently, the cat flea, exhibiting a broad range of genome size variation, was also shown to have a genome with inordinate CNV (∼38% of proteins; Driscoll et al. 2020). The fraction of CNV and genome size variation appears to be highly diverse among lineages, and accumulating knowledge in a wide range of organisms will allow us to understand which factors determine those variations across lineages.
Why is such a high level of CNV harbored in the C. psl. complex, despite the possible detrimental effects of CNVs (Torres et al. 2007; International Schizophrenia Consortium 2008; Cooper et al. 2011; Dephoure et al. 2014)? Our study suggests that possible detrimental effects could be diminished in multiple ways. CNV is prevalent in the C. psl. complex, and genes showing CNV are nonrandom in terms of gene functions. Specifically, we found that 1) genes involved in basic biological functions were less likely to have CNV, 2) CNV was more often observed in the gene set with stage-specific expression, and 3) genes without CNV were overrepresented in the genes with constitutive expression. A large-scale analysis across angiosperms revealed that likelihood of gene duplications differs depending on gene functions, such that single-copy genes are enriched for “essential” genes that are required for early development and survival. By contrast, “conditional” genes that respond to biotic and abiotic stress are more prone to duplicate (Li et al. 2016). Similar gene functions and duplicability patterns are also demonstrated in fungi (Wapinski et al. 2007).
Our transcriptome analyses also revealed that the expression level does not increase proportionally with the gene copy number, suggesting the presence of a compensating mechanism that buffers gene expression (figs. 6d and S11, Supplementary Material online). Such a mechanism may tolerate the gene dosage changes, possibly allowing CNVs in the genomes. The GO analysis demonstrated that the responsiveness to gene dosage differs depending on gene function: genes involved in basic biological functions, such as translation, especially ribosomal proteins, were enriched in the dosage-compensated genes (fig. 7). Enrichment of those gene sets involved in translation was also observed in the genes subject to dosage compensation in aneuploidy strains of Saccharomyces cerevisiae (Hose et al. 2015), and dosage compensation of ribosome-related genes has also been reported in other organisms, including flies, wheat, and humans (Hose et al. 2015; Lee et al. 2016; Zhang et al. 2017; Hwang et al. 2021; Schukken and Sheltzer 2022). We note that the results of gene set enrichment analyses should be interpreted with caution because of its potential bias due to the heterogeneity of the source data (Gaudet and Dessimoz 2017). There are a few experimental supports for dosage compensation of translation-related genes. Artificial changes in the copy number of rRNA and ribosomal proteins did not alter their expression proportional to the copy number (Pearson et al. 1982; Lopez et al. 2021), possibly by the coordinate regulation of ribosomal proteins and rRNA to keep dosage balance (Warner 1999). An analysis of the hybrid genomes of Trichosporon revealed that the gene-loss rate among protein complexes involved in translation was significantly higher than that among other complexes (Sriswasdi et al. 2016). Our study suggests that dosage compensation, particularly in translation, is a common phenomenon in many organisms. In summary, gene duplication and expression changes in response to gene duplication are nonrandom in terms of gene functions, possibly contributing to the maintenance of the high level of CNV in association with extensive genome size variation in the C. psl. complex. We also note that a recent study reported a difference of chromosome number in association with genome size difference even within the same mating group, suggesting a mechanism tolerating chromosomal rearrangements during meiosis in the C. psl. complex (Tsuchikane et al. 2023). This organism could serve as a new model system, paving the way for studying the genome size evolution and how organisms functionally mitigate the potentially pernicious effects of copy number changes.
Materials and Methods
Studied Strains
We used in total 22 strains of the C. psl. complex (supplementary table S1, Supplementary Material online), namely 6 homothallic strains and 16 heterothallic strains (including 6 mt+ strains were and 10 mt− strains). These strains were obtained from the National Institute for Environmental Studies (NIES), Environment Agency (Ibaraki, Japan). Mating group I-E is denoted by IB in the NIES.
Culture and Mating Conditions
All strains were cultured in a nitrogen-supplemented medium (CA medium, Kasai et al. 2004) at 21 °C in a 16 h light/8 h dark cycle. To induce sexual reproduction, cells vegetatively grown for 10 days were washed three times with a nitrogen-depleted mating-inducing medium (MI medium, Ichimura 1971). Then, one or two strains were mixed and incubated in an MI medium (10,000 cells/ml) under 24 h continuous light.
DNA and RNA Extraction
DNA was isolated from cells vegetatively grown for 2 weeks. For NIES-53 and NIES-54, we extracted DNA using NucleoBond HMW DNA (Takara Biotechnology, Japan) and eliminated the short fragments using the Short Read Eliminator Kit (PacBio). Following Shure et al. (1983), we extracted DNA for other strains and purified it using a Blood and Cell Culture DNA mini kit (Qiagen). Finally, total RNA was extracted from multiple stages (supplementary table S8, Supplementary Material online) using the TRIzol plus RNA purification kit (Thermo Fisher Scientific).
Genome and Transcriptome Sequencing
We performed short-read genome sequencing of 21 strains using Illumina HiSeq 2500/3000 sequencing (10 × to 60 × coverage, 100/150 bp paired-end; supplementary table S16, Supplementary Material online). The library preparation and short-read sequencing were performed by BGI Japan (Kobe, Japan) or K. K. DNAFORM (Yokohama, Japan). The short-read data of the NIES-68 strain were previously published (accession number: DRR097270, Kanda et al. 2017). For the NIES-53 and NIES-54 strains, the library preparation and long-read sequencing of PromethION were performed by GeneBay (Yokohoma, Japan). The PacBio Sequel sequencing was performed for NIES-4552, NIES-4550, NIES-64, and NIES-65. In addition, 10 × genomics linked reads of NIES-64 and NIES-65 were obtained from Macrogen (Kyoto, Japan). We obtained short-read paired-end transcriptome sequences of the six strains using MGI DNBSEQ-G400, MGI MGISEQ-2000, or Illumina HiSeq 2500 (supplementary table S8, Supplementary Material online). The library preparation for transcriptome sequencing runs was performed by GENEWIZ Japan (Tokyo, Japan).
Genome Size Estimation on k-mer Analysis
We performed a k-mer analysis on short-read sequences to estimate the genome size. We first removed reads with a Phred quality score <20 and reads length <50 bases using SolexaQA++ v3.1.7.1 (Cox et al. 2010). k-mer frequency distribution with k = 25 was calculated using JELLYFISH v2.2.4 with the -h 10000000 option for each strain (Marçais and Kingsford 2011). The k-mer occurrences due to error sequences were defined as smaller than the first local minimal value (Xa). The first peak of the k-mer occurrences was Xb (supplementary fig. S12, Supplementary Material online). Therefore, we first estimated the genome size by calculating the total number of all k-mer from Xa to 20 × Xb divided by Xb. However, this method should be an underestimate because it excludes high copy (>20) sequences.
Therefore, we also calculated the genome size without the upper limit of k-mer (supplementary table S3, Supplementary Material online). Because the high copy sequences should include those from organellar genomes, we excluded them in the following way. We first estimated organellar genome size by de novo assembly of the organellar genomes from short-read sequences using GetOrganelle with the -k 21,25,55,85,115 -F embplant_mt, embplant_pt option (Jin et al. 2020). Mitochondrial and plastid genome sizes were estimated to be 98,950–193,473 and 139,599–158,119 bp, respectively. We then calculated the coverage ratio of organellar and nuclear genomes. We identified two peaks in high k-mer (>500) indicating reads from mitochondrial (Xm) and plastid (Xp) genomes and divided them by Xb.
Using the genome of the NIES-4550 strain, the relative size of Xm and Xp (Xm < Xp) was confirmed based on the following mapping. First, we concatenated the de novo assembled full genome and the assembled organellar genomes and mapped short-read sequences. Second, we calculated the depth of each position using the samtools depth with the -a -d 0 option (Li et al. 2009). Third, as the de novo assembled full genome also included contigs from organellar genomes, we searched the assembled organellar genomes against the de novo assembled full genome using BLASTN v2.0.15 with default parameters (Altschul et al. 1990) and identified the organellar contigs as the contigs with the number of identical matches >30,000 and the percentage of identical matches >99%. We then defined the coverage of organellar genomes as the sum of the average depth of these contigs and assembled organellar genomes. The coverage of mitochondrial and plastid genomes was 17,639.6 and 77,398.5, respectively, suggesting that the coverage of mitochondria is smaller than that of plastid.
Investigation of the Inheritance of Genome Size Variation Using Flow Cytometry
To establish the F1 hybrids between heterothallic strains, NIES-64 and NIES-65 strains were mixed and sexual reproduction was induced in an MI medium (1,000 cells each/200 μl) in 48-well plates. First, formed zygospores were cultured at 21 °C under 24 h continuous light for 3 weeks. Next, we left zygospores at 21 °C under dark conditions for a month to dry out fully. We then added 300 μl of a CA medium to induce germination of the zygospores. Following Tsuchikane et al. (2018b), vegetative cells from the germinated progenies were isolated and cultured in a conditioned CA medium, a sterilized CA medium in which cells were cultured for 2 weeks. For subsequent analysis, we used F1 lines that underwent two to five times of subculturing every 10 days to 2 months, which would roughly correspond to the order of 100 generations of vegetative reproduction. Note that this subculturing period was relatively short compared with those of the parental strains maintained at the NIES stock center, which were subcultured every 10 days to 4 months since 1974 (Kasai et al. 2004).
To understand how genome size variation between strains is inherited, the genome size of the F1 lines was estimated by flow cytometry. First, 50,000 cells were washed with a working solution (0.4 m sorbitol and 30 mm 2-(N-morpholino)ethanesulfonic acid), then incubated using a 100 μl enzymatic mixture of 0.5% Cellulase Onozuka R-10 and 0.5% Macerozyme R-10 dissolved in the working solution at 29 °C for 30 min. Next, samples were centrifuged at 14,000 rpm for 5 min, and the supernatant was decanted. To isolate nuclei, we added 500 μl of Cystain PI Absolute P nuclei extraction buffer (Sysmex) and filtered through a 20-µm filter (2-7209-01; AS ONE CORPORATION, Osaka, Japan).
We used Lycopersicon esculentum ‘Stupicke’ as a control (Dolezel et al. 1992). Young leaf (∼1 mg) was mixed with 150 μl of Cystain PI Absolute P nuclei extraction buffer and chopped with a razor for 3 min, then the sample was filtered through a 20-μm filter. Next, we added 1,200 μl Cystain PI Absolute P staining solution (Sysmex), which contained 12 μg RNase A to digest RNA, to the filtrate of C. psl. complex nuclei and control nuclei, and left it for 3 h at 4 °C.
Fluorescence was measured using a BD Accuri C6 flow cytometer (BD Biosciences) equipped with a 488-nm laser. Measurement was performed at least three times per strain. We obtained fluorescence intensity at FL2 = 585 ± 20 nm and signal width as the size of the particles. We analyzed the output FCS files using the R package flowCore (Hahne et al. 2009). We excluded particles with a width <40 or >80 and detected peaks as nuclear DNA content. Because the peak representing 2C was detected more prominently than 1C in all strains, we estimated the 1C genome size from the second peak. We confirmed that these estimates were strongly correlated with the observed values for the first peak, 1C (R2 = 0.912; supplementary fig. S13, Supplementary Material online). The genome size of each sample was estimated based on that of L. esculentum ‘Stupicke’ (1C = 958 Mb, Dolezel et al. 1992; supplementary table S4, Supplementary Material online). Strains measured at least three times were utilized to compare the variance within and among strains using the R package heritability (Kruijer 2016), and finally, the unimodality test was conducted with modetest in the R package multimode (Ameijeiras-Alonso et al. 2021; fig. 3). Additionally, strains with less than three measurements were included when we compared the genome size of F1 lines with the minimum number of subculturing (three times, corresponding to about 50–200 generations of vegetative reproduction), which were isolated on June 23, 2021 and measured on January 6, 2022 (supplementary fig. S2 and table S4, Supplementary Material online).
De Novo Assembly
NIES-4552 was assembled with Canu 1.5 (Koren et al. 2017) with genomeSize = 1 g and -pacbio-raw options. Around 4.5 million subreads with N50 of 17 kb totaling 44 8 Gb were used. We used ntEdit v1.3.5 (Warren et al. 2019) to polish the assembly using DNA short-read data from the NIES-4552. NIES-64, NIES-65, and NIES-4550 were assembled with Canu 1.8 (genomeSize = 1.2 g), and the errors were corrected with Arrow (PacificBiosciences). For NIES-64, 5.2 million reads with N50 of 30 kb totaling 107 Gb; for NIES-65, 8.9 million reads with N50 of 30 kb totaling 110 Gb; for NIES-4550, 5.7 million reads with N50 of 27.9 kb totaling 107 Gb of raw reads were used. For NIES-64 and NIES-65, the arrow-corrected contigs were scaffolded with 10 × genomics linked read data using ARCS (Yeo et al. 2018) after mapping with LongRanger-2.2.2 (10x Software). The reads are mapped again and used for error correction using Pilon for indel errors (Walker et al. 2014). For NIES-64, after inspecting the relationship between the mapped tag number and the reference length, contigs with more than 1 tag per 100 bp reference were retained as authentic contigs (supplementary fig. S14, Supplementary Material online). The rest may represent bacterial or other microbial contamination in the culture. NIES-4550 assembly was error corrected for indel errors with short-read mapping by BWA MEM 0.7.15-r1140 (Li and Durbin 2009) using Pilon. NIES-53 and NIES-54 were assembled with Platanus-allee v2.2.2 with short read and PromethION data (Kajitani et al. 2019). Here, the short reads were trimmed with Trimmomatic 0.39 (Bolger et al. 2014), specifying ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 before assembly, and only paired reads were used. A total of 55.4 and 59.4 Gb were retained for NIES-53 and NIES-54 after trimming, respectively. Nanopore long reads were 1 million reads with N50 of 25.6 kb totaling 14.8 Gb for NIES-53 and 0.76 million reads with N50 of 31.6 kb totaling 12.4 Gb for NIES-54.
For all six assembled genomes, we mapped their paired-end short reads to the assembled genomes of themselves using BWA MEM v.0.7.17 with default parameters (Li and Durbin 2009) and obtained depth information using DepthOfCoverage of GATK v.3.8-1-0-gf15c1c3ef with the default option (McKenna et al. 2010). We then removed the contigs where the percentage identity of nonzero coverage was smaller than 0.001.
Observation of Nuclei
Cells cultured for ∼10 days were collected by centrifugation and fixed with a solution of ethanol and acetic acid (3:1) for 20 min. The fixed cells were then washed with 1 × phosphate-buffered saline (PBS) and stained with 1 µg/ml DAPI in 1 × PBS, followed by observation under a UV fluorescence microscope. The area of nuclei in focus was measured using ImageJ (Schneider et al. 2012).
Mapping of DNA Short Reads
We obtained a set of SNPs of all strains to construct a phylogenetic tree. Paired-end short reads filtered by SolexaQA++ v3.1.7.1 (Cox et al. 2010) were mapped to the de novo assembled sequences of the NIES-4552 strain as a reference using BWA MEM v.0.7.17 with default parameters (Li and Durbin 2009). To preprocess generated SAM files for GATK v4.0.4.0 (McKenna et al. 2010), we converted SAM files to BAM files using SortSam, AddOrReplaceReadGroups, and MarkDuplicates on Picard v.2.27.1 (http://broadinstitute.github.io/picard/). VCF files were created using HaplotypeCaller of GATK v4.0.4.0. In the VCF file created by mapping to the NIES-4552 strain, information of 79,377 SNPs was obtained from all strains using the package VCFR in R (Knaus and Grünwald 2017). Heterozygous SNPs were excluded. We applied the ASC_GTRGAMMA model using RAxML v.8.2.12 with the -f a -m ASC_GTRGAMMA -asc-corr lewis -p 2 -# 100 -x 12345 option (Stamatakis 2014).
Using the same method as presented above, we also mapped paired-end short reads of NIES-58, NIES-59, NIES-64, and NIES-65 to the de novo assembled sequences of the NIES-4550 strain. We obtained depth information using DepthOfCoverage of GATK v.3.8-1-0-gf15c1c3ef with the default option. To understand the differences in depth of each strain by gene, we calculated the average depth of each gene and standardized the coverage by the average depth among the strains.
To identify CNV between NIES-64 and NIES-65 from the DNA depth log-ratio, we mapped paired-end short reads of NIES-64 and NIES-65 to the de novo assembled sequences of the NIES-65 using HISAT2 version 2.1.0 with the --no-spliced-alignment option (Kim et al. 2015). We counted the reads mapped on the exon regions using FeatureCounts v2.0.1 with the -p -t exon option (Liao et al. 2014). The read counts were scaled based on the DNA depth per million (DPM). DPM was calculated by the following way: we first divided the read counts by the length of each gene in kilobases (read per kilobase [RPK]). Then RPK was divided by the sum of RPK for all genes in a million bases. We summed up the DPM for each OG. The DNA depth log-ratio of each OG was defined as the difference in log2-transformed DPM of each OG (NIES-65 − NIES-64). We changed the criteria of the DNA depth log-ratio for CNV depending on the total DNA depth, that is, the sum of the normalized log2-transformed DPM of each OG of NIES-64 and NIES-65. This is because the values of nonzero peaks, indicating CNV, differed depending on the total DNA depth (figs. 6b and S9, Supplementary Material online). This pattern would be expected because the DNA depth ratio should differ depending on the total copy number of two strains. For example, when the total copy number of an OG is three copies (total DNA depth: ∼75), the expected DNA depth ratio should be 1:2 or 2:1, corresponding to the DNA depth log-ratio as −1 or 1. When the total copy number of an OG is 5 (total DNA depth: ∼125), the expected DNA depth ratio should be 2:3, 3:2, 1:4, or 4:1, corresponding to the DNA depth log-ratio as −0.58, 0.58, −2, or 2. Note that the mean DPM of a single-copy gene was 24.64. We considered that a gene has CNV if DNA depth log-ratio >0.5 and a total DNA depth <90, or if DNA depth log-ratio >0.36 and a total DNA depth ≥90.
Gene Prediction
First, we constructed a repeat model for each strain using RepeatModeler Version open-1.0.11 (Smit and Hubley 2008–2015) with the National Center for Biotechnology Information (NCBI) engine and identified repeat sequences using RepeatMasker version open-4.0.7 (Smit et al. 2013–2015; supplementary table S5, Supplementary Material online). Then, genome annotation was performed independently for each strain using EVidenceModeler v.1.1.1 (EVM; Haas et al. 2008). EVM created consensus gene structures from Augustus v.3.3 (Keller et al. 2011) with weight 1 and Program to Assemble Spliced Alignments (PASA) v2.4.1 with weight 20 and 10 in the NIES-4552 genome and the others, respectively (Haas et al. 2003). We first performed training for Augustus using BUSCO v4.0.2 with the −long option based on viridiplantae_odb10 (Simão et al. 2015). We then predicted gene locus using three hint files. The first hint file was generated from the RNA-seq data. After creating a database file of masked genome sequences using Bowtie2 v2.2.6 (Langmead and Salzberg 2012), we mapped raw reads from RNA-seq to the sequences using TopHat v.2.1.0 (Kim et al. 2013). The output was converted to the appropriate hint files created following Bioinformatics Greifswald Software Wiki and Forum (https://bioinf.uni-greifswald.de/bioinf/wiki/pmwiki.php?n=IncorporatingRNAseq.Tophat). The second hint file was based on the amino acid sequences of other species: A. thaliana (TAIR10_pep_20110103_representative_gene_model_updated), Physcomitrella patens (P.patens.V6_filtered_cosmoss_proteins.fas), Chara braunii (Nishiyama et al. 2018), Spirogloea muscicola, and Mesotaenium endlicherianum (Cheng et al. 2019). Then, the amino acid sequences were mapped using Exonerate (Slater and Birney 2005) v2.4.0 and converted to a hint file using exonerate2hints.pl in Augustus. The third hint file was made from RepeatMasker output following Greifswald Software Wiki and Forum (https://bioinf.uni-greifswald.de/bioinf/wiki/pmwiki.php?n=Augustus.IncorporateRepeats). On PASA, we built a comprehensive transcriptome database using two transcriptome assembly data sets. The first data set comprised genome-guided transcriptome assembly data produced using Trinity v.2.1.1 with the --genome_guided_max_intron 10000 option (Grabherr et al. 2011) based on the RNA-seq data mapped to the masked genome sequences using HISAT2 version 2.1.0 (Kim et al. 2015). The second data set comprised the de novo assembled transcriptome data produced using Trinity. We used the BEDTools complement v2.26.0 (Quinlan and Hall 2010) to obtain lists of the length of intron and intergenic regions.
Dot Plot of Self-Alignment
We performed self-alignment of the four de novo assembled genomes, NIES-4552, NIES-64, NIES-65, and NIES-4550, using D-GENIES with the “many repeats” option (access date: Jan. 2023; Cabanettes and Klopp 2018). From the output files, we deleted the results between contigs that are identical and plotted the alignment using D-GENIES (access date: January 2023).
Copy Number Count and Topology Analysis for Each OG
OGs were identified using amino acid sequences of annotated genes for the six strains. First, we used SonicParanoid with the -I 3 option (fine-grained clustering; Cosentino and Iwasaki 2019) and then identified 29,956 OGs. The output files were used to count the copy number for each OG. Next, we calculated the mean copy number and proportion of genes with CNV per contig using genes with copy numbers <10 (fig. 4).
We compared the distributions of synonymous substitution rates (Ks) for all combinations of genes among strains and within each OG. Ks of all combinations of genes within each strain were calculated using wgd v1.1, showing peaks at approximately Ks = 0.35 (Zwaenepoel and Van de Peer 2019; supplementary fig. S15a, Supplementary Material online). Maximum Ks values for all combinations within each OG were calculated using the R package seqinR (Charif and Lobry 2007). The position of the first peak of the within-OG distribution mostly corresponded to the peak of the pairwise genome-wide Ks distribution, suggesting that the OGs defined by the SonicParanoid software generally clustered orthologous genes (supplementary fig. S15b, Supplementary Material online). Therefore, we only used OGs with the maximum Ks values <1 in subsequent analyses to eliminate OGs containing paralogs that diverged too early.
To identify the origin of duplicated genes, we constructed phylogenetic trees of coding sequences (CDS) from the same OG where zero to four copies of genes were found in all strains. First, we aligned CDS using MAFFT v7.305b with default parameters (Katoh and Standley 2013). We then excluded all sites with gaps from the aligned sequences and constructed phylogenetic trees using IQ-TREE version 1.0 with the -m TIM3 + F + I -B 1000 -keep-ident option (Nguyen et al. 2015). Finally, we evaluated the bootstrap values and the topologies of phylogenetic trees using check_monophyly in the package ete3 (Huerta-Cepas et al. 2016) in Python 3.7. To identify the origin of duplicate genes in the NIES-4550 strain, we focused on the Ogs where NIES-4550 had two copies and the five other strains had a single copy, and treated the NIES-4552 strain as an outgroup (supplementary table S6, Supplementary Material online). When we identified the origin of duplicate genes in five other strains, we removed the leaf nodes from NIES-4550 from the tree and focused on the Ogs where all five strains had two to four copies (supplementary table S7, Supplementary Material online).
Amplification and Sequencing the Putative Recombination Breakpoints
To experimentally validate the putative recombination breakpoints between regions with different presumptive origins, we amplified the genomic regions encompassing the breakpoints by PCR and performed Sanger sequencing (supplementary table S17, Supplementary Material online). We chose regions in which genes with different presumptive origins were adjacent to each other within 10 kb (supplementary table S17, Supplementary Material online). PCR was performed using TaKaRa PrimeSTAR GXL DNA Polymerase following the manufacturer's protocol (Takara Biotechnology, Japan). Amplified DNA fragments were purified by illustra ExoProStar (GE Healthcare Life Sciences). PCR-amplified fragments with the sequencing primers were shipped to Eurofins Genomics K.K. (Tokyo, Japan) for Sanger sequencing (supplementary table S17, Supplementary Material online).
Gene Expression Analysis
To investigate the relationship between copy number and gene expression in the strains NIES-64 and NIES-65, we obtained RNA-seq data with three replicates from cells during five stages: during vegetative reproduction (morning), during vegetative reproduction (night), under a nitrogen-depleted condition, and in conditioned MI medium for 8 and 24 h (fig. 1; see supplementary table S8, Supplementary Material online for details). The conditioned MI medium was a mixture of NIES-64 and NIES-65 that had been cultured in MI medium for 48 h. In addition, we used cells that had been vegetatively grown for 10 days for all stages.
To preprocess paired-end raw reads, we applied fastp (Chen et al. 2018) with the -trim_poly_x option. Next, filtered reads were mapped to the de novo assembled genome sequences of NIES-65 using HISAT2 version 2.1.0 with the default option (Kim et al. 2015). We then counted reads mapped on exons using FeatureCounts v2.0.1 with the -p -t exon option (Liao et al. 2014). To compare expression levels between stages, we produced TMM-normalized data using the package edgeR in R (Robinson et al. 2010). Next, the read counts were scaled based on Transcript per million (TPM). TPM was calculated by the following way: we first divided the read counts by the length of each gene in kilobases (read per kilobase; RPK). Then RPK was divided by the sum of RPK for all genes in a million bases. Finally, we summed up the TPM-scaled read counts mapped to each OG, and then defined OGs with TPM > 5 in all five stages as genes with constitutive expression, and genes with TPM < 0.1 in at least one stage and TPM > 5 in at least one another stage as genes with conditional expression (supplementary table S15, Supplementary Material online).
We determined DEGs with FDR < 0.05 using the package edgeR in R (Robinson et al. 2010). We performed Fisher's exact tests to examine if DEGs were overrepresented in genes with CNV. To identify dosage-compensated genes, we performed the two following tests. First, to test whether the expression differences were smaller than the expected copy number differences, we performed one-sided t-tests with DNA depth ratio for the log2-transformed expression data of the OGs with CNV. We regarded that OGs with FDR < 0.05 had significantly smaller expression level differences than the expected gene dosage differences. Second, to test whether the OGs showed the inverse effect, that is, a significantly higher expression in the low copy number strain than in the high copy number strain, we performed one-sided t-tests for the log2-transformed expression data. We identified OGs with FDR > 0.1 as OGs without a negative dosage response.
GO Analysis
GO term enrichment analysis was performed to test if any GO is enriched in the various gene sets of interest. To determine the GO terms of the predicted genes, we searched the amino acid sequences of genes against the NCBI nonredundant protein database using DIAMOND BLASTP v2.0.15 with default parameters (Buchfink et al. 2021) and the UniProtKB/Swiss-Prot database using BLASTP v2.6.0 + with the -evalue 1e-5 -max_target_seqs 1 option (Altschul et al. 1990). We assigned the 7,666 GO annotations of the 22,809 genes on NIES-65. Then, GO terms enrichment analysis was conducted using topGO (Alexa et al. 2006) and GO.db v.3.12.1 (Carlson et al. 2019) in R. We used Fisher's exact test with the “weight01” algorithm in the R package topGO to reduce redundancies for the statistical significance measurements with a significance level of <0.05 (Alexa et al. 2006).
Supplementary Material
Acknowledgments
This work was supported by JSPS KAKENHI (grant numbers 15K18583, 17K15165, and 22K21352 to T.T., 16H06279 [PAGS] to T.T. and H.S., 20J20877 to Y.W.K., 19K06827 to Y.T., 15H05237 to H.S. and Y.T., 26650147 to H.S., and 16H04836 and 20K21451 to H.S. and T.N.) and by NODAI Genome Research Center. Computations were partially performed on the NIG supercomputer at ROIS National Institute of Genetics. We are grateful to Dr Hiroyuki Koga for flow cytometry analysis.
Contributor Information
Yawako W Kawaguchi, Department of Biological Sciences, Graduate School of Science, The University of Tokyo, Bunkyo-ku, Tokyo, Japan; Graduate School of Science and Engineering, Chiba University, Chiba, Chiba, Japan.
Yuki Tsuchikane, Department of Biological Sciences, Graduate School of Science, The University of Tokyo, Bunkyo-ku, Tokyo, Japan; Department of Chemical and Biological Sciences, Faculty of Science, Japan Women's University, Bunkyo-ku, Tokyo, Japan.
Keisuke Tanaka, NODAI Genome Research Center, Tokyo University of Agriculture, Setagaya-ku, Tokyo, Japan.
Teruaki Taji, Department of Bioscience, Tokyo University of Agriculture, Setagaya-ku, Tokyo, Japan.
Yutaka Suzuki, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa, Chiba, Japan.
Atsushi Toyoda, Advanced Genomics Center, National Institute of Genetics, Mishima, Shizuoka, Japan.
Motomi Ito, Graduate School of Arts and Sciences, The University of Tokyo, Meguro-ku, Tokyo, Japan.
Yasuyuki Watano, Graduate School of Science, Chiba University, Chiba, Chiba, Japan.
Tomoaki Nishiyama, Research Center for Experimental Modeling of Human Disease, Kanazawa University, Kanazawa, Ishikawa, Japan.
Hiroyuki Sekimoto, Department of Chemical and Biological Sciences, Faculty of Science, Japan Women's University, Bunkyo-ku, Tokyo, Japan.
Takashi Tsuchimatsu, Department of Biological Sciences, Graduate School of Science, The University of Tokyo, Bunkyo-ku, Tokyo, Japan.
Supplementary Material
Supplementary data are available at Genome Biology and Evolution online (http://www.gbe.oxfordjournals.org/).
Data Availability
Transcriptomic sequence data from the NIES-4550 strain have been deposited in the DNA Data Bank of Japan (DDBJ) under DRA003074. In addition, other raw genomic and transcriptomic sequence data have been deposited in the European Nucleotide Archive (ENA) under BioProject PRJEB57781.
Literature Cited
- Ågren JA, Wright SI. 2011. Co-evolution between transposable elements and their hosts: a major factor in genome size evolution? Chromosome Res. 19:777–786. [DOI] [PubMed] [Google Scholar]
- Alexa A, Rahnenführer J, Lengauer T. 2006. Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics 22:1600–1607. [DOI] [PubMed] [Google Scholar]
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. 1990. Basic local alignment search tool. J Mol Biol. 215:403–410. [DOI] [PubMed] [Google Scholar]
- Ameijeiras-Alonso J, Crujeiras RM, Rodriguez-Casal A. 2021. Multimode: an R package for mode assessment. J Stat Softw. 97:1–32. [Google Scholar]
- Biémont C. 2008. Genome size evolution: within-species variation in genome size. Heredity (Edinb). 101:297–298. [DOI] [PubMed] [Google Scholar]
- Bolger AM, Lohse M, Usadel B. 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics 30:2114–2120. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Buchfink B, Reuter K, Drost H-G. 2021. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat Methods. 18:366–368. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cabanettes F, Klopp C. 2018. D-GENIES: dot plot large genomes in an interactive, efficient and simple way. PeerJ 6:e4958. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Canon NR, Tell LA, Needham ML, Gardner IA. 2000. Flow cytometric analysis of nuclear DNA for sex identification in three psittacine species. Am J Vet Res. 61:847–850. [DOI] [PubMed] [Google Scholar]
- Carlson M, Falcon S, Pages H, Li N.. 2019. GO. db: a set of annotation maps describing the entire Gene Ontology. Available from:https://bioconductor.org/packages/GO.db/.
- Charif D, Lobry JR. 2007. Seqinr 1.0-2: a contributed package to the R project for statistical computing devoted to biological sequences retrieval and analysis. In: Bastolla U, Porto M, Roman HE, Vendruscolo M, editors. Structural approaches to sequence evolution: molecules, networks, populations. Berlin, Heidelberg, New York: Springer. p. 207–232. [Google Scholar]
- Chen S, Zhou Y, Chen Y, Gu J. 2018. . fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34:i884–i890. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cheng S, et al. 2019. Genomes of subaerial Zygnematophyceae provide insights into land plant evolution. Cell 179:1057–1067.e14. [DOI] [PubMed] [Google Scholar]
- Conrad DF, et al. 2010. Origins and functional impact of copy number variation in the human genome. Nature 464:704–712. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cooper GM, et al. 2011. A copy number variation morbidity map of developmental delay. Nat Genet. 43:838–846. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cosentino S, Iwasaki W. 2019. Sonicparanoid: fast, accurate and easy orthology inference. Bioinformatics 35:149–151. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Cox MP, Peterson DA, Biggs PJ. 2010. SolexaQA: at-a-glance quality assessment of Illumina second-generation sequencing data. BMC Bioinformatics 11:485. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Dephoure N, et al. 2014. Quantitative proteomic analysis reveals posttranslational responses to aneuploidy in yeast. Elife 3:e03023. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Disteche CM. 2012. Dosage compensation of the sex chromosomes. Annu Rev Genet. 46:537–560. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Dolezel J, Göhde W. 1995. Sex determination in dioecious plants Melandrium album and M. rubrum using high-resolution flow cytometry. Cytometry 19:103–106. [DOI] [PubMed] [Google Scholar]
- Dolezel J, Sgorbati S, Lucretti S. 1992. Comparison of three DNA fluorochromes for flow cytometric estimation of nuclear DNA content in plants. Physiol Plant. 85:625–631. [Google Scholar]
- Dopman EB, Hartl DL. 2007. A portrait of copy-number polymorphism in Drosophila melanogaster. Proc Natl Acad Sci U S A. 104:19920–19925. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Driscoll TP, et al. 2020. A chromosome-level assembly of the cat flea genome uncovers rampant gene duplication and genome size plasticity. BMC Biol. 18:70. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gasch AP, et al. 2016. Further support for aneuploidy tolerance in wild yeast and effects of dosage compensation on gene copy-number evolution. Elife 5:e14409. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gaudet P, Dessimoz C. 2017. Gene ontology: pitfalls, biases, and remedies. In: Dessimoz C, Škunca N. , editors. The gene ontology handbook. New York: (NY: ): Springer New York. p. 189–205. [DOI] [PubMed] [Google Scholar]
- Grabherr MG, et al. 2011. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 29:644–652. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Gregory TR. 2005. CHAPTER 1—genome size evolution in animals. In: Gregory TR, editor. The evolution of the genome. San Diego: Elsevier. p. 3–87. [Google Scholar]
- Guo M, Birchler JA. 1994. Trans-acting dosage effects on the expression of model gene systems in maize aneuploids. Science 266:1999–2002. [DOI] [PubMed] [Google Scholar]
- Haas BJ, et al. 2003. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31:5654–5666. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Haas BJ, et al. 2008. Automated eukaryotic gene structure annotation using EVidenceModeler and the Program to Assemble Spliced Alignments. Genome Biol. 9:R7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hahne F, et al. 2009. Flowcore: a Bioconductor package for high throughput flow cytometry. BMC Bioinformatics 10:106. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hjelmen CE, et al. 2019. Genome size evolution within and between the sexes. J Hered. 110:219–228. [DOI] [PubMed] [Google Scholar]
- Hose J, et al. 2015. Dosage compensation can buffer copy-number variation in wild yeast. Elife 4:e05462. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Huang W, et al. 2014. Natural variation in genome architecture among 205 Drosophila melanogaster genetic reference panel lines. Genome Res. 24:1193–1208. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Huerta-Cepas J, Serra F, Bork P. 2016. ETE 3: reconstruction, analysis, and visualization of phylogenomic data. Mol Biol Evol. 33:1635–1638. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Huettel B, Kreil DP, Matzke M, Matzke AJM. 2008. Effects of aneuploidy on genome structure, expression, and interphase organization in Arabidopsis thaliana. PLoS Genet. 4:e1000226. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hwang S, et al. 2021. Consequences of aneuploidy in human fibroblasts with trisomy 21. Proc Natl Acad Sci U S A. 118:e2014723118. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ichimura T. 1971. Sexual cell division and conjugation-papilla formation in sexual reproduction of Closterium strigosum. In: Nishizawa K, editor. Proceedings of the 7th International Seaweed Symposium. Tokyo: University of Tokyo Press. p. 208–214. [Google Scholar]
- International Schizophrenia Consortium . 2008. Rare chromosomal deletions and duplications increase risk of schizophrenia. Nature 455:237–241. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Jin J-J, et al. 2020. Getorganelle: a fast and versatile toolkit for accurate de novo assembly of organelle genomes. Genome Biol. 21:241. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kajitani R, et al. 2019. Platanus-allee is a de novo haplotype assembler enabling a comprehensive access to divergent heterozygous regions. Nat Commun. 10:1702. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kanda N, et al. 2017. CRISPR/Cas9-based knockouts reveal that CpRLP1 is a negative regulator of the sex pheromone PR-IP in the Closterium peracerosum-strigosum-littorale complex. Sci Rep. 7:17873. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kasai F, Kawachi M, Erata M, Watanabe MM. 2004. NIES-Collection list of strains seventh edition microalgae and protozoa. Res Rep NIES. 182:1–257. [Google Scholar]
- Katoh K, Standley DM. 2013. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol. 30:772–780. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Keller O, Kollmar M, Stanke M, Waack S. 2011. A novel hybrid gene prediction method employing protein multiple sequence alignments. Bioinformatics 27:757–763. [DOI] [PubMed] [Google Scholar]
- Kidwell MG. 2002. Transposable elements and the evolution of genome size in eukaryotes. Genetica 115:49–63. [DOI] [PubMed] [Google Scholar]
- Kim D, et al. 2013. Tophat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 14:R36. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kim D, Langmead B, Salzberg SL. 2015. HISAT: a fast spliced aligner with low memory requirements. Nat Methods. 12:357–360. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Knaus BJ, Grünwald NJ. 2017. VCFR: a package to manipulate and visualize variant call format data in R. Mol Ecol Resour. 17:44–53. [DOI] [PubMed] [Google Scholar]
- Koren S, et al. 2017. Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation. Genome Res. 27:722–736. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kruijer W. 2016. Misspecification in mixed-model-based association analysis. Genetics 202:363–366. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Langmead B, Salzberg SL. 2012. Fast gapped-read alignment with Bowtie 2. Nat Methods. 9:357–359. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lee H, et al. 2016. Effects of gene dose, chromatin, and network topology on expression in Drosophila melanogaster. PLoS Genet. 12:e1006295. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li H, et al. 2009. . The Sequence Alignment/Map format and SAMtools. Bioinformatics 25:2078–2079. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li Z, et al. 2016. Gene duplicability of core genes is highly consistent across all angiosperms. Plant Cell 28:326–344. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Li H, Durbin R. 2009. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25:1754–1760. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Liao Y, Smyth GK, Shi W. 2014. Featurecounts: an efficient general purpose program for assigning sequence reads to genomic features. Bioinformatics 30:923–930. [DOI] [PubMed] [Google Scholar]
- Long Q, et al. 2013. Massive genomic variation and strong selection in Arabidopsis thaliana lines from Sweden. Nat Genet. 45:884–890. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Lopez FB, et al. 2021. Gene dosage compensation of rRNA transcript levels in Arabidopsis thaliana lines with reduced ribosomal gene copy number. Plant Cell 33:1135–1150. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Marçais G, Kingsford C. 2011. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics 27:764–770. [DOI] [PMC free article] [PubMed] [Google Scholar]
- McKenna A, et al. 2010. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 20:1297–1303. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nguyen L-T, Schmidt HA, von Haeseler A, Minh BQ. 2015. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol Biol Evol 32:268–274. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Nishiyama T, et al. 2018. The Chara genome: secondary complexity and implications for plant terrestrialization. Cell 174:448–464.e24. [DOI] [PubMed] [Google Scholar]
- Nojiri T, Fujii T, Sekimoto H. 1995. Purification and characterization of a novel sex pheromone that induces the release of another sex pheromone during sexual reproduction of the heterothallic Closterium peracerosum-strigosum-littorale complex. Plant Cell Physiol. 36:79–84. [Google Scholar]
- Ohtaka K, Sekimoto H. 2023. Zygnematophycean algae: possible models for cellular and evolutionary biology. Semin. Cell Dev. Biol. 134:59–68. [DOI] [PubMed] [Google Scholar]
- Park Y, Kuroda MI. 2001. Epigenetic aspects of X-chromosome dosage compensation. Science 293:1083–1085. [DOI] [PubMed] [Google Scholar]
- Pearson NJ, Fried HM, Warner JR. 1982. Yeast use translational control to compensate for extra copies of a ribosomal protein gene. Cell 29:347–355. [DOI] [PubMed] [Google Scholar]
- Poulíčková A, et al. 2014. DNA Content variation and its significance in the evolution of the genus Micrasterias (Desmidiales, Streptophyta). PLoS One. 9:e86247. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Quinlan AR, Hall IM. 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26:841–842. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Redon R, et al. 2006. Global variation in copy number in the human genome. Nature 444:444–454. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rice AM, McLysaght A. 2017. Dosage-sensitive genes in evolution and disease. BMC Biol. 15:78. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Robinson MD, McCarthy DJ, Smyth GK. 2010. Edger: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics 26:139–140. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Rosato M, Chiavarino A, Naranjo C, Hernandez J, Poggio L. 1998. Genome size and numerical polymorphism for the B chromosome in races of maize (Zea mays ssp. mays, Poaceae). Am J Bot. 85:168. [PubMed] [Google Scholar]
- Schneider CA, Rasband WS, Eliceiri KW. 2012. NIH Image to ImageJ: 25 years of image analysis. Nat Methods. 9:671–675. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Schukken KM, Sheltzer JM. 2022. Extensive protein dosage compensation in aneuploid human cancers. Genome Res. 32:1254–1270. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sekimoto H, et al. 2023. A divergent RWP-RK transcription factor determines mating type in heterothallic Closterium. New Phytol. 237:1636–1651. [DOI] [PubMed] [Google Scholar]
- Sekimoto H, Sone Y, Fujii T. 1994. Regulation of expression of the genes for a sex pheromone by an inducer of the sex pheromone in the Closterium peracerosum-strigosum-littorale complex. Planta 193:137–144. [DOI] [PubMed] [Google Scholar]
- Shure M, Wessler S, Fedoroff N. 1983. Molecular identification and isolation of the Waxy locus in maize. Cell 35:225–233. [DOI] [PubMed] [Google Scholar]
- Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. 2015. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics 31:3210–3212. [DOI] [PubMed] [Google Scholar]
- Slater G, Birney E. 2005. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6:31. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Šmarda P, Bureš P. 2010. Understanding intraspecific variation in genome size in plants. Preslia 82:41–61. [Google Scholar]
- Smit AFA, Hubley R. 2008–2015. RepeatModeler Open-1.0. Available from:http://www.repeatmasker.org.
- Smit AFA, Hubley R, Green P. 2013–2015. RepeatMasker Open-4.0. Available from:http://www.repeatmasker.org.
- Sriswasdi S, et al. 2016. Global deceleration of gene evolution following recent genome hybridizations in fungi. Genome Res. 26:1081–1090. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30:1312–1313. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Stelzer C-P, Pichler M, Hatheuer A. 2021. Linking genome size variation to population phenotypic variation within the rotifer. Brachionus asplanchnoidis. Commun Biol. 4:596. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Stelzer C-P, Pichler M, Stadler P, Hatheuer A, Riss S. 2019. Within-population genome size variation is mediated by multiple genomic elements that segregate independently during meiosis. Genome Biol. Evol. 11:3424–3435. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sun H, Ding J, Piednoël M, Schneeberger K. 2018. findGSE: estimating genome size variation within human and Arabidopsis using k-mer frequencies. Bioinformatics 34:550–557. [DOI] [PubMed] [Google Scholar]
- Torres EM, et al. 2007. Effects of aneuploidy on cellular physiology and cell division in haploid yeast. Science 317:916–924. [DOI] [PubMed] [Google Scholar]
- Torres EM, Springer M, Amon A. 2016. No current evidence for widespread dosage compensation in S. cerevisiae. Elife 5:e10996. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tsuchikane Y, et al. 2018b. Identification of a new mating group and reproductive isolation in the Closterium peracerosum-strigosum-littorale complex. J Plant Res. 131:735–746. [DOI] [PubMed] [Google Scholar]
- Tsuchikane Y, et al. 2023. Diversity of genome size and chromosome number in homothallic and heterothallic strains of the Closterium peracerosum-strigosum-littorale complex (Desmidiales, Zygnematophyceae, Streptophyta). bioRxiv. Available from:https://www.biorxiv.org/content/10.1101/2023.05.01.538656v1
- Tsuchikane Y, Hamaji T, Ota K, Kato S. 2018a. Establishment of a clonal culture of unicellular conjugating algae. J Vis Exp. 137:e57761. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tsuchikane Y, Ito M, Sekimoto H. 2008. Reproductive isolation by sex pheromones in the Closterium peracerosum-strigosum-littorale complex (Zygnematales, Charophyceae). J Phycol. 44:1197–1203. [DOI] [PubMed] [Google Scholar]
- Tsuchikane Y, Sekimoto H. 2019. The genus Closterium, a new model organism to study sexual reproduction in streptophytes. New Phytol. 221:99–104. [DOI] [PubMed] [Google Scholar]
- Walker BJ, et al. 2014. Pilon: an integrated tool for comprehensive microbial variant detection and genome assembly improvement. PLoS One 9:e112963. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Wapinski I, Pfeffer A, Friedman N, Regev A. 2007. Natural history and evolutionary principles of gene duplication in fungi. Nature 449:54–61. [DOI] [PubMed] [Google Scholar]
- Warner JR. 1999. The economics of ribosome biosynthesis in yeast. Trends Biochem Sci. 24:437–440. [DOI] [PubMed] [Google Scholar]
- Warren RL, et al. 2019. Ntedit: scalable genome sequence polishing. Bioinformatics 35:4430–4432. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Watanabe MM, Ichimura T. 1978. Biosystematic studies of the Closterium peracerosum-strigosum-littorale complex III. Degrees of sexual isolation among the three population groups from the northern Kanto area. Bot Mag Tokyo. 91:11–24. [Google Scholar]
- Wickett NJ, et al. 2014. Phylotranscriptomic analysis of the origin and early diversification of land plants. Proc Natl Acad Sci USA. 111:E4859–E4868. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Woodwark C, Bateman A. 2011. The characterisation of three types of genes that overlie copy number variable regions. PLoS One 6:e14814. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Yeo S, Coombe L, Warren RL, Chu J, Birol I. 2018. ARCS: scaffolding genome drafts with linked reads. Bioinformatics 34:725–731. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zhang J, et al. 2012. Genome size variation in three Saccharum species. Euphytica 185:511–519. [Google Scholar]
- Zhang A, et al. 2017. Global analysis of gene expression in response to whole-chromosome aneuploidy in hexaploid wheat. Plant Physiol. 175:828–847. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Zwaenepoel A, Van de Peer Y. 2019. wgd—simple command line tools for the analysis of ancient whole-genome duplications. Bioinformatics 35:2153–2155. [DOI] [PMC free article] [PubMed] [Google Scholar]
Associated Data
This section collects any data citations, data availability statements, or supplementary materials included in this article.
Supplementary Materials
Data Availability Statement
Transcriptomic sequence data from the NIES-4550 strain have been deposited in the DNA Data Bank of Japan (DDBJ) under DRA003074. In addition, other raw genomic and transcriptomic sequence data have been deposited in the European Nucleotide Archive (ENA) under BioProject PRJEB57781.