De novo assembly and characterization of a highly degenerated ZW sex chromosome in the fish Megaleporinus macrocephalus

Carolina Heloisa Souza-Borges; Ricardo Utsunomia; Alessandro M Varani; Marcela Uliano-Silva; Lieschen Valeria G Lira; Arno J Butzge; John F Gomez Agudelo; Shisley Manso; Milena V Freitas; Raquel B Ariede; Vito A Mastrochirico-Filho; Carolina Penaloza; Agustín Barria; Fábio Porto-Foresti; Fausto Foresti; Ricardo Hattori; Yann Guiguen; Ross D Houston; Diogo Teruo Hashimoto

doi:10.1093/gigascience/giae085

. 2024 Nov 26;13:giae085. doi: 10.1093/gigascience/giae085

De novo assembly and characterization of a highly degenerated ZW sex chromosome in the fish Megaleporinus macrocephalus

Carolina Heloisa Souza-Borges ¹, Ricardo Utsunomia ², Alessandro M Varani ³, Marcela Uliano-Silva ⁴, Lieschen Valeria G Lira ⁵, Arno J Butzge ⁶, John F Gomez Agudelo ⁷, Shisley Manso ⁸, Milena V Freitas ⁹, Raquel B Ariede ¹⁰, Vito A Mastrochirico-Filho ¹¹, Carolina Penaloza ¹², Agustín Barria ¹³, Fábio Porto-Foresti ¹⁴, Fausto Foresti ¹⁵, Ricardo Hattori ¹⁶, Yann Guiguen ¹⁷, Ross D Houston ¹⁸, Diogo Teruo Hashimoto ^19,^✉

¹ Aquaculture Center of Unesp, São Paulo State University (Unesp), Jaboticabal, SP, 14884-900, Brazil

² School of Sciences, São Paulo State University (Unesp), Bauru, SP, 17033-360, Brazil

³ School of Agricultural and Veterinary Sciences, São Paulo State University (Unesp), Jaboticabal, SP, 14884-900, Brazil

⁴ Wellcome Sanger Institute, Cambridge CB10 1RQ, United Kingdom

⁵ Aquaculture Center of Unesp, São Paulo State University (Unesp), Jaboticabal, SP, 14884-900, Brazil

⁶ Aquaculture Center of Unesp, São Paulo State University (Unesp), Jaboticabal, SP, 14884-900, Brazil

⁷ Aquaculture Center of Unesp, São Paulo State University (Unesp), Jaboticabal, SP, 14884-900, Brazil

⁸ Aquaculture Center of Unesp, São Paulo State University (Unesp), Jaboticabal, SP, 14884-900, Brazil

⁹ Aquaculture Center of Unesp, São Paulo State University (Unesp), Jaboticabal, SP, 14884-900, Brazil

¹⁰ Aquaculture Center of Unesp, São Paulo State University (Unesp), Jaboticabal, SP, 14884-900, Brazil

¹¹ Aquaculture Center of Unesp, São Paulo State University (Unesp), Jaboticabal, SP, 14884-900, Brazil

¹² The Roslin Institute, University of Edinburgh, Easter Bush, Midlothian EH25 9RG, United Kingdom

¹³ The Roslin Institute, University of Edinburgh, Easter Bush, Midlothian EH25 9RG, United Kingdom

¹⁴ School of Sciences, São Paulo State University (Unesp), Bauru, SP, 17033-360, Brazil

¹⁵ Institute of Biosciences, São Paulo State University (Unesp), Botucatu, SP, 18618-689, Brazil

¹⁶ São Paulo Agency of Agribusiness and Technology (APTA), São Paulo, SP, 01037-010, Brazil

¹⁷ INRAE, LPGP, 35042 Rennes, France

¹⁸ The Roslin Institute, University of Edinburgh, Easter Bush, Midlothian EH25 9RG, United Kingdom

¹⁹ Aquaculture Center of Unesp, São Paulo State University (Unesp), Jaboticabal, SP, 14884-900, Brazil

^✉

Correspondence address. Diogo Teruo Hashimoto, Aquaculture Center of Unesp, São Paulo State University (Unesp), Jaboticabal, SP, 14884-900, Brazil. E-mail: diogo.hashimoto@unesp.br

Roles

Carolina Heloisa Souza-Borges: Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Visualization, Writing - original draft, Writing - review & editing

Ricardo Utsunomia: Conceptualization, Funding acquisition, Investigation, Project administration, Resources, Supervision, Writing - review & editing

Alessandro M Varani: Formal analysis, Methodology, Resources, Supervision, Writing - review & editing

Marcela Uliano-Silva: Formal analysis, Methodology, Resources, Supervision

Lieschen Valeria G Lira: Investigation

Arno J Butzge: Investigation

John F Gomez Agudelo: Investigation

Shisley Manso: Investigation

Milena V Freitas: Investigation

Raquel B Ariede: Investigation

Vito A Mastrochirico-Filho: Investigation, Methodology

Carolina Penaloza: Resources, Supervision

Agustín Barria: Resources, Supervision

Fábio Porto-Foresti: Funding acquisition, Project administration, Resources, Supervision, Writing - review & editing

Fausto Foresti: Funding acquisition, Project administration, Resources, Supervision, Writing - review & editing

Ricardo Hattori: Methodology, Resources, Supervision

Yann Guiguen: Funding acquisition, Methodology, Project administration, Supervision, Writing - review & editing

Ross D Houston: Funding acquisition, Project administration, Resources, Supervision, Writing - review & editing

PMCID: PMC11590113 PMID: 39589439

Abstract

Background

Megaleporinus macrocephalus (piauçu) is a Neotropical fish within Characoidei that presents a well-established heteromorphic ZZ/ZW sex determination system and thus constitutes a good model for studying W and Z chromosomes in fishes. We used PacBio reads and Hi-C to assemble a chromosome-level reference genome for M. macrocephalus. We generated family segregation information to construct a genetic map, pool sequencing of males and females to characterize its sex system, and RNA sequencing to highlight candidate genes of M. macrocephalus sex determination.

Results

The reference genome of M. macrocephalus is 1,282,030,339 bp in length and has a contig and scaffold N50 of 5.0 Mb and 45.03 Mb, respectively. In the sex chromosome, based on patterns of recombination suppression, coverage, F_ST, and sex-specific SNPs, we distinguished a putative W-specific region that is highly differentiated, a region where Z and W still share some similarities and is undergoing degeneration, and the PAR. The sex chromosome gene repertoire includes genes from the TGF-β family (amhr2, bmp7) and the Wnt/β-catenin pathway (wnt4, wnt7a), some of which are differentially expressed.

Conclusions

The chromosome-level genome of piauçu exhibits high quality, establishing a valuable resource for advancing research within the group. Our discoveries offer insights into the evolutionary dynamics of Z and W sex chromosomes in fish, emphasizing ongoing degenerative processes and indicating complex interactions between Z and W sequences in specific genomic regions. Notably, amhr2 and bmp7 are potential candidate genes for sex determination in M. macrocephalus.

Keywords: chromosome-level genome, sex chromosome assembly, amhr2, sex determination

Background

The family Anostomidae, native to the Neotropical region, comprises 147 recognized species across 16 genera [1], distributed from northern Colombia to the La Plata River in Argentina [2]. Among these, one of the most economically important genera is Megaleporinus, composed of relatively large species, with adults typically exceeding 35 cm in standard length [3]. Megaleporinus species were previously classified under Leporinus but were reclassified due to the presence of a unique ZZ/ZW sex chromosome system that originated at least 12 million years ago [4].

Historically, cytogenetic studies on Leporinus species have shown that most species, including Leporinus friderici, Leporinus fasciatus, and Leporinus octomaculatus, possess cytologically indistinguishable, or “homomorphic” [5], sex chromosomes [6, 7]. In contrast, some species—such as Leporinus conirostris, Leporinus macrocephalus, Leporinus obtusidens, Leporinus reinhardti, and Leporinus trifasciatus—exhibit differentiated, or “heteromorphic” [5], sex chromosomes [7–9]. This feature was proposed as a synapomorphy for a potential monophyletic group [8], later named Megaleporinus [4].

Known in Brazil as “piauçu,” Megaleporinus macrocephalus (NCBI:txid327003; marinespecies.org:taxname:1524119) is an omnivorous anostomid that consumes small fruits, seeds, small fish, and crabs [10]. This species is found in the Paraguay River [11] and exhibits sexual size dimorphism, with females being larger than males [3]. Moreover, it is the only aquaculture species in Brazil with a well-established ZZ/ZW heteromorphic sex chromosome system [12]. The W chromosome is the largest in its karyotype, with its long arms being entirely heterochromatic. In contrast, the Z chromosome is a medium-sized metacentric chromosome with small portions of heterochromatin confined to the ends of its long arms [7].

Currently, there are limited genomic resources available for piauçu in public databases. These resources consist mainly of fragments of mitochondrial and nuclear genes and microsatellite sequences used for genetic monitoring in aquaculture [13], species identification [14–16], phylogenetic analyses [15, 17, 18], and population studies [19]. To our knowledge, there is no reference genome available for this species.

Although the sex chromosome system of M. macrocephalus was identified cytologically over 40 years ago [7], the current knowledge of the Z and W chromosomes is limited to repetitive sequences, particularly from a chromosome-scale perspective, that is, there is no in-depth information about the genomic structure (nonrecombining and pseudoautosomal regions), gene contents, transcription status, and patterns of recombination. The species exhibits a significant expansion of satellite DNA (satDNA) in its genome compared to other Characoidei fish [20–22]. This expansion results from the duplication of existing satellites followed by substitution/deletion/insertion events, as well as other unknown mechanisms [12]. Through low-coverage sequencing of male and female individuals, [12] constructed a satellitome for the species and mapped sex-biased satellites using fluorescence in situ hybridization (FISH). Approximately 18% of the satDNAs had differentially accumulated within the heteromorphic sex chromosomes of M. macrocephalus, suggesting a high degree of differentiation between the Z and W chromosomes, likely due to the loss of recombination between these chromosomes [23]. As expected in a monophyletic ZZ/ZW system, they identified some satellites conserved in the W chromosomes of both M. macrocephalus and Megaleporinus obtusidens, indicating that these satellites were present in the common ancestor of these species before their evolutionary split. Moreover, there were satellites with differential accumulation or exclusive to the piauçu W chromosome, which highlights the occurrence of an independent and continuous differentiation of the W chromosomes in this genus.

While there have been disruptive advances in sequencing technologies for assembling high-quality genomes of nonmodel species, such as long-read sequencing and scaffolding techniques like Hi-C sequencing [24–26], sex chromosomes have been notoriously difficult to assemble due to their high divergence in the heterogametic sex and high repeat content [27]. Recently, several new Y chromosome assemblies have been reported in fish, such as in the zigzag eel [28], threespine stickleback [29], Atlantic herring [30], and the neo-Y chromosome of the spotted knifejaw [31]. However, there have been few reports of fish W chromosome assemblies. A decade ago, the first W chromosome assembly in fish was completed for the tongue sole, Cynoglossus semilaevis [32]. Since then, only a few well-characterized W chromosomes in fish have been made available, such as in Verasper variegatus [33].

The study of W chromosome evolution in vertebrates continues to be constrained by the scarcity of W assemblies; therefore, it is of paramount importance to sequence additional W models. M. macrocephalus is an ideal model for studying the structure and evolution of Z and W chromosomes in fish as it belongs to a rare group with conserved ZW chromosomes. Understanding the detailed genomic architecture of these chromosomes can provide insights into mechanisms of sex chromosome differentiation and degeneration in fish. Furthermore, as an important species in Brazilian aquaculture, discovering sex-related genomic resources for M. macrocephalus can aid conservation efforts and breeding programs, supporting genomic selection and female-biased sex control in aquaculture. In this study, we aimed to (i) assemble a chromosome-level genome of M. macrocephalus, including the sex chromosome; (ii) assess patterns of recombination in the sex chromosome by linkage mapping; (iii) characterize the genomic regions of the sex chromosome by resequencing of male and female individuals; and (iv) identify candidate genes for sex determination by RNA sequencing (RNA-seq) experiments.

Results

Chromosome-level genome assembly

Genome assembly

We generated 88.8 Gb of Pacific Biosciences (PacBio) continuous long reads (CLRs), 85 Gb of MGISEQ short reads, and 105 Gb of Hi-C data. The genome coverage based on final assembly size was 69.4×, 66.4×, and 82×, respectively. After removing poor-quality sequences from the short reads, we retained 82 Gb of clean data. This dataset was used to generate k-mer spectrum plots to estimate the overall characteristics of the genome. All k-mer plots were similar and indicated a low heterozygosity rate (Fig. 1A). The estimated genome size (based on 21-mer) was 1.02 Gb with a heterozygosity of 0.50% and 16.9% of repeat content.

Figure 1: — (A) A k-mer profile of MGISEQ short reads. (B) A k-mer analysis of the *Megaleporinus macrocephalus* genome bases against its sequenced MGISEQ reads. The results show that (i) the distribution of the k-mers in the assembly is consistent with the short read profile; (ii) 2 peaks are demonstrating that 1-copy (heterozygous) and 2-copy (homozygous) k-mers were found once in the assembly, as expected for a pseudo-haplotype genome [36]; (iii) most of the assembly k-mers (in red) are unique, indicating that the assembly has a low content of artificial duplications (i.e., k-mers found twice, in blue); (iv) there are missing k-mers in the assembly (black peak), which is compatible with haploid genomes; and (v) the 1-copy k-mer peak (red) is greater than its missing sequences (black), which suggests that Falcon-Unzip erroneously included sequences from both haplotypes into the primary pseudo-haplotype [36].

PacBio long reads were assembled using Falcon-Unzip [34], resulting in 2,770 primary contigs, including 33 contigs >5 Mb, with an N50 of 1.53 Mb. After gap filling, the contigs were clustered into 1,227 scaffolds, achieving an N50 of 5.0 Mb. These scaffolds were then ordered and oriented into 27 chromosomes, consistent with the species’ haploid chromosome number [35], along with 73 unplaced scaffolds (<250 kb). The 27 chromosomes accounted for 99.56% of the complete genome assembly.

The final M. macrocephalus reference genome contains 27 chromosomes and 73 unplaced scaffolds. It has a contig and scaffold N50 of 5.0 Mb and 45.03 Mb, respectively, with an assembled genome size of 1.28 Gb (Table 1).

Table 1:

Statistics for genome assembly of Megaleporinus macrocephalus.

Characteristic	Value
No. scaffolds	101
No. contigs	1,353
Main genome scaffold sequence total (bp)	1,282,030,339
Main genome contig sequence total (bp)	1,280,781,659
Scaffold N50 (bp)	45,034,219
Contig N50 (bp)	5,013,076
Max. scaffold length (bp)	73,843,892
Max. contig length (bp)	25,940,738
% main genome in scaffolds >50 kb	99.9%
BUSCO complete	96.2%
BUSCO complete and single copy	95.1%
BUSCO complete and duplicated	1.1%
BUSCO fragmented	0.6%
BUSCO missing	3.2%
Consensus quality value (QV)	37.53
Merqury completeness	93.05%

Open in a new tab

We used the highly accurate short reads to plot Merqury [36] evaluation against the genome k-mers (Fig. 1B). The accuracy of the base calls (quality value [QV]), which is calculated using the k-mers found only in the assembly (bar at the beginning of Fig. 1B), was 37.53 (Table 1) and represents a base accuracy >99.9% (e.g., QV = 30 means 99.9% accuracy). The completeness score shows that 93.05% of k-mers in the MGISEQ reads are present in the assembly, which is a good recovery of k-mers for a species with 0.5% heterozygosity.

Pearson’s correlation between the autosome’s assembled size with its actual karyotypic size (Supplementary Table S1) was 99%, demonstrating the high quality of the assembled M. macrocephalus genome.

Sex chromosome

Chromosome 13 was recognized as the sex chromosome based on the following evidence:

In the Hi-C contact map, we observed lower coverage in the upper segment of this chromosome compared to its terminal segment and other chromosomes. This upper segment is assumed to correspond to the W-specific region (hemizygous), while the terminal segment corresponds to the pseudo-autosomal region of the Z and W chromosomes (Fig. 2).
Linkage group 24 (LG24) showed collinearity with chromosome 13 (Fig. 3B). In the linkage map, LG24 exhibited suppressed recombination, with varying intensities in the female and male maps (Fig. 4).
The comprehensive examination of single-nucloetide polymorphism (SNP) distribution through resequencing analysis revealed a robust sex-linked signal in females and elevated fixation index (F_ST) values within chromosome 13 (see Fig. 5).

Figure 2: — Hi-C contact map of *Megaleporinus macrocephalus* with the sex chromosome indicated by an arrow.

Figure 3: — (A) Linkage maps of *Megaleporinus macrocephalus* with 28 linkage groups and 11,231 SNPs. Male and female maps are shown at the top and bottom, respectively, with marker density indicated by a color gradient from blue (low density) to red (high density). (B) Collinearity analysis of linkage groups and chromosomes.

Figure 4: — Position of markers in the genome (in Mb) versus their position on the genetic map (in cM) for LG24. The female map is shown at the top and the male map at the bottom. Recombination varies on the physical map but not on the genetic map, forming vertical structures of clusters known as zero recombination clusters, which are highlighted by circles.

Figure 5: — Plots of F_ST, female-specific SNPs, and male-specific SNPs across the 27 chromosomes of the *Megaleporinus macrocephalus* genome. Higher F_ST values and the presence of sex-specific SNPs in chromosome 13 confirm the ZW sex determination system.

As expected, due to the PacBio CLR approach, which results in a high error rate (∼15%) [37], it was not possible to determine the identity of the contigs (whether they are Z or W reads) belonging to the non-recombining region of the sex chromosomes. Consequently, accurate phased assembly of the sex chromosomes could not be achieved in this study. Thus, we cannot conclude whether chromosome 13 is the W chromosome or the Z chromosome. Therefore, chromosome 13 is represented as a consensus of the Z and W chromosomes and will henceforth be referred to as the sex chromosome.

Repeat annotation

Using the de novo prediction model, 2,544 new families of repeats were identified in the genome. The repeat content in M. macrocephalus accounted for 46.71% of the genome (598 Mb). Among the repeats, transposable elements (TEs) were the most common, representing 37.49% of the genome. DNA transposons were the most abundant TEs (11.82%), followed by long terminal repeats (LTRs) at 3.02%, long interspersed nuclear elements (LINEs) at 3.42%, and short interspersed nuclear elements (SINEs) at 0.33% (Supplementary Table S2). A significant portion (18.89%) of the interspersed repeats remained unclassified. Despite using the species’ satellitome [12] to identify the satellite DNAs, these repeats accounted for only 4.40% of the genome (Supplementary Table S2).

Supplementary Figure S1 shows older TE copies located on the right side of the graph, while more recent ones, which do not diverge much from the consensus TE sequence, are on the left side. Most of the interspersed repeat content found in the M. macrocephalus genome is recent (K values <25). Also, 2 bursts of transposition dominated by DNA transposon are observable.

The repeat content found in the sex chromosome was slightly higher than in the autosomes (4.24%). The total interspersed repeats and satellites classes showed the most significant differences, 2.37% and 2.24%, respectively (Table 2).

Table 2:

Comparison between the repeat content in the sex chromosome and the autosomes of the Megaleporinus macrocephalus genome

	% Sex chromosome	% Autosomes
Repeat content	50.95	46.71
Retroelements	7.86	6.78
DNA transposons	12.07	11.82
Unclassified	19.92	18.89
Total interspersed repeats	39.86	37.49
Small RNA	0.02	0.03
Satellites	6.64	4.40
Simple repeats	3.72	4.06
Low complexity	0.38	0.40

Open in a new tab

Gene prediction and annotation

For ab initio gene prediction, BRAKER1 [38] used 28.26 Gb of RNA-seq data as extrinsic evidence to predict 60,482 genes. For homology-based gene prediction, BRAKER2 [39] generated 57,574 hints and predicted genes. TSEBRA [40] combined BRAKER runs and selected 44,054 best gene predictions. Of these, 66.94% (29,490) were annotated in the Actinopterygii database of UniProtKB [41] or eggNOG [42], and 33.06% (13,525) were not annotated. We kept the annotated (29,490) and the nonannotated predicted genes with more than 150 amino acids (1,039) for the final dataset, summarizing 30,501 protein-coding predicted genes (Supplementary Table S3). The final dataset had 94.1% complete BUSCO, 89% complete and single-copy, 5.1% duplicated, 2.2% fragmented, and 3.7% missing BUSCO. For the functional annotation, we performed blast searches against the Actinopterygii database of UniProtKB [41] and eggNOG [42]. Of all the predicted genes, only 3.34% (1,018) were not annotated.

The most representative Gene Ontology (GO) terms (>15% of genes) according to the 3 domains can be seen in Supplementary Fig. S2.

Linkage map

A total amount of 1,307,500,332 raw reads were sequenced using double digest restriction site-associated DNA sequencing (ddRADseq), resulting in approximately 200 Gb of data (about 28 Gb per library). After filtering to remove low-quality sequences and reads with missing or ambiguous barcodes, an average of 11% of the reads were removed from each library, retaining 89% of the reads for analysis. Additionally, 24 individuals were excluded due to a low number of reads (<1 million). The average number of reads per sample was 4.3 million. The raw sequencing data and filtered reads for each library are detailed in Supplementary Table S4.

After mapping the ddRAD reads to the chromosome-level genome, SNP calling analysis identified 41,033 SNPs from 85,167 loci across 281 individuals. Using Plink [43], we applied the mind and geno filters, excluding 56 individuals and 8,971 SNPs. The maf filter further excluded 3,733 SNPs. Consequently, 225 individuals and 28,329 SNPs passed all quality controls (with a total genotyping rate of 0.96) and were used for linkage mapping.

A pedigree test was performed, and individuals with more than 10% Mendelian errors were removed. After calling possible missing or erroneous parental genotypes using the ParentalCall module, a total of 9,997 SNPs were grouped into linkage groups (LGs). We calculated several logarithm of odds (LOD) scores between markers and selected the best marker distribution based on the species’ karyotype characteristics. Although the haploid chromosome number for M. macrocephalus is 27, the best marker distribution was achieved using 28 LGs (with LOD 12), likely due to a specific region of the Z chromosome forming a separate linkage group (Supplementary Fig. S3). The remaining markers were assigned to existing LGs using LOD 10, which recovered 1,234 markers. A total of 18,098 markers were discarded because they did not associate with the linkage map. Within each LG, the order of markers with the best likelihood was combined to produce the final linkage map. A total of 11,231 SNPs were assigned to 28 LGs. We constructed male, female, and sex-averaged maps (average position between male and female maps) (Fig. 3).

The number of SNPs in the LGs ranged from 710 (LG1) to 203 (LG28). In the sex-averaged map, LG length varied from 143.08 cM (LG22) to 43.25 cM (LG24), with an average of 3,320.36 cM and an average distance between markers of 0.29 cM (SD = 0.12). The highest and lowest marker densities were found in LG1 and LG22, with averages of 0.18 cM and 0.61 cM, respectively (Supplementary Table S5).

Regarding sex-specific differences, the average distances between markers in male and female maps were 0.31 cM and 0.29 cM, respectively. Consequently, the male map (3,518.24 cM) was longer than the female map (3,301.97 cM). The male-to-female genetic length ratio across the entire genome was 1.07, with ratios varying from 0.54 (LG24) to 1.37 (LG12). The highest recombination density was detected in the proximal region of the centromeres (considering that this species has metacentric/submetacentric chromosomes), although some exceptions occurred in the terminal region of the LG11 (Fig. 3).

Recombination suppression within LG24

In LG24, recombination was distributed differently between the sexes (heterochiasmy). The female map for LG24 was almost double the size (87.81 cM) compared to the male map (47.37 cM). Collinearity analysis revealed a correspondence between LG24 and sex chromosome 13 (Fig. 3), particularly in regions <20 Mb and >40 Mb. Additionally, zero recombination clusters (areas of recombination suppression) were observed in LG24 for both sexes (Fig. 4), characterized by blocks of markers that vary in physical distance (bp) but remain consistent in genetic distance (cM). Furthermore, the same chromosome 13 was also associated with LG27, which corresponds to the pseudoautosomal region (PAR) (Fig. 3B). This suggests that the optimal LOD value resulted in 28 linkage groups (n = 27 chromosomes), as both LG24 and LG27 correspond to different regions of the same sex chromosome. It is important to note that a small part of LG24 also showed collinearity with autosomes (chr 25 and chr 27).

Discordance between physical and genetic mapping

To integrate the genome assembly with the linkage map data, we ordered the genome scaffolds using the linkage map as a reference. Chromonomer [44] attempts to identify and remove markers that are out of order in the genetic map when considered against their local assembly order and to identify scaffolds that have been incorrectly assembled according to the genetic map, splitting those scaffolds as necessary. This ordering grouped 1,575 map markers into 352 scaffolds. The remaining SNP loci were not used for genome anchoring because they were not aligned to the piauçu scaffolds or were markers mapping to multiple regions or loci where the orientation could not be suitably assigned. These results allowed the construction of a chromonome that clustered 75% (1,221,855,406 bp) of the initial scaffold data into 27 pseudomolecules (chromosomes) totalizing 977 Mb in length. However, 320 Mb were not anchored in pseudomolecules. LG24 anchored a low number of scaffolds, resulting in a pseudomolecule with poor scaffolding and small size (∼4 Mb). This can be explained by the region of sex conflict between the Z and W chromosomes and the consequent suppression of recombination between them.

The dot plot synteny analysis demonstrated a high degree of concordance between the chromosomes scaffolded with Hi-C data (physical mapping) and those scaffolded with the genetic map (chromonome) (Supplementary Fig. S4, illustrated by chromosomes 5, 8, and 20). Insertions and deletions were observed in all chromosomes (e.g., chromosome 2, where 25% of the initial scaffold data was missing). Additionally, structural differences were noted in some chromosomes, revealing relocations (chromosomes 1 and 21) and major inversions (chromosomes 3, 18, and 19).

Highly differentiated regions in the ZW chromosome

Whole-genome resequencing of male and female pools yielded 266,697,484 and 231,722,384 paired-end clean reads, respectively. These reads were then mapped to the female chromosome-level genome to identify genomic regions enriched for sex-biased signals, such as differences in coverage between males and females or sex-biased SNPs. The mapping rates of paired-end reads from the male and female pools were 98.74% and 97.08%, respectively, with average depths of 25× for the male pool and 24× for the female pool.

The SNP distribution analysis revealed a strong sex-linked signal in females and high F_ST values in the sex chromosome (chromosome 13) (Fig. 5). This profile supports the presence of a female heterogametic system (ZW/ZZ), as previously reported by cytogenetic data [7].

Distinct patterns in the sex chromosome

The sex chromosome, which is approximately 45 Mb in length, was divided into 3 regions according to the overall characteristics of read depth (coverage), the pattern of sex-specific SNPs, and F_ST. The first region comprises the beginning of the chromosome, from 0 to approximately 3 Mb, and is characterized by high coverage in females (2.3-fold the female average depth) and low coverage in males (0.6-fold the male average depth; Fig. 6). The absence of coverage in males was detected in some areas (depth ratio ≈ 0). In addition, the major F_ST peak (F_ST = 0.17) was located within this region (Fig. 6). The observed patterns strongly suggest the assembly of W-specific sequences in this region. This region was named a putative W-region (PWR), characterized by high differentiation, as confirmed by the absence of recombination in the linkage map (Fig. 4).

Figure 6: — The sex chromosome of *Megaleporinus macrocephalus*, divided into 3 major regions: putative W-region (PWR), from 0 to 3 Mb; chimera (CHR), from 3 to 20 Mb and 44 to 45 Mb; and pseudoautosomic region (PAR), from 20 to 44 Mb. (A) Distribution of sex-specific SNPs, with red indicating female-specific SNPs and blue indicating male-specific SNPs. (B) Male-to-female (M/F) depth ratio (absolute depth of males/absolute depth of females). A depth ratio of 1 indicates equal read coverage in both sexes; a depth ratio >1 indicates higher coverage in males and a depth ratio <1 indicates higher coverage in females. (C) Fixation index (F_ST) values, highlighting regions of high differentiation.

The second region encompasses approximately 3 to 20 Mb, reaching the opposing terminal segment of the sex chromosome, which ranges from ∼44 to 45 Mb (Fig. 6). Within this zone, 2 distinct patterns were recognized. The first, prevalent in most of the region, exhibited Z-specific characteristics, as males demonstrated at least double the coverage of females (i.e., males have 2 copies of Z, while females have 1). The other pattern was characterized by a high density of female-specific SNPs, with peaks summarizing more than 2,000 SNPs, representing allelic differences between Z and W sequences. It is important to note that the SNPs identified in the female pool may also include those specifically from the Z chromosome. However, these are likely in low quantity, as the male pool showed a low number of Z-specific SNPs, with an average of 13 SNPs per kilobase. This low quantity likely does not affect the demarcation of the region of sex conflict.

Furthermore, in the areas where W sequences were observed, males had no coverage, and females had 2.6-fold the average depth, resulting in a depth ratio of approximately 0 (Fig. 6). This observation was supported by a higher recombination frequency in males compared to females within this region of LG24 (3 to 20 Mb and 44 to 45 Mb, as illustrated in Fig. 4). This evidence indicates a certain degree of similarity between the Z and W sequences, allowing them to be scaffolded in the same region. Therefore, this locus was named “chimera” (CHR), which is undergoing degeneration, as indicated by differences between females and males in recombination suppression, coverage, F_ST, and number of SNPs.

The region comprising approximately 20 to 44 Mb was characterized by a lack of sex-specific SNPs (Fig. 6). In this genomic locus, an almost equal absolute depth between males and females was also observed (depth ratio ≈ 1; Fig. 6). This illustrates homology between the male and female sequences in this zone and could indicate normal recombination rates, as seen in pseudo-autosomal regions. Therefore, we named this region the PAR.

Differential expression between males and females

A total of 28.26 Gb of gonadal paired-end RNA-seq data was pseudo-aligned with 30,500 transcripts of M. macrocephalus (Supplementary Table S6). Approximately 99.9% (30,460) of the RNA-seq transcripts were successfully pseudo-aligned. After filtering out low counts (≤1), 27,120 transcripts remained for differential expression analysis. Principal component analysis (PCA) revealed that principal component 1 (PC1) accounted for 78% of the variance in the data. As expected, the samples clustered into 2 groups along PC1: ZZ males and ZW females, with minor intravariation observed within the male group (Fig. 7A). This clustering pattern was corroborated by the heatmap of the Euclidean distance matrix (Fig. 7B).

Figure 7: — Clustering of RNA-seq gonadal samples of *Megaleporinus macrocephalus* by transcript expression. (A) PCA plot. (B) Heatmap of the Euclidean distance matrix. (C) Volcano plot indicating downregulated (left, red) and upregulated (right, red) transcripts. Nonsignificant transcripts (Padj > .01) are shown in black.

The analysis identified 2,557 differentially expressed transcripts (Padj ≤ .01). Of these, 42.66% (1,091) were upregulated in males and 57.33% (1,466) in females (Table 3 and Fig. 7C). Most of these differentially expressed (DE) transcripts were components of the zona pellucida and were upregulated in females (Supplementary Table S7).

Table 3:

Differentially expressed genes in gonads of males (ZZ) and females (ZW) of Megaleporinus macrocephalus during sex determination. (A) Comparison of differentially expressed genes between autosomes and the sex chromosome. (B) Distribution, expression, and differential expression of genes in sex chromosome regions. Regions in the sex chromosome: putative W-region (PWR), chimera (CHR), and pseudoautosomal region (PAR).

A.
Male ZZ vs. Females ZW	Autosomes		Sex chromosome regions
	Total	Average/chr	PWR	CHR	PAR	Total
Differentially expressed^a	2,557	95	12	82	54	148
Male upregulated^b	1,091	40	5	62	26	93
Female upregulated^c	1,466	54	8	21	29	58
B.
	Size (Mb)	No. of genes	E ^d	% E ^e	% DE ^f
PWR	3	75	50	66.67	24.00
CHR	18	465	359	77.20	22.84
PAR	24	472	442	93.64	12.22

Open in a new tab

Padj ≤ .01.

LFC ≥ 1, Padj ≤ .01.

LFC ≤ −1, Padj ≤ .01.

Number of expressed genes.

Percentage of expressed genes (count >1).

Percentage of expressed genes that were differentially expressed.

Within the sex chromosome, males exhibited a higher number of upregulated and DE genes compared to the average per chromosome. The PWR had the highest concentration of DE genes. Additionally, males showed significantly more upregulation in the CHR than females, a trend also observed in both the PWR and the PAR (Table 3A).

The PWR displayed the lowest percentage of expressed genes, suggesting potential functional loss in some genes. Despite its size of only 3 Mb, 24% of its expressed genes were DE (Table 3B). In contrast, the PAR, which spans 24 Mb, had only 12.22% of DE genes. The lower expression in the PWR and CHR compared to the PAR indicates a degree of degeneration in these regions (Table 3B).

The sex chromosome harbored several genes related to sex determination and differentiation, as listed in Table 4. None of the candidate sex genes have paralogs on autosomes, simplifying their analysis and interpretation. The PAR contains the largest number of sex candidate genes, with 5 genes, followed by the CHR and the PWR, with 4 genes and 1 gene, respectively. Except for igfbp6b, all genes in the CHR exhibit a high density of female-specific SNPs (gene average density >500 SNPs), with wnt4 showing the highest density. These genes supposedly contain sequences on both the Z and W chromosomes. The Z sequences align regularly with the reference genome, whereas the W sequences, although homologous to the reference genome, display allelic differences seen as SNPs. In contrast, all candidate genes present a low density of male-specific SNPs. Igfbp6b, wnt4, sox12, foxp4, amhr2, and wnt7a are upregulated in males, while ccdc114, ccdc71, sox13, and bmp7 are upregulated in females. Notably, igfbp6b and amhr2 are differentially expressed and upregulated in males.

Table 4:

Sex determination–related genes identified on the Megaleporinus macrocephalus sex chromosome, including their location, average density of the gene female-specific SNPs (F-SNPs) and male-specific SNPs (M-SNPs), log fold change (LFC), and P-adjusted values. Regions in the sex chromosome: putative W-region (PWR), chimera (CHR), and pseudoautosomal region (PAR).

Gene ID	Description	Region	F-SNPs	M-SNPs	LFC^a	Padj
ccdc114	Coiled-coil domain containing 114	PWR	1	2	−5.50	1
igfbp6b ^b	Insulin-like growth factor binding protein	CHR	4	2	5.19	3.48e⁻¹³
ccdc71	Coiled-coil domain containing 71	CHR	519	6	−1.25	1
wnt4	Wingless-type MMTV integration site family, member 4	CHR	1,663	17	1.57	1
sox13	SRY (sex determining region Y)–box 13	CHR	1,426	1	−0.50	1
sox12	SRY (sex determining region Y)–box 12	PAR	9	10	3.45	9.80e⁻²
foxp4	Forkhead box	PAR	14	23	1.54	1
bmp7	Bone morphogenetic protein	PAR	14	15	−1.67	6.88e⁻¹
amhr2 ^b	Anti-Müllerian hormone receptor type II	PAR	1	0	3.80	1.40e⁻⁶
wnt7a	Wingless-type MMTV integration site family, member 7A	PAR	105	16	3.71	1.73e⁻¹

Open in a new tab

If LFC ≥ 1, upregulated in males; if LFC ≤ −1, upregulated in females.

Means differentially expressed genes (Padj ≤ .01).

Since the Z and W chromosomes are not phased, we were unable to perform in-depth genomic analyses of the evolutionary processes of sex chromosomes, such as the detailed structure of genes in the non-recombining region. This limitation prevents the estimation of sequence divergence estimates in coding sequences or even the gene loss rate. Additionally, the lack of genomic resources from closely related species also prevents comparative synteny analysis. The closest related species with available genomic resources belong to the family Serrasalmidae, which diverged from Anostomidae approximately 70 million years ago [45].

Discussion

Chromosome-level genome

In this study, we employed multiple genome sequencing strategies to assemble a chromosome-level reference genome for M. macrocephalus, a Neotropical fish species with heteromorphic sex chromosomes. Despite the significant morphological and ecological diversity within Characiformes [45], genomic data for this order remain scarce.

The piauçu genome demonstrated quality, contiguity metrics (contig and scaffold N50), and size comparable to other available assemblies of Neotropical fish (Supplementary Table S8). The repeat content found in the piauçu genome (46.71%) was intermediate compared to other Neotropical fish species, such as C. macropomum (52.49%) [46] and A. mexicanus (41%) [47]. DNA transposons were the most abundant type of transposable elements, accounting for 11.82% of the genome, consistent with observations in other teleost fish [48]. A substantial portion (18.89%) of the interspersed repeats remained unclassified, similar to findings in tambaqui [46] and red-bellied piranha [49] genomes, which reported 39.15% and 28.3% unclassified sequences, respectively. Overall, the repeat content in the sex chromosome was slightly higher than in the autosomes (4.24%), which is expected. In the repeat landscape (Supplementary Fig. S1), 2 bursts of transposition dominated by DNA transposons were observed, similar to what has been reported in other teleost fish, such as Nile tilapia [50].

Genome annotation identified 30,501 predicted protein-coding genes (Supplementary Table S3), which is consistent with other related Neotropical fish genomes, such as tambaqui [46], red-bellied piranha [49], and cavefish [47] (31,149, 30,575, and 25,293, respectively).

Satellite DNA

SatDNAs are composed of arrays with nearly identical repeating units. These units can range in length from single base pairs (mononucleotide repeats) to several megabases without interruption [51]. The considerable length of these arrays presents a challenge for modern sequencing, assembly, and mapping techniques, making the analysis of long fragments problematic [52]. Long reads are more capable of characterizing the variable satellite content and assembling difficult, repetitive parts of the genome [53].

Piauçu is noted for having the highest number of characterized satellites for any species so far [12]. However, even using the species’ satelitome in the repeat annotation, the amount of satDNA found in our assembly was significantly lower than expected. This suggests that:

The long reads were not capable of capturing the complete arrays of these repeats. High-identity regions, such as tandem repeats, often collapse during assembly with short or erroneous long reads [53, 54].
The pipelines used to annotate the repeats were not effective in identifying the satellite DNA arrays. According to [55], computational tools that account for the high error rates of long-read technologies are lacking. Using personalized pipelines, like those in [29, 56, 57], or tools specifically designed for satellite analysis in long reads, such as NCRF [55], tandem genotypes [58], P ACMON STR [59], TandemTools [60], and Winnowmap2 [61], could improve the satellite DNA annotation results.
SatDNAs may have been removed during the scaffolding process. Conventional Hi-C analysis often fails to account for reads that map to multiple locations, underestimating biological signals from repetitive genome regions [53, 57]. This disproportionately affects repetitive parts of the genome, such as sex chromosomes [29, 53, 57, 62].

Linkage map

In this study, we achieved a resolution similar to other linkage maps constructed for related Neotropical fish species with analogous karyotype characteristics (haploid chromosome number, morphology, and size) (Supplementary Table S9). The sex chromosome showed collinearity with 2 linkage groups, LG24 and LG27, which represent the Z-recombining region and PAR of the sex chromosome, respectively. This pattern was previously reported in the butterfly Melitaea cinxia [63], which also has a ZW sex chromosome system. LG24 consisted of markers that followed a Z chromosomal inheritance pattern, where female offspring are homozygous for one of the father’s alleles. This explains the strong heterochiasmy observed in this LG, with a higher recombination in the male map. Although LG27 was separated by the LG24 linkage pattern, it is physically merged with LG24 in the genome and corresponds to the pseudo-autosomal region of the sex chromosome, as similarly observed in M. cinxia [63].

The linkage map for male piauçu was longer than that for females, with a genetic length ratio of 1.07. Recently, the tambaqui C. macropomum was described as having a hypothetical XY sex determination system [64], despite not presenting heteromorphic sex chromosomes. In contrast, [65] found that the female linkage map in tambaqui was larger than the male map (1.55×). Differences in map length can result from variations in the number of recombination events in the 2 parents, as well as differences in the number and location of the mapped loci. It is common to find differences in recombination ratios between the sexes in most aquatic species [66–69]. Despite this being a common phenomenon, the mechanism responsible for different recombination rates between the sexes is still not well understood [70]. This explains the opposite sex-specific differences observed between piauçu and tambaqui and suggests that the heterogametic sex tends to have smaller maps due to recombination suppression [69].

The influence of the sex determination system on sex-specific recombination patterns has also been described for other fish lineages. In flatfish species such as turbot [71], Senegalese sole [72], and Atlantic halibut [69], female maps were larger (1.36, 1.32, and 1.07 times, respectively). Conversely, in Japanese flounder Paralichthys olivaceus [68] and tongue sole Cynoglossus semilaevis [70], the male maps were slightly larger (1.03 and 1.09 times, respectively).

Inconsistencies between genetic and physical mapping

Our linkage map was successfully used as a reference to anchor the genome scaffolds into a chromosome-level assembly, underscoring its high quality. The chromosome-level genome, anchored using the linkage map, showed a high correspondence with the reference genome scaffolded using Hi-C physical mapping. These results are consistent with those obtained in other chromosome-level genomes anchored with linkage maps, such as in A. mexicanus [47] and Sander lucioperca [73]. The inconsistencies revealed by structural differences, such as relocations and inversions, were also reported in the Lake Trout Salvelinus namaycush [74]. These inconsistencies will likely require further investigation using additional techniques, such as physical mapping of specific DNA sequences onto chromosome spreads with FISH.

Sex chromosome characterization

Due to the complexity of sex-determining regions (SDRs) and the reduced sequencing coverage in XY or ZW genotypes compared to autosomes, assembling sex chromosomes is significantly more challenging than assembling autosomes [75]. As a result, sex chromosomes are often the least well-assembled and annotated regions [26]. Accurate phased assembly of sex chromosomes requires an additional analysis process to separate the contigs (Z or W reads) belonging to the non-recombining region of the sex chromosomes [75]. Therefore, fish sex chromosome assemblies generally rely on a previously established genome of the homogametic sex, which contains the sex chromosome, as a reference to identify the sex-linked sequences [27]. In this case, an initial assembly is performed with long reads, and the resulting contigs are aligned to the previously established reference genome and sex chromosome. Subsequently, based on empirically established thresholds for parameters such as alignment percentage and similarity, the contigs linked to the W/Y and Z/X chromosomes are identified. For example, in the assembly of the stickleback Y chromosome [29], the putatively Y-linked contigs were identified as those that aligned only partially (<25% of the contig length) or did not align at all to the female reference genome, or those that aligned to the reference X chromosome (>25% of the contig length) but with greater sequence divergence. Conversely, the putatively X-linked contigs were identified as those that aligned to the reference X chromosome (>25% of the contig length) but with lower sequence divergence (greater than 96% similarity). After selecting the contigs linked to each sex chromosome, they can be ordered and grouped separately using scaffolding techniques [29], or the long reads associated with these contigs can be extracted and assembled separately using stricter assembly parameters [76].

An alternative approach to separately assembling the Z and W chromosomes involves leveraging information from the linkage groups. Since LG24 contains markers consistent with a Z-chromosomal inheritance pattern, where female offspring are homozygous for one of the father’s alleles, it is theoretically possible to isolate Z reads for further assembly. However, in our study, we concluded that this approach would still likely result in spurious assemblies of the Z and W chromosomes. The primary issue is the limited number of markers in LG24 (only 225, as shown in Supplementary Table S5), which is insufficient to robustly support the Z chromosome assembly. This limitation resulted in a relatively small chromosome (∼4 Mb), after scaffolding with Chromonomer [44] (Supplementary Table S5), whereas a chromosome region of approximately 17 Mb had been expected.

In the absence of a reference, a k-mer approach, as used in Verasper variegatus [33], could be applied to assemble the sex chromosomes separately. This method uses resequencing data from both males and females to identify female-specific k-mers. Once these k-mers are identified, female-specific data are filtered from the PacBio HiFi and Hi-C sequencing datasets. The HiFi data are then used to assemble candidate W sequences, which are subsequently anchored to chromosomes using the female-specific Hi-C data. While this approach has proven effective, it requires highly accurate long reads, such as HiFi data, which were unavailable for our study.

Despite significant advances in genome sequencing and assembling techniques, a major challenge persists because genome assembly algorithms are not typically optimized for sex chromosomes [75]. In this study, we faced this issue, as accurate and phased assembly of the Z and W chromosomes was not possible. Current PacBio HiFi assembly algorithms are created to phase structurally similar autosomes into distinct allelic haplotypes. However, sex chromosomes often deviate from this pattern due to their potential heteromorphy, which can involve differences in size, gene content, repeat content, and structural variation between the sex chromosome pairs [75]. For instance, the human Y chromosome has been notoriously challenging to assemble due to its complex repetitive structure [77], resulting in more than 50% of the chromosome missing from the last reference assembly [78]. Only recently, the telomere-to-telomere (T2T) consortium published the complete and gapless sequence of the human Y chromosome [79], using an innovative methodology [79–81]. Briefly, the strategy utilizes assembly string graphs constructed using PacBio HiFi reads, ONT ultra-long reads for resolution of repeats and subgraphs, trio-binned data to patch coverage gaps caused by HiFi sequencing biases, and extensive manual curation. The assembler released with the development of the T2T methodology has enabled the complete assembly of several ape X and Y chromosomes [82, 83].

In our study, we assembled a highly degenerated sex chromosome in a nonmodel species without prior genomic references, utilizing PacBio long reads and Hi-C. The absence of an initial Z chromosome reference, coupled with the use of error-prone PacBio CLR reads, prevented the accurate separation of homologous sequences of Z and W in the non-recombining region of the sex chromosome, as also observed in V. variegatus [33]. Despite the consensus sex chromosome, the integration of different approaches (recombination suppression, coverage, F_ST, and number of SNPs) allowed the identification of 3 Mb of a highly differentiated putative W-specific region, 18 Mb of a region undergoing degeneration, and 24 Mb of the PAR.

Sex chromosome gene repertoire

Genes situated in the non-recombining region—namely, ccdc114, igfbp6, sox13, and wnt4—belong to gene families involved in various developmental processes in fish, including sex differentiation and determination [84]. Despite the relevance of their gene families in sex determination/differentiation, neither ccdc114, igfbp6, nor sox13 are currently recognized to actively participate in these processes or have been identified as master sex-determining (MSD) genes. Wnt4 plays a crucial role in ovarian differentiation and development in mammals. However, the role of wnt4 in teleost fish remains unclear [85]. In tambaqui (C. macropomum), it was related to sex differentiation, either upregulated in female-like individuals or antagonized in male-like individuals [59], suggesting that it could also play a role in sex regulation and dimorphism in piauçu.

Within the recombining region, genes from the TGF-β signaling pathway, such as bmp7 and amhr2, were identified. Members of this signaling pathway have recurrently and independently emerged as MSD genes [86]. Notably, of the 20 distinct MSD genes identified so far, 13 belong to the TGF-β signaling pathway, including amh, amhr2, bmpr1b, gsdf, and gdf6.

Bone morphogenetic proteins (BMPs) are implicated in mammalian germ cell specification and gametogenesis [87]. Recently, a truncated form of a BMP type I receptor, BMPR1BB, was identified as the MSD gene in Atlantic herring [30]. While bmp7 has not been identified as a candidate sex-determining gene in any species thus far, it has been linked to sex differentiation processes in mouse embryos [88] and fish [89], requiring further studies to understand its role in sex determination in piauçu. Amhr2, the anti-Müllerian hormone receptor, has been co-opted as an MSD gene in some fish species [90–92]. Due to the relevance of the amh/amhr2 pathway in sex determination, especially in fish, we highlight amhr2 as another candidate for sex determination in piauçu. We hypothesize that a long-distance receptor located in the non-recombining region is inhibiting amhr2 transcription, directing the sex fate toward females. A similar mechanism was reported in the Amami spine rat [93].

Methods

Chromosome-level genome

Tissue samples for genome sequencing were obtained from an adult ZW female M. macrocephalus from the broodstock of the Aquaculture Center of São Paulo State University. To confirm the genotype of the individual, we performed cytogenetic analysis using the lymphocyte culture technique described by [94] with some adjustments and C-banding according to [95] (Supplementary Fig. S5).

High molecular weight (HMW) DNA was extracted from blood using the Nanobind CBB Big DNA Kit (Circulomics) to generate long reads, and a CLR library was constructed using the SMRTbell Express Template Prep Kit 2.0. The library was sequenced in 1 single-molecule real-time (SMRT) cell on the PacBio Sequel II System (RRID:SCR_017990). All these steps were performed by the Genomics & Cell Characterization Core Facility (GC3F) of the University of Oregon (USA).

To improve the accuracy of the long reads, a short read library was produced using the MGIEasy PCR-Free Library Prep Set (MGI Tech Co., Ltd.) and sequenced on BGI DNBSEQ-G400 (RRID:SCR_017980) 150-bp paired-end reads at the BGI Genomics facility. Subsequently, to merge the scaffolds into putative chromosomes, a chromatin interaction (Hi-C) library was generated using the Proximo Hi-C Library Prep Kit (Phase Genomics) with in vivo cross-linking at the Genomic Sciences Laboratory of the North Carolina State University (USA). Sequencing was performed on Illumina NovaSeq 6000 (RRID:SCR_016387) 150-bp paired-end reads.

Genome size estimate

The short reads were used to estimate the haploid genome size, rate of heterozygosity, and abundance of repetitive elements. First, the reads were trimmed with Trimommatic [96] to remove bases with an average quality of less than 20 within a sliding window of 4 bp and bases with quality less than 20 at the beginning and the end of the reads. Reads shorter than 36 bp were also discarded. After filtering, Jellyfish (RRID:SCR_005491) [97] was used to count canonical k-mers (-C flag) of lengths ranging from 21 to 24. The resulting k-mer profile was then loaded into GenomeScope (RRID:SCR_017014) [98] for analysis.