Abstract
Y chromosomes of great apes harbor Ampliconic Genes (YAGs), multi-copy gene families (BPY2, CDY, DAZ, HSFY, PRY, RBMY, TSPY, VCY, and XKRY) that encode proteins important for spermatogenesis. Previous work assembled YAG transcripts based on their targeted sequencing but without using reference genome assemblies, potentially resulting in an incomplete transcript repertoire. Here we used the recently produced gapless telomere-to-telomere (T2T) Y chromosome assemblies of great ape species (bonobo, chimpanzee, human, gorilla, Bornean orangutan, and Sumatran orangutan) and analyzed RNA data from whole-testis samples for the same species. We generated hybrid transcriptome assemblies by combining targeted long reads (Pacific Biosciences), untargeted long reads (Pacific Biosciences), and untargeted short reads (Illumina) and mapping them to the T2T reference genomes. Compared to the results from the reference-free approach, the average transcript length was more than twice as high and the total number of transcripts was approximately three times lower, improving the quality of the assembled transcriptome. The reference-based transcriptome assemblies allowed us to differentiate transcripts originating from different Y chromosome gene copies and from their non-Y chromosome homologs. We identified two sources of transcriptome diversity: alternative splicing and gene duplication with subsequent diversification of gene copies. For each gene family, we detected transcribed pseudogenes along with protein-coding gene copies. We revealed previously unannotated gene copies of YAGs as compared to currently available NCBI annotations, as well as novel isoforms for annotated gene copies. This analysis paves the way for better understanding Y chromosome gene functions, which is important given their role in spermatogenesis.
Introduction
Sex chromosomes originated from autosomes independently in different taxa during evolution (Charlesworth 1996). The mammalian proto-Y chromosome gained a male-determining gene, SRY. This event resulted in the suppression of recombination with the proto-X chromosome followed by the accumulation of deleterious mutations and degradation on the Y chromosome. As a result, the Y lost most of its genes (Skaletsky et al. 2003; Waters et al. 2007) whereas the X chromosome largely preserved its original gene content (Mueller et al. 2013; Miga et al. 2020).
The human Y chromosome consists of two pseudoautosomal regions (PARs), which still recombine with the PAR regions of the X chromosome, and a large male-specific region (the MSY) (Bhowmick et al. 2007). The MSY contains three categories of regions: X-transposed (acquired through a recent duplication from the X chromosome), X-degenerate (originating from proto-sex chromosomes), and ampliconic (Skaletsky et al. 2003). The Y chromosome ampliconic genes (YAGs), which are located in the ampliconic region, are expressed in the testis, encode proteins functioning in spermatogenesis, and affect male fertility (Skaletsky et al. 2003; Ross et al. 2005). YAGs form families with high sequence identity among gene copies (Bhowmick et al. 2007). Although ampliconic genes were originally characterized as genes located exclusively on the Y chromosome, later work revealed the presence of their homologs on the X chromosome and autosomes (Vallender and Lahn 2004; Bhowmick et al. 2007). Indeed, some YAGs were shown to originate as a result of duplication of autosomal copies (CDY, DAZ) or of proto-X/Y gene pairs (HSFY, VCY, XKRY, RBMY, and TSPY). BPY2 (also known as VCY2) originated as a result of the transposition of a non-functional homolog BEYLA on chromosome 8 to the ampliconic region of the Y chromosome (Cao et al. 2015). The origin of PRY remains unclear (Bhowmick et al. 2007).
In human, YAGs include nine gene families (Table 1): BPY2, CDY, DAZ, HSFY, PRY, RBMY, TSPY, VCY, and XKRY (Skaletsky et al. 2003). Except for TSPY, whose copies are organized in tandem repeats, human YAGs are located in palindromes, facilitating gene conversion (Skaletsky et al. 2003). YAG copies positioned at symmetrical arms of the same palindrome are nearly identical (Hallast et al. 2013). However, copies located at nonsymmetrical locations within one palindrome, as well as copies located at different palindromes, vary in sequence identity (Bhowmick et al. 2007). Copy numbers of YAGs vary substantially among species and even among individuals of the same species (Vegesna et al. 2019, 2020). Some YAG families, for example, PRY, underwent pseudogenization (Tomaszkiewicz et al. 2023).
Table 1.
The nine known Y chromosome ampliconic gene (YAG) families and their functions
| Ampliconic gene family | Full name | Function | Protein length in amino acids (species; accession number) |
|---|---|---|---|
| BPY2 (VCY2) | Testis-Specific Basic Protein, Y-linked | Unknown; potentially involved in the cytoskeletal network (Wong et al. 2004); potentially involved in male germ-cells development and male infertility (Lahn and Page, 1997) | 106 (human; NP_004669.2) |
| CDY | Testis-Specific Chromodomain Protein, Y-linked | Histone acetyltransferase (Lahn et al. 2002), regulates gene expression and post meiotic nuclear remodeling (Caron et al. 2003; Navarro-Costa et al. 2007) | 541 (Gorilla; JAC06559), 540 (Human; NP_001003894), 470 (Human; XP_054184366), 541 (Chimpanzee; NP_001233424) |
| DAZ | Deleted in Azoospermia | Translational activator involved in spermatogonia differentiation (Schrans-Stassen et al. 2001), onset and progression of meiosis (Lin et al. 2008) through mRNA transportation and enhancement of translation (Collier et al. 2005); potentially regulates mRNA stability (Takeda et al. 2009). More potential functions are summarized in (VanGompel and Xu 2011) | 390 (Human; NP_001375425.1), 744 (Human; NP_004072.3), 366 (Human; AAB02393.1) |
| HSFY | Heat Shock Transcription Factor, Y-Linked | Unknown, reviewed in (Kovács et al. 2022); demonstrated transcription factor activity (Gaudet et al. 2011) and possibly regulates gametogenesis (Tessari et al. 2004; Widlak and Vydra 2017) | 401, 203, and 213 (human; Tessari et al., 2004) |
| PRY | Testis-Specific PTP-BL-Related Y; PTPN13 Like, Y-Linked | Possible role in germ cell apoptosis (Stouffs et al. 2004) | 147 (Sumatran orangutan; AJA38019.1), 143 (Gorilla; AJA38011.1), 147 (Human; NP_004667.2), 68 (Chimpanzee; PNI88546.1) |
| RBMY | RNA Binding Motif Protein, Y-Linked | Regulates nuclear splicing in the germline by regulation of expression of splicing factors (Dreumont et al. 2010) | 356 (Human; NP_001307874.1), 459 (Human; NP_001307873.1), 496 (Human; NP_005049.1) |
| TSPY | Testis Specific Protein, Y-Linked | Regulation of spermatogonial cell replication and renewal; possible role in spermatogenesis as a catalyst/meiotic factor (Lau et al. 2011) | 262 (Gorilla; JAC06569.1), 308 (Human; NP_003299.2), 294 (Human; NP_001184171.1), 221 (Human; NP_001307893.1) |
| VCY | Variable Charge, Y-Linked | Regulation of ribosome assembly in the male germline through modulation of expression of a regulator of ribosomal assembly (Zou et al. 2003) | 125 (human; Lahn & Page, 2000) |
| XKRY | XK Related, Y-Linked | Multipass membrane transport protein involved in fertilization process (Lahn and Page 1997; Mahanta et al. 2012) | 159 (Human; AAC51844.1) |
The telomere-to-telomere (T2T) assembly of the human Y chromosome, generated with the use of long-read sequencing technologies (Pacific Biosciences, PacBio, and Oxford Nanopore Technologies), has made it possible to resolve its complete repetitive structure, including YAG family copies (Rhie et al. 2023). The T2T Y and X chromosome assemblies of five non-human great ape species (bonobo, chimpanzee, gorilla, Sumatran orangutan, and Bornean orangutan) were recently generated with the same approaches (Makova et al. 2024), and their YAG families have been completely deciphered. There is currently an urgent need to characterize the function of genes on the Y chromosome of great apes, as many of them are important for reproduction. Studying the transcriptome of YAGs is an important step in this direction.
Technological advances in sequencing technologies, the invention of long-read sequencing in particular, revealed the tremendous diversity of the transcriptome (Ren et al. 2023), which translates into proteome diversity (Park et al. 2018; Soto et al. 2019). Transcriptome diversity originates from alternative splicing and gene duplication with subsequent accumulation of mutations in duplicate copies (Goldtzvik et al. 2023). In the human genome, >90% of genes undergo alternative splicing (Wang et al. 2008). The contribution of gene duplication to isoform diversification is particularly important to decipher for YAGs due to their high copy number. Long sequencing reads, frequently spanning full-length transcripts, are essential to capture the differences between transcripts originating from highly similar copies within each YAG family.
Here, we used the recently generated T2T assemblies of the Y chromosomes of great ape species (Rhie et al. 2023; Makova et al. 2024) to study transcriptome diversity of their YAGs. We used testis samples of six great ape species (human, bonobo, chimpanzee, gorilla, Bornean orangutan, and Sumatran orangutan) and assembled their transcriptomes using data generated with three different approaches: (1) targeted (YAG-specific) PacBio IsoSeq, in which individual YAG families were captured with sequence-specific probes, (2) untargeted (i.e. genome-wide) PacBio IsoSeq, and (3) untargeted Illumina short-read sequencing. The combination of long reads with high-quality short reads and T2T genome assemblies has allowed us to characterize YAG transcript isoform diversity in great apes in detail. In particular, we determined whether the observed diversity of individual transcripts is explained by alternative splicing or by differences between gene copies. The results of this study provide important information about the evolution of the MSY of great apes, all of which are endangered species (except for human). They also represent a useful resource for functional studies aiming to characterize cellular functions of YAGs at the isoform level, which remain unknown to date.
Results
Combination of sequencing approaches yields a comprehensive list of YAG transcripts
To characterize transcript isoform diversity of YAGs in great apes, we combined targeted and untargeted long-read PacBio IsoSeq sequencing approaches with the untargeted short-read Illumina sequencing of RNA isolated from testis samples of six great ape species—bonobo, chimpanzee, gorilla, Bornean orangutan, Sumatran orangutan, and human (for human only targeted IsoSeq data were analyzed). Targeted PacBio IsoSeq data were available for two technical replicates (Tomaszkiewicz et al. 2023). For our analysis, we took advantage of the T2T assemblies available for great ape Y chromosomes (Rhie et al. 2023; Makova et al. 2024). Briefly, we performed reference-based assemblies of transcripts using StringTie (Shumate et al. 2022) and used short-read sequencing data to perform hybrid transcriptome assemblies (StringTie mix mode) to improve the quality of assembled transcripts (Shumate et al. 2022). For each non-human great ape species, we obtained three sets of YAG transcripts: those resulting from technical replicate 1 for targeted data, from technical replicate 2 for targeted data, and from untargeted data. For human, we obtained two sets of YAG transcripts—using technical replicates 1 and 2 of targeted data.
The transcripts resulting from two technical replicates of targeted data were in good agreement with each other (Fig. S1). The number of identical transcripts between replicates (across all gene families) varied from 59 for gorilla to 104 for bonobo. Compared to the targeted approach, the untargeted approach captured fewer YAG transcripts, likely due to the low level of expression of YAGs (Vegesna et al. 2019). For example, for bonobo, we assembled 140 and 138 YAG transcripts using the targeted approach (for technical replicate 1 and technical replicate 2, respectively) and only 31 YAG transcripts using the untargeted approach (Fig. S1). Only a small fraction of transcripts assembled using the untargeted approach were absent from those assembled using the targeted approach. For example, in bonobo, among 31 transcripts assembled from untargeted IsoSeq reads, 30 were also found in transcripts assembled from targeted IsoSeq reads. However, for other species, untargeted IsoSeq data still captured some YAG transcripts that were not present in the targeted IsoSeq YAG transcript set. For downstream analysis, we merged the three transcript sets (targeted technical replicate 1, targeted technical replicate 2, and untargeted) to create a non-redundant comprehensive set of YAG transcripts for each species. The YAG transcript assemblies of great ape species are presented as additional files 1–6. A transcript was considered replicate-supported if it was present in at least two of the three transcript sets.
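The replicate-support rule above (presence in at least two of the three transcript sets) can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used in the study: the transcript identifiers are hypothetical, and the sketch assumes that identical transcripts from different sets have already been given the same identifier (how identity is established between assemblies is not described here).

```python
from collections import Counter


def replicate_supported(transcript_sets, min_sets=2):
    """Return transcripts present in at least `min_sets` of the given sets."""
    counts = Counter()
    for transcripts in transcript_sets:
        # set() guarantees each transcript is counted once per set
        counts.update(set(transcripts))
    return {t for t, n in counts.items() if n >= min_sets}


# Hypothetical identifiers; identical transcripts share an identifier (assumption)
targeted_rep1 = {"tx_a", "tx_b", "tx_c"}
targeted_rep2 = {"tx_a", "tx_c", "tx_d"}
untargeted = {"tx_c", "tx_e"}

# Non-redundant merged set and its replicate-supported subset
merged = targeted_rep1 | targeted_rep2 | untargeted
supported = replicate_supported([targeted_rep1, targeted_rep2, untargeted])
```

In this toy example, the merged set holds five transcripts, of which two (`tx_a` and `tx_c`) are replicate-supported.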
Identification of novel YAG copies and isoforms
We performed YAG transcript structure classification against the available NCBI reference gene annotations with SQANTI3 (Pardo-Palacios et al. 2023). Among the assembled YAG transcripts, the majority originated from annotated gene copies represented by “full splice match” (FSM; similar to a transcript in the reference annotation) and “incomplete splice match” (ISM; similar to the reference, but containing fewer 5’ exons) categories (Fig. S2). For all species, the largest proportion of transcripts was classified as FSM, suggesting a good agreement of assembled transcripts with reference annotations. However, we identified novel isoforms of annotated genes for all species. New isoforms were classified into two categories: “novel in catalog” (NIC; transcripts with splice junctions between annotated donor/acceptor sites) and “novel not in catalog” (NNIC; transcripts with at least one novel donor or acceptor site). For all species, there were more NNIC than NIC transcripts.
Besides new transcripts, SQANTI3 identified potentially new gene copies for all species (Fig. S2). These included both coding and non-coding copies (transcribed pseudogenes) and suggested under-annotation of the reference assemblies. We refined the discovery of novel genes by SQANTI3 by strictly excluding any overlap with annotated genes (see “Annotation of transcripts” in the Methods) and confirmed the presence of novel gene copies in the reference assemblies. Our estimates of novel gene copy numbers (Fig. 1A) were lower than those identified by SQANTI3 (category “Intergenic” in Fig. S2), suggesting that SQANTI3 overestimates the number of new copies. We present numbers for both replicate-supported and not replicate-supported datasets (Fig. 1A). Although replicate-supported transcript assemblies serve as more reliable evidence of novel gene copies, we found several cases when transcripts supported by a single technical replicate indicated the presence of additional gene copies (Fig. 1B). Thus, data not supported by replicates could also be useful to improve existing annotations and identify novel gene copies.
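The idea of excluding any overlap with annotated genes can be illustrated with a simple interval check. The sketch below is an assumption-laden illustration rather than the procedure described in the Methods: coordinates are treated as closed intervals, strand is ignored, and all names and coordinates are hypothetical.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals share at least one base."""
    return a_start <= b_end and b_start <= a_end


def novel_copy_candidates(transcripts, annotated_genes):
    """Keep transcripts that overlap no annotated gene on the same chromosome.

    transcripts: iterable of (name, chrom, start, end)
    annotated_genes: iterable of (chrom, start, end)
    Strand is ignored here, which is an assumption of this sketch.
    """
    novel = []
    for name, chrom, start, end in transcripts:
        hit = any(
            chrom == g_chrom and overlaps(start, end, g_start, g_end)
            for g_chrom, g_start, g_end in annotated_genes
        )
        if not hit:
            novel.append(name)
    return novel


# Hypothetical coordinates for illustration only
genes = [("chrY", 1000, 5000), ("chrY", 9000, 12000)]
txs = [
    ("tx1", "chrY", 4500, 6000),    # overlaps the first gene
    ("tx2", "chrY", 6000, 8000),    # intergenic
    ("tx3", "chrX", 1000, 2000),    # different chromosome, no annotated gene there
]
candidates = novel_copy_candidates(txs, genes)
```

A transcript that touches an annotated gene by even one base is discarded, which is why such a filter yields fewer candidates than a category based only on splice-junction comparison.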
Figure 1.
Previously unannotated YAG copies identified from mapping the assembled YAG transcripts to the reference T2T genomes of great apes. A. Potential new gene copies with at least one replicate-supported transcript mapped (transparency differs between replicate-supported and not replicate-supported new gene copies). B. An example of a potential new BPY2 copy with seven transcripts (all not replicate-supported) mapping to it.
We next examined whether distinct isoforms originate from different gene copies or are produced as a result of alternative splicing from one gene copy (Fig. S3). We limited our analysis to (potential) protein-coding copies. Our results suggest that the majority of such copies encode multiple protein isoforms (237 and 143 gene copies encoding multiple and single isoforms across all species, respectively; Fig. S3). Many gene copies produced only two or three isoforms. However, for some gene families, e.g., DAZ and BPY2, we identified gene copies producing more than ten isoforms (Fig. S3). For example, a single DAZ gene copy in bonobo produced 11 different isoforms (Fig. S4). Gene families with a low abundance of protein-coding transcripts (VCY and XKRY) had gene copies each producing only one isoform.
X chromosome and autosomal homologs of YAGs in great apes
We identified some transcripts originating from homologs of YAGs on the X chromosome and/or on the autosomes (Fig. 2B). VCY had transcripts originating from X chromosomal homologs in all species and additionally from autosomal homologs in chimpanzee and gorilla. HSFY had transcripts originating from autosomal homologs in all species and X chromosomal homologs in gorilla and bonobo. We identified transcripts originating from autosomal homologs of BPY2 in bonobo and orangutans; XKRY from bonobo and chimpanzee; PRY from orangutans. We detected RBMY transcripts originating from autosomal copies in gorilla and chimpanzee and X chromosomal homologs in bonobo. PRY had transcripts originating from an autosomal homolog in both species of orangutan and in gorilla. We did not detect any CDY, DAZ, and TSPY transcripts that originated from non-Y chromosome copies.
Figure 2.
A. Overview of the presence of genes and transcribed pseudogenes for nine Y chromosome ampliconic gene families in great ape species based on the identified replicate-supported transcripts. A gene family in a species was marked as “protein-coding” if at least one replicate-supported transcript was identified and as “transcribed pseudogene” if no replicate-supported transcripts with high-confidence cORFs were found (but transcripts without open reading frames were identified). B. Overview of the presence of transcripts originating from X-chromosome and autosome homologs for nine ampliconic gene families in great ape species. These data were derived from the transcriptome assembly of targeted IsoSeq data only.
Transcripts vary in numbers and length across YAG families and species
In total, we identified 1,266 unique transcripts mapping to the Y chromosome across all species and YAG families (Table S2), among which 445 were replicate-supported (Table S3). On average, 211 (74 replicate-supported) transcripts were assembled per species with a minimum of 132 for chimpanzee (61 replicate-supported) and a maximum of 307 for human (93 replicate-supported). The numbers and the lengths of transcripts differed substantially among species and YAG families (Table S2 and Fig. S5). Across the data set, we found an abundance of RBMY (374 transcripts in total, 159 replicate-supported) and TSPY (230 transcripts in total, 77 replicate-supported) transcripts. In contrast, VCY had only 9 transcripts (all replicate-supported) across all species. The mean length of the assembled transcripts across all gene families and species was 1,937 bp (standard error, SE = 41). The longest assembled transcript was 9,838 bp and belonged to the HSFY gene family from Bornean orangutan. The shortest transcript was 202 bp and originated from a copy of the XKRY gene from Bornean orangutan. When we averaged transcript lengths per family (and considered all species together), PRY had the longest transcripts (mean length = 3,416 bp, SE = 268, range = 633‒8,894) and XKRY the shortest (mean length = 512 bp, SE = 34, range = 202‒896).
Coding potential of assembled YAG transcripts
To characterize the coding potential of assembled transcripts, we first used the threshold of a minimum of 50 amino acids (aa) to consider an ORF a complete ORF (cORF). Following Tomaszkiewicz et al. (2023), we next identified high-confidence cORFs as any cORFs that had significant homology to the known sequences of great ape YAG proteins (see “Identification of high-confidence complete open reading frames” in Methods). We used BLASTP to align all cORFs found in a transcript to the known protein sequences of great ape YAG proteins from the NCBI protein database (Fig. S6). For the majority of alignments, coverage was above 80% and sequence identity was above 70%. The alignments of DAZ, RBMY, TSPY, and CDY transcripts demonstrated a large variation in sequence identity (60–100%; Fig. S6A). Across all species and gene families, we identified 1,063 transcripts with high-confidence cORFs (mean length = 267 aa, SE = 6, range = 50‒951; Table 2, Fig. 3A), among which 445 (42%) were replicate-supported (mean length = 315 aa, SE = 9, range = 50‒833; Table S4). We identified an average of 177 transcripts (74 replicate-supported) with high-confidence cORFs per species. The majority of high-confidence cORFs originated from RBMY, DAZ, TSPY, and CDY gene families. VCY and XKRY had the smallest number of protein-coding transcripts.
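The first step, the 50-aa minimum length for a cORF, can be sketched as follows. This is a minimal illustration and not the ORF caller used in the study: it assumes that a complete ORF runs from an ATG to the first in-frame stop codon, that only the forward strand of the transcript is scanned, and that length is counted in amino acids excluding the stop codon. The homology step (BLASTP against known YAG proteins) is not shown.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def complete_orfs(seq, min_aa=50):
    """Return (start, end, aa_length) for forward-strand ORFs.

    An ORF is counted as complete if it starts with ATG and ends at the
    first in-frame stop codon; aa_length excludes the stop codon.
    `end` is the 0-based exclusive coordinate just past the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens the ORF in this frame
            elif start is not None and codon in STOP_CODONS:
                aa_len = (i - start) // 3
                if aa_len >= min_aa:
                    orfs.append((start, i + 3, aa_len))
                start = None  # look for the next ORF downstream
    return orfs


# Toy sequences: Met plus 49 Ala (50 aa) passes; Met plus 48 Ala (49 aa) does not
long_enough = "ATG" + "GCT" * 49 + "TAA"
too_short = "ATG" + "GCT" * 48 + "TAA"
```

Calling `complete_orfs(long_enough)` returns a single ORF of 50 aa, whereas `complete_orfs(too_short)` returns an empty list.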
Table 2.
The descriptive statistics for predicted YAG proteins across great ape species.
| Species | YAG family | N | Species total N | Mean length, aa | S.E. of length | Max length, aa | Min length, aa |
|---|---|---|---|---|---|---|---|
| Bonobo | CDY | 4 | 162 | 554.50 | 13.50 | 595 | 541 |
| | DAZ | 23 | | 420.00 | 17.10 | 611 | 318 |
| | HSFY | 1 | | 130.00 | NA | 130 | 130 |
| | PRY | 8 | | 71.00 | 0.76 | 73 | 69 |
| | RBMY | 88 | | 392.18 | 16.61 | 497 | 87 |
| | TSPY | 36 | | 203.06 | 11.38 | 264 | 76 |
| | VCY | 2 | | 212.00 | 16.00 | 228 | 196 |
| Bornean orangutan | CDY | 47 | 146 | 193.47 | 25.41 | 601 | 56 |
| | DAZ | 34 | | 542.71 | 40.92 | 951 | 128 |
| | HSFY | 12 | | 284.42 | 35.85 | 404 | 88 |
| | RBMY | 30 | | 304.07 | 24.44 | 497 | 66 |
| | TSPY | 19 | | 175.63 | 15.58 | 264 | 79 |
| | XKRY | 4 | | 86.50 | 17.91 | 140 | 64 |
| Chimpanzee | BPY2 | 3 | 119 | 95.00 | 11.00 | 106 | 73 |
| | CDY | 9 | | 559.00 | 9.00 | 595 | 541 |
| | DAZ | 35 | | 477.57 | 18.07 | 720 | 342 |
| | HSFY | 9 | | 108.67 | 7.21 | 134 | 78 |
| | PRY | 5 | | 71.40 | 0.98 | 73 | 69 |
| | RBMY | 41 | | 372.22 | 22.60 | 497 | 87 |
| | TSPY | 14 | | 200.57 | 20.72 | 264 | 84 |
| | VCY | 2 | | 124.00 | 0.00 | 124 | 124 |
| | XKRY | 1 | | 70.00 | NA | 70 | 70 |
| Gorilla | BPY2 | 28 | 185 | 76.46 | 2.78 | 104 | 68 |
| | CDY | 22 | | 237.36 | 37.79 | 557 | 65 |
| | DAZ | 32 | | 422.97 | 42.27 | 850 | 173 |
| | HSFY | 47 | | 140.77 | 7.15 | 228 | 57 |
| | PRY | 2 | | 73.00 | 0.00 | 73 | 73 |
| | RBMY | 41 | | 221.61 | 19.72 | 460 | 81 |
| | TSPY | 12 | | 174.58 | 15.53 | 264 | 96 |
| | VCY | 1 | | 107.00 | NA | 107 | 107 |
| Human | BPY2 | 48 | 288 | 89.44 | 2.25 | 106 | 73 |
| | CDY | 14 | | 349.29 | 57.14 | 554 | 102 |
| | DAZ | 50 | | 402.74 | 19.27 | 744 | 121 |
| | HSFY | 8 | | 238.25 | 50.95 | 401 | 57 |
| | PRY | 14 | | 73.57 | 1.88 | 98 | 71 |
| | RBMY | 33 | | 333.88 | 27.43 | 496 | 81 |
| | TSPY | 118 | | 168.71 | 9.41 | 314 | 50 |
| | VCY | 2 | | 125.00 | 0.00 | 125 | 125 |
| | XKRY | 1 | | 117.00 | NA | 117 | 117 |
| Sumatran orangutan | CDY | 65 | 163 | 271.22 | 26.34 | 601 | 63 |
| | DAZ | 4 | | 549.25 | 108.65 | 741 | 330 |
| | HSFY | 40 | | 175.40 | 18.16 | 404 | 63 |
| | RBMY | 25 | | 362.92 | 20.39 | 497 | 114 |
| | TSPY | 27 | | 173.04 | 12.41 | 290 | 90 |
| | XKRY | 2 | | 71.00 | 0.00 | 71 | 71 |
| All species | BPY2 | 79 | | 85.05 | 1.86 | 106 | 68 |
| | CDY | 161 | | 273.81 | 16.49 | 601 | 56 |
| | DAZ | 178 | | 453.35 | 13.51 | 951 | 121 |
| | HSFY | 117 | | 171.44 | 9.46 | 404 | 57 |
| | PRY | 29 | | 72.45 | 0.95 | 98 | 69 |
| | RBMY | 258 | | 341.36 | 9.56 | 497 | 66 |
| | TSPY | 226 | | 177.57 | 5.83 | 314 | 50 |
| | VCY | 7 | | 147.00 | 17.31 | 228 | 107 |
| | XKRY | 8 | | 84.38 | 9.91 | 140 | 64 |
| All species | All YAGs | 1063 | | 266.76 | 5.68 | 951 | 50 |
Figure 3.
A. Total number of high-confidence cORFs per YAG family per species. B. Lengths of replicate-supported high-confidence cORFs and their comparisons to the lengths of known proteins (present in the NCBI protein database).
We compared the lengths of high-confidence cORFs with the lengths of previously published protein sequences when such sequences were available in the NCBI protein database (Fig. 3B and Table 1). Overall, we found the lengths of the predicted and previously reported proteins to be in good agreement. For example, a significant proportion of the identified BPY2 protein-coding transcripts encode 106-aa-long proteins, matching the length of the previously reported BPY2 protein isoform. In agreement with previously published data (Table 1), in many cases we predicted protein isoforms of different lengths for the same gene family (Fig. 3B). For example, the lengths of CDY proteins form two distinct groups roughly matching the lengths of previously reported protein isoforms (470 and 540 aa). Similarly, for the HSFY gene family, the lengths of some predicted isoforms cluster around previously reported lengths of HSFY proteins (323 and 401 aa). The remaining predicted HSFY proteins had lengths between 50 and ~250 aa. Interestingly, the two longer (323 and 401 aa) HSFY isoforms were identified in human and orangutans, whereas the shorter (50–250 aa) isoforms were found in gorilla, bonobo, chimpanzee, and human, suggesting the presence of species-specific isoforms. The RBMY, CDY, TSPY, and DAZ gene families, known to have numerous protein-coding copies across great ape species, demonstrated a large variation in the length of predicted proteins.
The characterization of the coding potential of YAG transcripts allowed us to differentiate between genes and transcribed pseudogenes. We identified that CDY, DAZ, RBMY, TSPY, and HSFY had multiple gene copies in all species. Except for CDY, we also identified numerous transcribed pseudogenes in these gene families (Fig. 4). BPY2, XKRY, VCY, and PRY were represented by a smaller number of gene copies, with protein-coding copies missing in some species (Fig. 4).
Figure 4.
Summary of the number of identified gene copies and transcribed pseudogene copies of YAGs in great ape species. The transcribed pseudogenes are shown at the lower part of each graph.
Diversity of structural and sequence isoforms of YAGs across species
To characterize the isoform diversity of YAG transcripts across great ape species, for each gene family, we mapped transcripts from all species to a single human protein-coding gene copy (Tomaszkiewicz et al., 2023). Structural isoforms were identified by comparing the exon-intron structure of transcripts within and across species. We limited our analysis to transcripts with high-confidence cORFs. Sequence isoforms were identified by comparing transcript protein sequences within and across species. This part of the analysis was limited to replicate-supported transcripts with high-confidence cORFs.
For every gene family, we observed a high diversity of structural and sequence isoforms (Fig. 5; the exon-intron structure of each transcript is presented in Figs. S12–S20). RBMY had the highest number of unique sequence isoforms (n = 87) and, together with DAZ, had the highest number of structural isoforms (n = 40; Fig. 5A). The lowest number of sequence isoforms was identified for BPY2 (n = 4) and the lowest number of structural isoforms was identified for VCY (n = 2). For seven out of nine gene families (CDY, HSFY, PRY, RBMY, TSPY, VCY, DAZ), the diversity of sequence isoforms was higher than the diversity of structural isoforms (Fig. 5B). For XKRY, the number of unique structural and sequence isoforms was equal. For BPY2, the diversity of structural isoforms was higher than the diversity of sequence isoforms.
Figure 5.
Characterization of YAG transcript isoform diversity of great apes. Structural isoforms (differ in exon-intron structure, in orange) and sequence isoforms (differ in sequence, in blue) are shown. A. Total number of unique structural and sequence isoforms per gene family. B. Diversity measure (inverse Simpson; Simpson 1949) of transcript isoform diversity per gene family for each species.
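For reference, the inverse Simpson index named in the caption is D = 1 / Σ p_i², where p_i is the proportion of observations belonging to category i. The sketch below assumes that the observations are transcripts and the categories are isoform identifiers for one gene family in one species; this unit of counting is an assumption of the example, and the labels are hypothetical.

```python
from collections import Counter


def inverse_simpson(isoform_labels):
    """Inverse Simpson index, 1 / sum(p_i ** 2), over isoform labels.

    isoform_labels: one label per transcript (assumed unit of counting).
    """
    counts = Counter(isoform_labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one transcript is required")
    return 1.0 / sum((n / total) ** 2 for n in counts.values())


# Hypothetical labels: four transcripts split evenly between two isoforms
even_two = inverse_simpson(["iso1", "iso1", "iso2", "iso2"])
# Four transcripts that all represent the same isoform
single = inverse_simpson(["iso1"] * 4)
```

The index equals 1 when all transcripts represent a single isoform and equals the number of isoforms when transcripts are spread evenly among them, so it can be read as an effective number of isoforms.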
We identified isoforms shared between species and isoforms unique to each species (Figs. 20–22). HSFY and TSPY had structural isoforms shared across all species that have protein-coding copies of these genes, and CDY had a structural isoform that was shared by five out of six species (Fig. 6). Bornean and Sumatran orangutans had shared isoforms at both structural and sequence levels for each gene family (except for DAZ that was not observed in Sumatran orangutan; Fig. 6). Chimpanzee and bonobo shared common structural, but not sequence, isoforms for CDY, PRY, VCY, and RBMY, and common sequence and structural isoforms for TSPY and DAZ (Fig. 6). Interestingly, TSPY demonstrated remarkable conservation at the level of transcript structure with the majority of transcripts presented by a single isoform shared across all species (Fig. 6).
Figure 6.
Sequence and structural transcript diversity for CDY, TSPY, and DAZ gene families across great ape species. Species are shown on the y-axis, isoform (structural in red and sequence in blue) ids on the x-axis; the number on the intersection corresponds to the number of transcripts representing a given isoform; data are presented for replicate-supported isoforms only. Figs. S21–S23 have similar plots for BPY2, HSFY, PRY, RBMY, VCY, and XKRY gene families.
Finally, we investigated whether transcript isoform diversity, as measured with the Shannon diversity index, is correlated with gene expression levels or gene copy number. We limited our analysis to CDY, DAZ, HSFY, TSPY, and RBMY, the gene families that have a large number of protein-coding transcripts and are present in more than three species. Neither structural nor sequence diversity demonstrated a relationship with gene expression that was consistent across gene families (Fig. S24); correlation coefficients were not statistically significant. Gene copy number was positively correlated with sequence isoform diversity in most cases, but only the correlation coefficient for CDY was statistically significant (p = 0.018; Fig. 7). There was no consistent relationship between structural isoform diversity and gene copy number (Fig. S25).
Figure 7.
Relationship between sequence isoform diversity (inverse Simpson) and gene copy number (natural logarithm) across all great ape species per gene family. The black line represents the fitted regression line. The Pearson correlation coefficient and p-value are shown at the upper part of each plot; the sample size (i.e. the number of species that have a given gene family) is shown at the bottom.
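The quantity shown in each panel, a Pearson correlation between diversity and the natural logarithm of copy number, can be computed as in the sketch below. The input values are hypothetical and chosen only so that the result is easy to check by hand; the p-value is not computed here, and the function is a plain textbook formula rather than the statistical software used for the figure.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need two sequences of equal length, at least 3 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical per-species values for one gene family (not data from the study)
copy_numbers = [2, 4, 8, 16]
diversity = [1.0, 2.0, 3.0, 4.0]

# Copy number enters the correlation on the natural-log scale
log_copy_numbers = [math.log(c) for c in copy_numbers]
r = pearson_r(log_copy_numbers, diversity)
```

With at most six species per gene family, such a correlation rests on very few points, which is consistent with most coefficients failing to reach statistical significance.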
Discussion
To produce a comprehensive catalog of YAG transcripts in great apes, we used a combination of long- and short-read sequencing and took advantage of the recently released T2T assemblies of their sex chromosomes. Long-read sequencing was crucial to differentiate between transcripts originating from highly similar copies present within each YAG family. Additionally, we were able to identify both new gene copies and new isoforms of annotated copies of YAGs. The number of transcripts assembled, as well as their coding potential, varied substantially between species and gene families. Some gene families underwent pseudogenization; in each gene family, we identified at least one transcribed pseudogene. The observed transcript isoform diversity originated from alternative splicing, leading to structural diversity, and gene duplication and divergence, leading to sequence diversity. In general, sequence isoform diversity was higher than structural isoform diversity.
A refined catalog of YAGs in great apes
We present our YAG transcript assemblies in comparison with previously reported findings. Consistent with Tomaszkiewicz et al. (2023), we report the presence of transcripts from all nine YAG families in all great ape species, except BPY2 in bonobo (Fig. 2A). Vegesna et al. (2020) reported that five gene families (BPY2, CDY, DAZ, RBMY, TSPY) are shared across all species, three (HSFY, PRY, and XKRY) are shared between gorilla, both species of orangutans and human, and VCY is specific to chimpanzee and human. In contrast to Vegesna et al. (2020), we did not detect BPY2 transcripts originating from the Y chromosome of bonobo. However, we identified transcripts from the autosomal homolog(s) of BPY2 (Fig. 2B), which might have been misinterpreted as YAG transcripts by Vegesna et al. (2020) due to the lack of a high-quality reference genome at the time.
Consistent with other studies (Feng et al. 2019; Zhang et al. 2019; Leung et al. 2021), our results suggest that long-read sequencing and reference-based assembly using the T2T assemblies can improve assemblies of transcriptomes. The previous work that characterized transcript isoform diversity of YAGs in great apes used the de novo assembly approach, as high-quality references were not available then (Tomaszkiewicz et al. 2023). In contrast, we utilized the recently produced T2T assemblies of Y chromosomes of great apes (Makova et al. 2024). Our results are largely consistent with those of Tomaszkiewicz et al. (2023); however, there are several differences. Most importantly, compared with Tomaszkiewicz et al. (2023), we identified fewer transcripts, and on average they were longer. Indeed, we identified only 445 replicate-supported transcripts, whereas Tomaszkiewicz et al. (2023) identified 1,510. On average, transcripts identified by the de novo assembly approach were more than two times shorter (mean length of 820 bp; Tomaszkiewicz et al. 2023) than those in our reference-based approach (mean length of 1,893 bp). One possible explanation for this difference is that the de novo approach failed to assemble full transcripts due to the lack of a reference, and thus some partial transcripts were counted as distinct (shorter) transcripts. In the case of PRY, for example, Tomaszkiewicz et al. (2023) assembled 444 replicate-supported transcripts across all species with an average length of <500 bp. In contrast, we assembled only 12 replicate-supported transcripts with an average transcript length of 3,865 bp (Table S3).
Another major difference between the de novo and reference-based transcriptome assembly methods is that the de novo method does not differentiate between YAG transcripts and transcripts originating from non-Y chromosome homologs of YAGs. This could have also contributed to the overestimation of YAG transcripts. Many YAGs have homologs on the X chromosome or autosomes (Vallender and Lahn 2004; Bhowmick et al. 2007), which informs us about their origin (movement from autosomes or originating from the proto-X/Y pair). Bhowmick et al. (2007) reported homologs of HSFY, CDY, XKRY, RBMY, and TSPY on both the X chromosome and autosomes, VCY homologs on the X, and DAZ homologs on an autosome. Our results in general are in agreement with these observations, although we could not detect any DAZ, TSPY, and CDY transcripts originating from non-Y chromosomes. This might be explained by the fact that we only used targeted data for this part of the analysis, and our hybridization probes might not have captured transcripts originating from non-Y chromosome homologs (for example, the DAZ homolog on chromosome 3; Saxena et al. 1996).
Isoform diversification of YAGs is explained by both gene duplication and alternative splicing
Our results suggest that both gene duplication and alternative splicing contributed to the observed diversification of isoforms. The potential to produce alternative isoforms from one gene copy, i.e. structural isoforms, varies substantially among YAG families but does not vary substantially among species. In particular, we found that many gene copies of RBMY, DAZ, and HSFY produce several isoforms. The functional effects of the identified protein isoforms of YAGs in great apes remain to be explored, especially in cases when they are species- or genus-specific. In general, alternative splicing has a functional effect on protein isoforms and could affect the localization of proteins, the binding affinity of transcriptional regulators, enzymatic properties, and protein-protein interactions (Kelemen et al. 2013).
Gene duplication and subsequent accumulation of mutations among duplicate copies, together with sequence diversification due to speciation, represent additional mechanisms contributing to generating protein isoforms (Goldtzvik et al. 2023). Duplicate copies mask the potential negative effect of new mutations and thus facilitate the accumulation of genetic diversity playing an important role in evolution. The potential suppression of phenotypic expression of deleterious effects of mutations is particularly important in YAGs, because YAGs encode proteins that have critically important functions in reproduction (Table 1). YAGs undergo species-specific turnover and their gene copy numbers differ substantially between species and even between individuals (Ye et al. 2018; Vegesna et al. 2020). Previous work shed light on the biological meaning of copy number variation of YAGs. For instance, a positive correlation between expression level and gene copy number suggests that copy number variation is among the mechanisms of regulation of expression level on the level of gene family (Vegesna et al. 2020). Our findings complement the previously reported importance of copy number variation of YAG copies and suggest that the expansion of gene families also plays a role in the diversification of protein isoforms.
PRY undergoes pseudogenization
Consistent with previous studies (Tomaszkiewicz et al. 2023), our results suggest the pseudogenization of PRY. For PRY, the majority of the identified gene copies did not encode proteins and were marked as transcribed pseudogenes. Although the average length of transcripts was comparable with the length of transcripts of gene families with protein-coding sequences, we identified only a few high-confidence cORFs for PRY; in orangutans, we did not find any protein-coding copies. We identified several protein-coding PRY copies in human, bonobo, gorilla, and chimpanzee. However, from a comparison of the length of predicted proteins, we found that only a subset of predicted isoforms from human match the length of the previously reported PRY protein (147 aa) whereas the rest of the predicted proteins are more than two times shorter (60–70 aa). These short protein isoforms might represent non-functional truncated proteins and thus protein-coding potential of PRY in non-human apes remains unresolved. Additionally, PRY has the lowest expression across all YAGs in all great ape species (Vegesna et al. 2020), also suggesting that this gene family is on its way to becoming a pseudogene.
Interestingly, in orangutans we did not detect any protein-coding PRY transcripts; however, we found PRY transcripts originating from autosomal PRY homologs in these species. The protein-coding potential of the identified homologs needs further investigation; however, the presence of non-Y homologs might suggest the non-essentiality of Y chromosome PRY copies. Finally, the identified homologs might indicate the origin of PRY since little is currently known about it (Bhowmick et al. 2007).
BPY2 in bonobo and orangutans
Consistent with Tomaszkiewicz et al. (2023), we identified protein-coding BPY2 copies in human, gorilla, and chimpanzee. In human and gorilla, the identified copies encoded multiple structural protein-coding isoforms. In chimpanzee, we identified one copy producing one isoform and one copy producing two isoforms. In addition to protein-coding copies, we identified numerous non-coding BPY2 copies in all species, except bonobo. We could not identify high-confidence cORFs in bonobo and the two orangutan species. Possible explanations include (1) pseudogenization of BPY2 on the Y chromosome in these species, (2) low sequencing depth that prevented us from capturing the full repertoire of protein-coding transcripts, and (3) independent evolutionary origin of BPY2 in these species (hypothesized by Tomaszkiewicz et al. 2023), which would make the detection of statistically significant homologs unlikely due to high sequence divergence. Low expression levels of BPY2 and the presence of transcribed BPY2 pseudogenes on the Y chromosome in all species favor the hypothesis of BPY2 pseudogenization.
In bonobo and orangutans we identified transcripts originating from chromosome 8 with high sequence similarity to BPY2 transcripts originating from Y chromosome. Our results are in agreement with previous findings (Stuppia et al. 2005; Cao et al. 2015) reporting the presence of a non-functional autosomal BPY2 homolog BEYLA. BPY2 originated as a result of the transposition of a large fragment of BEYLA pseudogene to the amplicon of the Y chromosome. The transposition was followed by an acquisition of protein-coding ability. The absence of functional BPY2 copies on the Y chromosome in bonobo and orangutans could suggest the non-essentiality of BPY2 in these species.
Limitations of the present study and future directions
Our results should be interpreted considering several limitations. The isoforms identified here have reasonable support, with some of them found by three independent experiments. However, the fact that in certain species we could not confirm the presence of some gene families or identify protein-coding genes does not necessarily mean that these genes/transcripts are absent. First, ampliconic genes are expressed at low levels (Fagerberg et al. 2014; Vegesna et al. 2019, 2020) and the sequencing depth might have been insufficient to capture the full diversity of YAG transcripts. To overcome this obstacle, targeted IsoSeq data was generated by Tomaszkiewicz et al. (2023) and used in our study. However, the hybridization probes for the targeted IsoSeq approach were designed prior to the availability of high-quality references. Thus, some transcripts that did not anneal to the designed probes might not have been captured by the targeted IsoSeq approach. We attempted to overcome this potential limitation by complementing the transcript set obtained by targeted IsoSeq with untargeted IsoSeq. Finally, YAGs are expressed in testis at different stages of spermatogenesis and it is possible that in some samples we did not capture the correct stage to obtain a full repertoire of transcripts. The sequencing data used here originate from whole-testis samples without differentiation of different cell types. Thus, we could not investigate the relationship between isoforms and cell types present in the testis. Several studies (Hermann et al. 2018; Grive et al. 2019; Guo et al. 2021; Nie et al. 2022; Zhou et al. 2023) characterized the variability in expression levels at the single-cell level in human testis. However, none of them targeted YAGs specifically. Future single-cell studies will provide a more detailed characterization of differential isoform usage across cell types in the testis in human and non-human great apes.
Gene copies within each YAG family are highly similar. We observed that a small proportion of reads mapped non-uniquely, i.e. to several copies of one YAG family. In such cases, it was not possible to unambiguously assign reads to one gene copy. One possible reason for this is that these reads covered transcripts only partially and thus could not capture the small difference between gene copies. This limitation could be potentially overcome with higher sequencing depth. Another potential pitfall lies in the fact that the DAZ gene includes almost identical repeats that could cause alignment issues, leading to artificial transcripts.
We predicted the protein-coding potential of the assembled YAG transcripts based on their homology to known proteins. However, this approach is biased toward protein sequences present in the databases; not all proteins were available for all species studied (Fig. S26). Thus, we could have missed some of the protein-coding transcripts. Alternative, alignment-free coding-potential prediction methods could verify and potentially expand our findings. Another limitation associated with the NCBI protein database relates to the presence of entries of predicted proteins derived from transcriptomic analyses, alongside authentic protein entries originating from proteomics analysis. The results from transcriptomic studies are less reliable; however, due to the limited availability of data on YAG proteins of great apes, we still used such entries. We predicted the presence of multiple protein-coding transcript isoforms for YAG families in testis. In general, transcriptomic studies report a high diversity of isoforms in contrast to proteomics studies, which typically report a much lower diversity of isoforms (Goldtzvik et al. 2023). Experimental validation of the predicted protein isoforms, for example using mass spectrometry proteomics (Ferrández-Peral et al. 2022), could verify our results.
Functions of proteins encoded by YAGs remain poorly characterized. Several studies analyzed the effect of deletions of YAG copies on spermatogenesis (Saut et al. 2000; Navarro-Costa et al. 2007) and could only make conclusions about the essentiality of these genes for male fertility. Our dataset with the characterization of isoform diversity provides a valuable resource for subsequent functional characterization of proteins encoded by YAGs. Ahmadi Rastegar et al. (2015) characterized differences in gene expression levels between testis biopsies from men with functional vs. impaired spermatogenesis and identified isoform-level signatures that could serve as markers for diagnosis of azoospermia. However, only the XKRY, HSFY, PRY, BPY2, and DAZ families in humans were analyzed. Similar studies with additional YAG families and including non-human great apes could deepen our understanding of their role in ape spermatogenesis. Finally, little is known about several novel YAG families that were recently discovered by Zhou and colleagues (Zhou et al. 2023) and by Makova and colleagues (Makova et al. 2024). Targeted sequencing is required to characterize these gene families in future studies.
Methods
Sequencing data
We analyzed sequencing data previously generated in our laboratory as well as publicly available RNA-seq datasets (summarized in Table S1 and Fig. S7). Sequencing data was generated from testis samples of six great ape species (human, gorilla, chimpanzee, bonobo, Bornean orangutan, and Sumatran orangutan) using three different approaches: untargeted IsoSeq, the standard PacBio protocol that enables the capture of full-length RNA isoforms; targeted IsoSeq, a modified IsoSeq protocol designed to enrich for transcripts of interest (YAGs, in our case); and short-read Illumina RNA sequencing. For human, only targeted IsoSeq data was analyzed. For each species, only one individual (biological replicate) was analyzed.
For targeted IsoSeq data, the full description of samples and experimental procedures is provided in Tomaszkiewicz et al. (2023). All six testis samples were available in two technical replicates. Hybridization to biotinylated oligo probes, designed specifically to capture YAGs cDNA, was performed to enrich YAG cDNA. Raw PacBio reads from both targeted and untargeted samples were preprocessed using the Iso-Seq3 pipeline (https://github.com/PacificBiosciences/IsoSeq3) to produce full-length non-chimeric (FLNC) reads.
Short-read RNA-seq datasets were obtained from paired-end stranded libraries (except for bonobo, which had a single-end, unstranded library) and sequenced with Illumina MiSeq or HiSeq 2000 (Table S1).
Alignments
For IsoSeq data, FLNC reads were first converted from bam to fasta format using samtools (Danecek et al. 2021). Reads were then aligned to the corresponding full genome (accession numbers are provided in Table S1) using minimap2 (Li 2018) in splice-aware mode (minimap2 -ax splice -uf -C5 genomic.fna input.bam > alignment.sam). The alignments were converted from sam to bam format, sorted, and indexed using samtools. For Illumina data, adapters were removed using Trim Galore (https://github.com/FelixKrueger/TrimGalore). Reads were aligned using HISAT2 (Kim et al. 2015; scripts are available at https://github.com/agreshno/YAGs_great_apes/blob/main/short_read_alignment/trim_galore_hisat2.txt). We then filtered alignments to retain only reads aligned to the Y chromosome (our primary focus).
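The final filtering step can be illustrated with a minimal sketch. It is not the script used in the pipeline; it only shows the logic of keeping header lines and the alignment records whose reference name matches the Y chromosome, operating on SAM text. The function name `keep_contig` and the contig label `chrY` are hypothetical; the actual assemblies use the sequence names associated with the accessions in Table S1.

```python
from typing import Iterable, Iterator


def keep_contig(sam_lines: Iterable[str], contig: str) -> Iterator[str]:
    """Yield SAM header lines and alignment records mapped to `contig`."""
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines are kept so the output remains a valid SAM file.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 3 (index 2) of a SAM record is RNAME, the reference name.
        if len(fields) > 2 and fields[2] == contig:
            yield line


# Toy input: one header line and two records (contig names are made up).
example = [
    "@SQ\tSN:chrY\tLN:1000",
    "read1\t0\tchrY\t10\t60\t50M\t*\t0\t0\t*\t*",
    "read2\t0\tchr8\t20\t60\t50M\t*\t0\t0\t*\t*",
]
kept = list(keep_contig(example, "chrY"))
```

In this toy input only the header and `read1` are retained; `read2`, which maps to another chromosome, is dropped.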
Transcriptome assembly
We used Illumina data to aid the transcriptome assembly from long PacBio reads. We separately assembled three transcriptomes: untargeted IsoSeq with Illumina short reads and each of two replicates of targeted IsoSeq with Illumina short reads. Reference-based transcriptome assembly was performed using StringTie (Shumate et al. 2022) in mix mode (stringtie -G genomic.gff --mix short_reads.aligned.bam long_reads.aligned.bam -o output.gff).
Comparisons of assemblies
We compared three assembled sets of transcripts using gffcompare (Pertea and Pertea 2020; gffcompare -r reference.gff query.gff). Gffcompare classifies transcripts present in a query gff file into several categories: (1) exact match, (2) unknown (absent from the reference), (3) transcript is not completely matched to the transcript in the reference (further split into 15 different categories that clarify how two transcripts are different, for example retained intron etc.). From these comparisons, we called transcripts replicate-supported if they were present in at least two sets of transcripts (for example, in both technical replicates of targeted or in one technical replicate of targeted IsoSeq and in untargeted IsoSeq) and non replicate-supported if they were present only in one set of transcripts.
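The replicate-support rule described above can be sketched as follows. This is an illustration under simplifying assumptions, not the gffcompare-based procedure itself: transcripts are assumed to have already been matched across sets and reduced to a hashable key (here a hypothetical tuple of chromosome, strand, and intron chain), and the function name `replicate_support` is made up for this example.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def replicate_support(sets_of_transcripts: Dict[str, Iterable[Hashable]]) -> Dict[Hashable, bool]:
    """Map each transcript key to True if it occurs in at least two sets."""
    counts: Counter = Counter()
    for transcripts in sets_of_transcripts.values():
        # set() guards against counting a transcript twice within one set.
        counts.update(set(transcripts))
    return {key: n >= 2 for key, n in counts.items()}


# Hypothetical transcript keys: (chromosome, strand, intron chain).
t1 = ("chrY", "+", ((100, 200), (300, 400)))
t2 = ("chrY", "+", ((100, 200),))
t3 = ("chrY", "-", ((900, 950),))
support = replicate_support({
    "untargeted": [t1, t3],
    "targeted_1": [t1, t2],
    "targeted_2": [t1],
})
```

Here `t1` is replicate-supported because it appears in all three sets, whereas `t2` and `t3` each appear in a single set and are therefore non-replicate-supported.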
For downstream analysis we merged three sets (gffcompare untargeted.gff targeted_1.gff targeted_2.gff -o combined) to compile a non-redundant set of transcripts for each species. As low expression of YAGs (Vegesna et al. 2020) complicates the comprehensive capture of isoform diversity, we included both replicate-supported transcripts and non-replicate-supported transcripts into our analysis, unless specified otherwise. Replicate-supported transcripts are expected to mitigate sequencing errors and potential artifacts introduced during computational inference of transcripts by StringTie. However, this approach inevitably reduces the diversity of identified isoforms.
Quality control with SQANTI3
We used SQANTI3 (Pardo-Palacios et al. 2023) to perform quality control of the assembled set of transcripts and filter out low-quality transcripts (python sqanti3_qc.py combined.gtf reference.gtf reference.fna).
Annotation of transcripts
By default, StringTie annotates transcripts by finding the overlapping genes in the reference annotation for each assembled transcript and then assigning this gene name to the transcript. However, we found that YAGs in the NCBI annotations are under-annotated and the default StringTie method did not work well in our case. Thus, to annotate the transcripts assembled by StringTie we used custom scripts. First, we searched for overlaps between ampliconic genes from the NCBI annotations and the transcripts assembled by StringTie (script annotation_manual_1.R). Second, we used an annotation-free approach and assigned gene families to assembled transcripts based on sequence identity to known transcripts. For this, we used lastz (Harris 2007) to search for similarities between sequences of assembled transcripts (query sequences) and all known sequences of YAG transcripts of great apes (reference sequences). To create query sequences, we extracted transcript sequences assembled with StringTie using gffread (Pertea and Pertea 2020). To create reference sequences, we extracted the sequences of individual YAG transcripts based on NCBI annotations for great ape T2T assemblies and combined these sequences into a single file (https://github.com/agreshno/YAGs_great_apes/blob/main/Snakemake-Targeted-IsoSeq/all_species_YAGs.fasta). We then aligned the query sequences against the reference sequences (lastz query.fasta[multiple] reference.fasta --format=general:name1,zstart1,end1,name2,strand2,zstart2+,end2+,id%,blastid% > output.lastz). The gene family of the best-matching reference sequence from the lastz output was assigned as the gene family of the corresponding transcript produced with StringTie (script annotation_manual_2.R).
To select the best matching sequence from the reference sequences for each query sequence from the lastz output, we filtered the output of the previous step to retain sequences with minimum percent identity of 80% and an alignment length of at least 100 bp and selected the reference sequence that has the highest percent identity and the longest alignment.
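A minimal sketch of this selection rule is given below. It is not the annotation_manual_2.R script; the `Hit` record, the function name `best_hits`, and the example values are hypothetical. One assumption is made explicit in the code: when ranking hits, percent identity is compared first and alignment length is used as the tie-breaker, which is one possible reading of the rule stated above.

```python
from typing import Dict, List, NamedTuple


class Hit(NamedTuple):
    query: str        # assembled transcript
    reference: str    # known YAG transcript
    identity: float   # percent identity, 0-100
    aln_length: int   # alignment length in bp


def best_hits(hits: List[Hit], min_identity: float = 80.0,
              min_length: int = 100) -> Dict[str, Hit]:
    """Return the best reference hit per query after threshold filtering."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.identity < min_identity or hit.aln_length < min_length:
            continue
        current = best.get(hit.query)
        # Assumed ranking: identity first, alignment length as tie-breaker.
        if current is None or (hit.identity, hit.aln_length) > (current.identity, current.aln_length):
            best[hit.query] = hit
    return best


hits = [
    Hit("tx1", "DAZ_ref_a", 95.0, 500),
    Hit("tx1", "DAZ_ref_b", 95.0, 800),   # same identity, longer alignment
    Hit("tx1", "RBMY_ref", 99.0, 50),     # too short, filtered out
    Hit("tx2", "CDY_ref", 70.0, 900),     # identity too low, filtered out
]
selected = best_hits(hits)
```

In this example `tx1` is assigned to `DAZ_ref_b`, and `tx2` receives no assignment because its only hit fails the identity threshold.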
Identification of potential complete open reading frames
Following Tomaszkiewicz et al. (2023), to identify potential open reading frames (ORFs), we used the getorf program from the EMBOSS suite (Rice et al. 2000). We used a minimum threshold of 50 amino acids, i.e. 150 nucleotides, to consider an ORF complete (cORF) (getorf -sequence input.fasta -minsize 150 -find 1 -outseq output.orf). We added this information to the gff files using the script annotation_final.R.
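The relationship between the 50-amino-acid threshold and the 150-nucleotide size cut-off can be illustrated with a simplified ORF scan. This sketch is not a reimplementation of getorf: it only scans the three forward frames for regions that begin with ATG and end before a stop codon, and it assumes that ORF length is counted from the start codon up to, but excluding, the stop codon. The function name `forward_orfs` is hypothetical.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}


def forward_orfs(seq: str, min_nt: int = 150) -> List[Tuple[int, int]]:
    """Return 0-based, end-exclusive (start, end) spans of ATG..stop ORFs.

    Only the three forward frames are scanned. Each span starts at the ATG
    and ends just before the stop codon; spans shorter than `min_nt` are
    discarded.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if i - start >= min_nt:
                    orfs.append((start, i))
                start = None
    return orfs


# 50 codons (ATG + 49 x GCT) followed by a stop codon: exactly 150 nt.
passing = "ATG" + "GCT" * 49 + "TAA"
# 49 codons followed by a stop codon: 147 nt, below the threshold.
failing = "ATG" + "GCT" * 48 + "TAA"
passing_orfs = forward_orfs(passing)
failing_orfs = forward_orfs(failing)
```

The first sequence yields a single ORF spanning positions 0 to 150, while the second yields none, which mirrors a 50-amino-acid minimum expressed in nucleotides.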
Identification of high-confidence complete open reading frames
Following Tomaszkiewicz et al. (2023), to identify high-confidence cORFs, we searched for homology between products of predicted open reading frames and coding sequences of YAGs of great apes. We created a database of primate YAGs by searching for protein sequences in the NCBI protein database of the gene families of YAGs and by specifying the species of interest (create_protein_db.txt). We used BLASTP (Camacho et al. 2009) to search for homology between predicted complete open reading frames and the database of great ape YAG proteins. We used thresholds for E-values of <10^-6 and for bit scores >50 (blastp -query input.fasta -db protein_db/human_apes_YAGs_full -out homologs.txt -evalue 1e-30 -outfmt "6 qseqid sseqid qcovs length pident evalue bitscore mismatch gaps qstart qend sstart send qseq sseq qlen"). The retrieved homologs are statistically significant according to Pearson (2013).
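As an illustration of how such thresholds could be applied to the tabular BLASTP output, the sketch below parses tab-separated lines using the column order given in the -outfmt string above. It is not one of the scripts from the study; the function name `significant_hits` and the example rows are hypothetical. The cut-offs are exposed as parameters with the values stated in the text as defaults, so that a stricter E-value (such as the 1e-30 passed on the command line) can be substituted.

```python
from typing import Dict, Iterable, List

# Column order taken from the -outfmt string quoted in the text.
COLUMNS = ("qseqid sseqid qcovs length pident evalue bitscore mismatch gaps "
           "qstart qend sstart send qseq sseq qlen").split()


def significant_hits(lines: Iterable[str], max_evalue: float = 1e-6,
                     min_bitscore: float = 50.0) -> List[Dict[str, str]]:
    """Keep rows with E-value below `max_evalue` and bit score above `min_bitscore`."""
    kept = []
    for line in lines:
        if not line.strip():
            continue
        row = dict(zip(COLUMNS, line.rstrip("\n").split("\t")))
        if float(row["evalue"]) < max_evalue and float(row["bitscore"]) > min_bitscore:
            kept.append(row)
    return kept


# Two made-up rows with 16 fields each.
rows = [
    "\t".join(["orf1", "protA", "98", "120", "95.0", "1e-40", "210",
               "6", "0", "1", "120", "1", "120", "MKV", "MKV", "122"]),
    "\t".join(["orf2", "protB", "40", "60", "35.0", "0.001", "35",
               "39", "2", "1", "60", "5", "64", "MAA", "MSA", "150"]),
]
high_confidence = significant_hits(rows)
```

Only `orf1` passes both cut-offs in this toy input.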
Comparison with reference annotation and isoform classification
We performed isoform classification using SQANTI3 (Pardo-Palacios et al. 2023) against reference NCBI annotations (Table S1; python sqanti3_qc.py combined.gtf reference.gtf reference.fna).
Alignment to one human YAG copy
In order to compare alternative splicing patterns within one gene family across different species, following Tomaszkiewicz et al. (2023), we aligned all replicate-supported high-confidence complete open reading frames to one of the copies of a given gene family in the human genome using uLTRA (version v0.0.4) (Sahlin and Mäkinen 2021). For this, we (1) filtered the gff file to include only replicate-supported transcripts with high-confidence cORF with the script subset_gtf_hc_cORF.R, (2) extracted fasta sequences of the transcripts with gffread, (3) concatenated fasta files for each gene family across all species, (4) used uLTRA to map transcripts to one copy in the human genome (uLTRA pipeline human_YAGs.fa Human.vcf gene_family.fasta gene_family_ultra/ --t 40), (5) converted sam files to bam format, sorted and indexed them using samtools, (6) transformed sorted bam files to bed files using bedtools, and finally (7) used the script plots_transcripts_all.R to visualize transcripts across different species for each gene family.
Identification of unique structural and sequence isoforms
Structural isoforms.
First, we transformed a bed alignment from uLTRA to gff. From the obtained gff file we extracted information about each transcript into a separate gff file (both of these steps were done with script subset_gff_transcript.R) and compared against the full gff file using gffcompare (gffcompare transcript_1.gff -r gene_family_transcripts.gff -o isoform_1).
Sequence isoforms.
Using script extract_orf_sequences.R we extracted protein sequences encoded by high-confidence cORFs into a separate file for each gene family. For each gene family we built a multiple sequence alignment of protein sequences using MEGA (version 11.0.13; Tamura et al. 2021).
Identification of unique isoforms.
For each gene family we counted the number of unique structural isoforms and unique sequence isoforms and assigned the corresponding isoform id to each transcript (separate scripts for each gene family, for example BPY2_isoform_count.R). To identify unique structural isoforms we used tmap output files produced for each transcript on the step “Structural isoforms”; we assigned the same isoform id to all transcripts that were marked as reference match by gffcompare. To identify unique sequence isoforms we performed all possible pairwise comparisons of the sequences in the multiple sequence alignment file. For each pair of sequences we compared each position in the alignment file (including positions marked as gaps in sequences) and quantified the proportion of identical positions. We assigned sequences that were identical at every position to the same sequence isoform.
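The grouping of sequence isoforms can be sketched as follows. This is an illustrative reimplementation under stated assumptions rather than the per-family R scripts: the input is assumed to be a dictionary of equal-length aligned protein sequences in which gaps are written as `-`, and the function names and toy sequences are hypothetical.

```python
from typing import Dict, List


def proportion_identical(a: str, b: str) -> float:
    """Proportion of alignment columns at which two aligned sequences agree.

    Gap characters are compared like any other character, so a gap opposite
    a residue counts as a difference.
    """
    if len(a) != len(b) or not a:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def assign_isoform_ids(aligned: Dict[str, str]) -> Dict[str, int]:
    """Give the same 1-based isoform id to sequences identical at every column."""
    representatives: List[str] = []
    ids: Dict[str, int] = {}
    for name, seq in aligned.items():
        for idx, rep in enumerate(representatives):
            if proportion_identical(seq, rep) == 1.0:
                ids[name] = idx + 1
                break
        else:
            # No identical representative found: start a new isoform.
            representatives.append(seq)
            ids[name] = len(representatives)
    return ids


toy_alignment = {"tx1": "MKV-LT", "tx2": "MKV-LT", "tx3": "MKVALT"}
isoform_ids = assign_isoform_ids(toy_alignment)
```

In the toy alignment, `tx1` and `tx2` share an isoform id, while `tx3` differs at one of six columns (a residue opposite a gap) and is assigned a separate id.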
Quantification of diversity of isoforms.
To quantify diversity of transcripts produced (script summary.R) we used two measures: (1) isoform count, (2) the inverse Simpson diversity index ($D = 1/\sum_{i} p_i^{2}$, where $p_i$ is the proportional abundance of isoform $i$). The diversity index takes into account the number of identified isoforms and their relative abundances; the inverse diversity index is less biased by the sample size (https://github.com/vegandevs/vegan). All scripts are available in the GitHub folder “isoform_identification”.
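For illustration, the inverse Simpson index, defined as one divided by the sum of squared proportional abundances, can be computed as in the sketch below. This is not the summary.R script. It assumes that the abundance of an isoform is given as a non-negative count (for example, the number of transcripts assigned to that isoform), which is an assumption made for this example; the function name is hypothetical.

```python
from typing import Sequence


def inverse_simpson(counts: Sequence[float]) -> float:
    """Inverse Simpson index: 1 / sum(p_i ** 2) over isoform proportions p_i."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("at least one isoform must have a positive count")
    # Isoforms with zero abundance contribute nothing to the sum.
    return 1.0 / sum((c / total) ** 2 for c in counts if c > 0)


even = inverse_simpson([5, 5])        # two equally abundant isoforms
single = inverse_simpson([10])        # a single isoform
skewed = inverse_simpson([8, 1, 1])   # one dominant isoform, two rare ones
```

Two equally abundant isoforms give an index of 2, a single isoform gives 1, and the skewed example gives a value between 1 and 3, reflecting that the index rewards evenness as well as the number of isoforms.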
Correlation of YAG isoform diversity with expression levels and gene copy number
We investigated the relationship between the measure of diversity of structural and sequence isoforms and gene expression levels (Vegesna et al. 2020) and gene copy number (only protein coding copies were counted) by calculating the pairwise Spearman correlation coefficient for each YAG family (script summary.R).
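A dependency-free sketch of the Spearman correlation coefficient is shown below for readers who want to see the calculation spelled out. It is not the summary.R script and does not compute a p-value. It ranks both variables (assigning average ranks to ties) and then computes the Pearson correlation of the ranks; the function names are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


rho_up = spearman([1, 2, 3, 4], [10, 20, 30, 40])
rho_down = spearman([1, 2, 3], [3, 2, 1])
```

Because at most six species contribute a data point per gene family, coefficients obtained this way rest on very few observations and should be read with corresponding caution.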
Full-genome analysis of targeted IsoSeq
We performed an additional analysis of targeted IsoSeq data (for all species) by modifying the pipeline in Fig. S28 (available on GitHub). In this analysis, we assembled transcripts from full-genome alignments (by omitting the filtering for Y chromosome-specific alignments). This step was required to detect non-Y chromosome homologs of YAGs and their locations (X chromosome/autosomes). We restricted our analysis to targeted IsoSeq because we expected that the enrichment step would capture not only transcripts that originate from YAGs, but also transcripts from their non-Y chromosome homologs, due to high sequence identity between them (Ahmadi Rastegar et al. 2015).
Analysis of human sequencing data
For human, we analyzed only two technical replicates of targeted IsoSeq. The pipeline was similar to the non-human pipeline, with the exception of the StringTie assembly step (stringtie -G genomic.gff -L long_reads.aligned.bam -o output.gff).
Supplementary Material
ACKNOWLEDGMENTS
We are grateful to Katy Munson and Evan Eichler for generating the untargeted IsoSeq data from the great ape testis samples collected in the Makova lab. This study was supported by grants R01GM130691 and R35GM151945 to KDM.
Footnotes
Code Availability
For non-human great apes, we integrated data from three different sequencing approaches into one pipeline to assemble YAGs transcripts (Fig. 14). The pipeline is available at GitHub as a Snakefile (https://github.com/agreshno/YAGs_great_apes/blob/main/Snakemake-Targeted-IsoSeq/Snakefile).
Data Availability
All datasets used in this study, including untargeted IsoSeq, targeted IsoSeq, and Illumina sequencing data are publicly available and listed in Table S1 along with accession numbers.
References
- Ahmadi Rastegar D., Sharifi Tabar M., Alikhani M., Parsamatin P., Sahraneshin Samani F., Sabbaghian M., Sadighi Gilani M. A., Mohammad Ahadi A., Mohseni Meybodi A., Piryaei A., Ansari-Pour N., Gourabi H., Baharvand H., and Salekdeh G. H. 2015. Isoform-Level Gene Expression Profiles of Human Y Chromosome Azoospermia Factor Genes and Their X Chromosome Paralogs in the Testicular Tissue of Non-Obstructive Azoospermia Patients. J. Proteome Res. 14:3595–3605. American Chemical Society.Bachtrog, D. 2013. Y-chromosome evolution: emerging insights into processes of Y-chromosome degeneration. Nat. Rev. Genet. 14:113–124. Nature Publishing Group. [DOI] [PubMed] [Google Scholar]
- Bhowmick B. K., Satta Y., and Takahata N. 2007. The origin and evolution of human ampliconic gene families and ampliconic structure. Genome Res. 17:441–450. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Camacho C., Coulouris G., Avagyan V., Ma N., Papadopoulos J., Bealer K., and Madden T. L. 2009. BLAST+: architecture and applications. BMC Bioinformatics 10:421. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Charlesworth B. 1996. The evolution of chromosomal sex determination and dosage compensation. Curr. Biol. 6:149–162. Elsevier. [DOI] [PubMed] [Google Scholar]
- Danecek P., Bonfield J. K., Liddle J., Marshall J., Ohan V., Pollard M. O., Whitwham A., Keane T., McCarthy S. A., Davies R. M., and Li H. 2021. Twelve years of SAMtools and BCFtools. GigaScience 10:giab008. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Fagerberg L., Hallström B. M., Oksvold P., Kampf C., Djureinovic D., Odeberg J., Habuka M., Tahmasebpoor S., Danielsson A., Edlund K., Asplund A., Sjöstedt E., Lundberg E., Szigyarto C. A.-K., Skogs M., Takanen J. O., Berling H., Tegel H., Mulder J., Nilsson P., Schwenk J. M., Lindskog C., Danielsson F., Mardinoglu A., Sivertsson Å., von Feilitzen K., Forsberg M., Zwahlen M., Olsson I., Navani S., Huss M., Nielsen J., Ponten F., and Uhlén M. 2014. Analysis of the Human Tissue-specific Expression by Genome-wide Integration of Transcriptomics and Antibody-based Proteomics *. Mol. Cell. Proteomics 13:397–406. Elsevier. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Feng S., Xu M., Liu F., Cui C., and Zhou B. 2019. Reconstruction of the full-length transcriptome atlas using PacBio Iso-Seq provides insight into the alternative splicing in Gossypium australe. BMC Plant Biol. 19:365. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ferrández-Peral L., Zhan X., Alvarez-Estape M., Chiva C., Esteller-Cucala P., García-Pérez R., Julià E., Lizano E., Fornas Ò., Sabidó E., Li Q., Marquès-Bonet T., Juan D., and Zhang G. 2022. Transcriptome innovations in primates revealed by single-molecule long-read sequencing. Genome Res. 32:1448–1462. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Goldtzvik Y., Sen N., Lam S. D., and Orengo C. 2023. Protein diversification through post-translational modifications, alternative splicing, and gene duplication. Curr. Opin. Struct. Biol. 81:102640. [DOI] [PubMed] [Google Scholar]
- Grive K. J., Hu Y., Shu E., Grimson A., Elemento O., Grenier J. K., and Cohen P. E. 2019. Dynamic transcriptome profiles within spermatogonial and spermatocyte populations during postnatal testis maturation revealed by single-cell sequencing. PLOS Genet. 15:e1007810. Public Library of Science. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Guo J., Sosa E., Chitiashvili T., Nie X., Rojas E. J., Oliver E., Plath K., Hotaling J. M., Stukenborg J.-B., Clark A. T., and Cairns B. R. 2021. Single-cell analysis of the developing human testis reveals somatic niche cell specification and fetal germline stem cell establishment. Cell Stem Cell 28:764–778.e4. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hallast P., Balaresque P., Bowden G. R., Ballereau S., and Jobling M. A. 2013. Recombination Dynamics of a Human Y-Chromosomal Palindrome: Rapid GC-Biased Gene Conversion, Multi-kilobase Conversion Tracts, and Rare Inversions. PLOS Genet. 9:e1003666. Public Library of Science. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Hermann B. P., Cheng K., Singh A., Roa-De La Cruz L., Mutoji K. N., Chen I.-C., Gildersleeve H., Lehle J. D., Mayo M., Westernströer B., Law N. C., Oatley M. J., Velte E. K., Niedenberger B. A., Fritze D., Silber S., Geyer C. B., Oatley J. M., and McCarrey J. R. 2018. The Mammalian Spermatogenesis Single-Cell Transcriptome, from Spermatogonial Stem Cells to Spermatids. Cell Rep. 25:1650–1667.e8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kelemen O., Convertini P., Zhang Z., Wen Y., Shen M., Falaleeva M., and Stamm S. 2013. Function of alternative splicing. Gene 514:1–30. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Kim D., Langmead B., and Salzberg S. L. 2015. HISAT: a fast spliced aligner with low memory requirements. Nat. Methods 12:357–360. Nature Publishing Group.
- Leung S. K., Jeffries A. R., Castanho I., Jordan B. T., Moore K., Davies J. P., Dempster E. L., Bray N. J., O’Neill P., Tseng E., Ahmed Z., Collier D. A., Jeffery E. D., Prabhakar S., Schalkwyk L., Jops C., Gandal M. J., Sheynkman G. M., Hannon E., and Mill J. 2021. Full-length transcript sequencing of human and mouse cerebral cortex identifies widespread isoform diversity and alternative splicing. Cell Rep. 37:110022.
- Li H. 2018. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34:3094–3100.
- Makova K. D., Pickett B. D., Harris R. S., Hartley G. A., Cechova M., Pal K., Nurk S., Yoo D., Li Q., Hebbar P., McGrath B. C., Antonacci F., Aubel M., Biddanda A., Borchers M., Bornberg-Bauer E., Bouffard G. G., Brooks S. Y., Carbone L., Carrel L., Carroll A., Chang P.-C., Chin C.-S., Cook D. E., Craig S. J. C., de Gennaro L., Diekhans M., Dutra A., Garcia G. H., Grady P. G. S., Green R. E., Haddad D., Hallast P., Harvey W. T., Hickey G., Hillis D. A., Hoyt S. J., Jeong H., Kamali K., Pond S. L. K., LaPolice T. M., Lee C., Lewis A. P., Loh Y.-H. E., Masterson P., McGarvey K. M., McCoy R. C., Medvedev P., Miga K. H., Munson K. M., Pak E., Paten B., Pinto B. J., Potapova T., Rhie A., Rocha J. L., Ryabov F., Ryder O. A., Sacco S., Shafin K., Shepelev V. A., Slon V., Solar S. J., Storer J. M., Sudmant P. H., Sweetalana, Sweeten A., Tassia M. G., Thibaud-Nissen F., Ventura M., Wilson M. A., Young A. C., Zeng H., Zhang X., Szpiech Z. A., Huber C. D., Gerton J. L., Yi S. V., Schatz M. C., Alexandrov I. A., Koren S., O’Neill R. J., Eichler E. E., and Phillippy A. M. 2024. The Complete Sequence and Comparative Analysis of Ape Sex Chromosomes. bioRxiv.
- Miga K. H., Koren S., Rhie A., Vollger M. R., Gershman A., Bzikadze A., Brooks S., Howe E., Porubsky D., Logsdon G. A., Schneider V. A., Potapova T., Wood J., Chow W., Armstrong J., Fredrickson J., Pak E., Tigyi K., Kremitzki M., Markovic C., Maduro V., Dutra A., Bouffard G. G., Chang A. M., Hansen N. F., Wilfert A. B., Thibaud-Nissen F., Schmitt A. D., Belton J.-M., Selvaraj S., Dennis M. Y., Soto D. C., Sahasrabudhe R., Kaya G., Quick J., Loman N. J., Holmes N., Loose M., Surti U., ana Risques R., Graves Lindsay T. A., Fulton R., Hall I., Paten B., Howe K., Timp W., Young A., Mullikin J. C., Pevzner P. A., Gerton J. L., Sullivan B. A., Eichler E. E., and Phillippy A. M. 2020. Telomere-to-telomere assembly of a complete human X chromosome. Nature 585:79–84. Nature Publishing Group.
- Mueller J. L., Skaletsky H., Brown L. G., Zaghlul S., Rock S., Graves T., Auger K., Warren W. C., Wilson R. K., and Page D. C. 2013. Independent specialization of the human and mouse X chromosomes for the male germ line. Nat. Genet. 45:1083–1087. Nature Publishing Group.
- Navarro-Costa P., Pereira L., Alves C., Gusmão L., Proença C., Marques-Vidal P., Rocha T., Correia S. C., Jorge S., Neves A., Soares A. P., Nunes J., Calhaz-Jorge C., Amorim A., Plancha C. E., and Gonçalves J. 2007. Characterizing partial AZFc deletions of the Y chromosome with amplicon-specific sequence markers. BMC Genomics 8:342.
- Nie X., Munyoki S. K., Sukhwani M., Schmid N., Missel A., Emery B. R., DonorConnect, Stukenborg J.-B., Mayerhofer A., Orwig K. E., Aston K. I., Hotaling J. M., Cairns B. R., and Guo J. 2022. Single-cell analysis of human testis aging and correlation with elevated body mass index. Dev. Cell 57:1160–1176.e5. Elsevier.
- Pardo-Palacios F. J., Arzalluz-Luque A., Kondratova L., Salguero P., Mestre-Tomás J., Amorín R., Estevan-Morió E., Liu T., Nanni A., McIntyre L., Tseng E., and Conesa A. 2023. SQANTI3: curation of long-read transcriptomes for accurate identification of known and novel isoforms. bioRxiv.
- Park E., Pan Z., Zhang Z., Lin L., and Xing Y. 2018. The Expanding Landscape of Alternative Splicing Variation in Human Populations. Am. J. Hum. Genet. 102:11–26.
- Pearson W. R. 2013. An Introduction to Sequence Similarity (“Homology”) Searching. Curr. Protoc. Bioinforma. 42:3.1.1–3.1.8.
- Pertea G., and Pertea M. 2020. GFF Utilities: GffRead and GffCompare. F1000Research.
- Ren Y., Tseng E., Smith T. P. L., Hiendleder S., Williams J. L., and Low W. Y. 2023. Long read isoform sequencing reveals hidden transcriptional complexity between cattle subspecies. BMC Genomics 24:108.
- Rhie A., Nurk S., Cechova M., Hoyt S. J., Taylor D. J., Altemose N., Hook P. W., Koren S., Rautiainen M., Alexandrov I. A., Allen J., Asri M., Bzikadze A. V., Chen N.-C., Chin C.-S., Diekhans M., Flicek P., Formenti G., Fungtammasan A., Garcia Giron C., Garrison E., Gershman A., Gerton J. L., Grady P. G. S., Guarracino A., Haggerty L., Halabian R., Hansen N. F., Harris R., Hartley G. A., Harvey W. T., Haukness M., Heinz J., Hourlier T., Hubley R. M., Hunt S. E., Hwang S., Jain M., Kesharwani R. K., Lewis A. P., Li H., Logsdon G. A., Lucas J. K., Makalowski W., Markovic C., Martin F. J., Mc Cartney A. M., McCoy R. C., McDaniel J., McNulty B. M., Medvedev P., Mikheenko A., Munson K. M., Murphy T. D., Olsen H. E., Olson N. D., Paulin L. F., Porubsky D., Potapova T., Ryabov F., Salzberg S. L., Sauria M. E. G., Sedlazeck F. J., Shafin K., Shepelev V. A., Shumate A., Storer J. M., Surapaneni L., Taravella Oill A. M., Thibaud-Nissen F., Timp W., Tomaszkiewicz M., Vollger M. R., Walenz B. P., Watwood A. C., Weissensteiner M. H., Wenger A. M., Wilson M. A., Zarate S., Zhu Y., Zook J. M., Eichler E. E., O’Neill R. J., Schatz M. C., Miga K. H., Makova K. D., and Phillippy A. M. 2023. The complete sequence of a human Y chromosome. Nature 621:344–354. Nature Publishing Group.
- Rice P., Longden I., and Bleasby A. 2000. EMBOSS: The European Molecular Biology Open Software Suite. Trends Genet. 16:276–277. Elsevier.
- Rice W. R. 1987. The Accumulation of Sexually Antagonistic Genes as a Selective Agent Promoting the Evolution of Reduced Recombination between Primitive Sex Chromosomes. Evolution 41:911–914. Society for the Study of Evolution, Wiley.
- Ross M. T., Grafham D. V., Coffey A. J., Scherer S., McLay K., Muzny D., Platzer M., Howell G. R., Burrows C., Bird C. P., Frankish A., Lovell F. L., Howe K. L., Ashurst J. L., Fulton R. S., Sudbrak R., Wen G., Jones M. C., Hurles M. E., Andrews T. D., Scott C. E., Searle S., Ramser J., Whittaker A., Deadman R., Carter N. P., Hunt S. E., Chen R., Cree A., Gunaratne P., Havlak P., Hodgson A., Metzker M. L., Richards S., Scott G., Steffen D., Sodergren E., Wheeler D. A., Worley K. C., Ainscough R., Ambrose K. D., Ansari-Lari M. A., Aradhya S., Ashwell R. I. S., Babbage A. K., Bagguley C. L., Ballabio A., Banerjee R., Barker G. E., Barlow K. F., Barrett I. P., Bates K. N., Beare D. M., Beasley H., Beasley O., Beck A., Bethel G., Blechschmidt K., Brady N., Bray-Allen S., Bridgeman A. M., Brown A. J., Brown M. J., Bonnin D., Bruford E. A., Buhay C., Burch P., Burford D., Burgess J., Burrill W., Burton J., Bye J. M., Carder C., Carrel L., Chako J., Chapman J. C., Chavez D., Chen E., Chen G., Chen Y., Chen Z., Chinault C., Ciccodicola A., Clark S. Y., Clarke G., Clee C. M., Clegg S., Clerc-Blankenburg K., Clifford K., Cobley V., Cole C. G., Conquer J. S., Corby N., Connor R. E., David R., Davies J., Davis C., Davis J., Delgado O., DeShazo D., Dhami P., Ding Y., Dinh H., Dodsworth S., Draper H., Dugan-Rocha S., Dunham A., Dunn M., Durbin K. J., Dutta I., Eades T., Ellwood M., Emery-Cohen A., Errington H., Evans K. L., Faulkner L., Francis F., Frankland J., Fraser A. E., Galgoczy P., Gilbert J., Gill R., Glöckner G., Gregory S. G., Gribble S., Griffiths C., Grocock R., Gu Y., Gwilliam R., Hamilton C., Hart E. A., Hawes A., Heath P. D., Heitmann K., Hennig S., Hernandez J., Hinzmann B., Ho S., Hoffs M., Howden P. J., Huckle E. J., Hume J., Hunt P. J., Hunt A. R., Isherwood J., Jacob L., Johnson D., Jones S., de Jong P. J., Joseph S. S., Keenan S., Kelly S., Kershaw J. K., Khan Z., Kioschis P., Klages S., Knights A. J., Kosiura A., Kovar-Smith C., Laird G. K., Langford C., Lawlor S., Leversha M., Lewis L., Liu W., Lloyd C., Lloyd D. M., Loulseged H., Loveland J. E., Lovell J. D., Lozado R., Lu J., Lyne R., Ma J., Maheshwari M., Matthews L. H., McDowall J., McLaren S., McMurray A., Meidl P., Meitinger T., Milne S., Miner G., Mistry S. L., Morgan M., Morris S., Müller I., Mullikin J. C., Nguyen N., Nordsiek G., Nyakatura G., O’Dell C. N., Okwuonu G., Palmer S., Pandian R., Parker D., Parrish J., Pasternak S., Patel D., Pearce A. V., Pearson D. M., Pelan S. E., Perez L., Porter K. M., Ramsey Y., Reichwald K., Rhodes S., Ridler K. A., Schlessinger D., Schueler M. G., Sehra H. K., Shaw-Smith C., Shen H., Sheridan E. M., Shownkeen R., Skuce C. D., Smith M. L., Sotheran E. C., Steingruber H. E., Steward C. A., Storey R., Swann R. M., Swarbreck D., Tabor P. E., Taudien S., Taylor T., Teague B., Thomas K., Thorpe A., Timms K., Tracey A., Trevanion S., Tromans A. C., d’Urso M., Verduzco D., Villasana D., Waldron L., Wall M., Wang Q., Warren J., Warry G. L., Wei X., West A., Whitehead S. L., Whiteley M. N., Wilkinson J. E., Willey D. L., Williams G., Williams L., Williamson A., Williamson H., Wilming L., Woodmansey R. L., Wray P. W., Yen J., Zhang J., Zhou J., Zoghbi H., Zorilla S., Buck D., Reinhardt R., Poustka A., Rosenthal A., Lehrach H., Meindl A., Minx P. J., Hillier L. W., Willard H. F., Wilson R. K., Waterston R. H., Rice C. M., Vaudin M., Coulson A., Nelson D. L., Weinstock G., Sulston J. E., Durbin R., Hubbard T., Gibbs R. A., Beck S., Rogers J., and Bentley D. R. 2005. The DNA sequence of the human X chromosome. Nature 434:325–337. Nature Publishing Group.
- Sahlin K., and Mäkinen V. 2021. Accurate spliced alignment of long RNA sequencing reads. Bioinformatics 37:4643–4651.
- Saut N., Terriou P., Navarro A., Lévy N., and Mitchell M. J. 2000. The human Y chromosome genes BPY2, CDY1 and DAZ are not essential for sustained fertility. Mol. Hum. Reprod. 6:789–793.
- Saxena R., Brown L. G., Hawkins T., Alagappan R. K., Skaletsky H., Reeve M. P., Reijo R., Rozen S., Dinulos M. B., Disteche C. M., and Page D. C. 1996. The DAZ gene cluster on the human Y chromosome arose from an autosomal gene that was transposed, repeatedly amplified and pruned. Nat. Genet. 14:292–299. Nature Publishing Group.
- Shumate A., Wong B., Pertea G., and Pertea M. 2022. Improved transcriptome assembly using a hybrid of long and short reads with StringTie. PLOS Comput. Biol. 18:e1009730. Public Library of Science.
- Simpson E. H. 1949. Measurement of Diversity. Nature 163:688. Nature Publishing Group.
- Skaletsky H., Kuroda-Kawaguchi T., Minx P. J., Cordum H. S., Hillier L., Brown L. G., Repping S., Pyntikova T., Ali J., Bieri T., Chinwalla A., Delehaunty A., Delehaunty K., Du H., Fewell G., Fulton L., Fulton R., Graves T., Hou S.-F., Latrielle P., Leonard S., Mardis E., Maupin R., McPherson J., Miner T., Nash W., Nguyen C., Ozersky P., Pepin K., Rock S., Rohlfing T., Scott K., Schultz B., Strong C., Tin-Wollam A., Yang S.-P., Waterston R. H., Wilson R. K., Rozen S., and Page D. C. 2003. The male-specific region of the human Y chromosome is a mosaic of discrete sequence classes. Nature 423:825–837. Nature Publishing Group.
- Soto E. J. L., Gandal M. J., Gonatopoulos-Pournatzis T., Heller E. A., Luo D., and Zheng S. 2019. Mechanisms of Neuronal Alternative Splicing and Strategies for Therapeutic Interventions. J. Neurosci. 39:8193–8199. Society for Neuroscience.
- Tamura K., Stecher G., and Kumar S. 2021. MEGA11: Molecular Evolutionary Genetics Analysis Version 11. Mol. Biol. Evol. 38:3022–3027.
- Tomaszkiewicz M., Sahlin K., Medvedev P., and Makova K. D. 2023. Transcript Isoform Diversity of Ampliconic Genes on the Y Chromosome of Great Apes. bioRxiv.
- Vallender E. J., and Lahn B. T. 2004. How mammalian sex chromosomes acquired their peculiar gene content. BioEssays 26:159–169.
- Vegesna R., Tomaszkiewicz M., Medvedev P., and Makova K. D. 2019. Dosage regulation, and variation in gene expression and copy number of human Y chromosome ampliconic genes. PLOS Genet. 15:e1008369. Public Library of Science.
- Vegesna R., Tomaszkiewicz M., Ryder O. A., Campos-Sánchez R., Medvedev P., DeGiorgio M., and Makova K. D. 2020. Ampliconic Genes on the Great Ape Y Chromosomes: Rapid Evolution of Copy Number but Conservation of Expression Levels. Genome Biol. Evol. 12:842–859.
- Veyrunes F., Waters P. D., Miethke P., Rens W., McMillan D., Alsop A. E., Grützner F., Deakin J. E., Whittington C. M., Schatzkamer K., Kremitzki C. L., Graves T., Ferguson-Smith M. A., Warren W., and Graves J. A. M. 2008. Bird-like sex chromosomes of platypus imply recent origin of mammal sex chromosomes. Genome Res. 18:965–973.
- Wang E. T., Sandberg R., Luo S., Khrebtukova I., Zhang L., Mayr C., Kingsmore S. F., Schroth G. P., and Burge C. B. 2008. Alternative isoform regulation in human tissue transcriptomes. Nature 456:470–476. Nature Publishing Group.
- Waters P. D., Wallis M. C., and Graves J. A. M. 2007. Mammalian sex—Origin and evolution of the Y chromosome and SRY. Semin. Cell Dev. Biol. 18:389–400.
- Zhang G., Sun M., Wang J., Lei M., Li C., Zhao D., Huang J., Li W., Li S., Li J., Yang J., Luo Y., Hu S., and Zhang B. 2019. PacBio full-length cDNA sequencing integrated with RNA-seq reads drastically improves the discovery of splicing transcripts in rice. Plant J. 97:296–305.
- Zhou Y., Zhan X., Jin J., Zhou L., Bergman J., Li X., Rousselle M. M. C., Belles M. R., Zhao L., Fang M., Chen J., Fang Q., Kuderna L., Marques-Bonet T., Kitayama H., Hayakawa T., Yao Y.-G., Yang H., Cooper D. N., Qi X., Wu D.-D., Schierup M. H., and Zhang G. 2023. Eighty million years of rapid evolution of the primate Y chromosome. Nat. Ecol. Evol. 7:1114–1130. Nature Publishing Group.
Associated Data
Data Availability Statement
All datasets used in this study, including untargeted IsoSeq, targeted IsoSeq, and Illumina sequencing data, are publicly available and are listed in Table S1 along with their accession numbers.







