Proceedings of the National Academy of Sciences of the United States of America. 2017 Oct 19;114(44):11721–11726. doi: 10.1073/pnas.1706502114

High rate of translocation-based gene birth on the Drosophila Y chromosome

Ray Tobler a,b,1, Viola Nolte a, Christian Schlötterer a,2
PMCID: PMC5676891  PMID: 29078298

Significance

Using a powerful method that uses inexpensive short reads to detect Y-linked transfers, we show that gene traffic onto the Drosophila Y chromosome is 10 times more frequent than previously thought and includes the first Y-linked retrocopies discovered in these taxa. All 25 identified Y-linked gene transfers were relatively young (<1 million years old), although most appear to be pseudogenes because only three of these transfers show signs of purifying selection. Our method provides compelling evidence that the Drosophila Y chromosome is a highly challenging and dynamic genetic environment that is capable of rapidly diverging between species and promises to reveal fundamental insights into Y chromosome evolution across many taxa.

Keywords: Y chromosome, evolution, Drosophila, retrocopies, transposition

Abstract

The Y chromosome is a unique genetic environment defined by a lack of recombination and male-limited inheritance. The Drosophila Y chromosome has been gradually acquiring genes from the rest of the genome, with only seven Y-linked genes being gained over the past 63 million years (0.12 gene gains per million years). Using a next-generation sequencing (NGS)-powered genomic scan, we show that gene transfers to the Y chromosome are much more common than previously suspected: at least 25 have arisen across three Drosophila species over the past 5.4 million years (1.67 per million years for each lineage). The gene transfer rate is significantly lower in Drosophila melanogaster than in the Drosophila simulans clade, primarily due to Y-linked retrotranspositions being significantly more common in the latter. Despite all Y-linked gene transfers being evolutionarily recent (<1 million years old), only three showed evidence for purifying selection (ω ≤ 0.14). Thus, although the resulting Y-linked functional gene acquisition rate (0.25 new genes per million years) is double the longer-term estimate, the fate of most new Y-linked genes is defined by rapid degeneration and pseudogenization. Our results show that Y-linked gene traffic, and the molecular mechanisms governing these transfers, can diverge rapidly between species, revealing the Drosophila Y chromosome to be more dynamic than previously appreciated. Our analytical method provides a powerful means to identify Y-linked gene transfers and will help illuminate the evolutionary dynamics of the Y chromosome in Drosophila and other species.


The heterochromatic, repeat-laden nature of the Drosophila Y chromosome makes it difficult to analyze, such that its evolution is still poorly understood. Only 12 Y-linked genes have been discovered on the Drosophila melanogaster Y chromosome, all of which arose by transfers from autosomes onto the Y chromosome (1–4). Because only transfers that produce functional Y-linked copies can be detected over long evolutionary timescales, we hypothesized that the underlying primary gene transfer rate may be considerably higher. To investigate this, we developed a method to detect recent gene transfers onto the Y chromosome (GeTYs). We reasoned that mapping short reads from inbred males to a female reference genome would produce polymorphisms in genes that had spawned a Y-linked duplication, whereas the same genomic region should be homozygous for short reads in females from the same inbred strain. Following this idea, we developed a metric for identifying Y-linked transfers (Methods and SI Methods) and applied it to two inbred strains from three Drosophila species: D. melanogaster, Drosophila simulans, and Drosophila mauritiana, which diverged between ∼0.24 Mya [D. simulans and D. mauritiana (5)] and 5.4 Mya [D. simulans clade and D. melanogaster (6)]. Unlike other methods that exploit sex-specific short read alignments to identify Y chromosome sequences (3, 7), our method does not require preassembled Y contigs.

SI Methods

Individual Strain Datasets.

For the three species, D. mauritiana, D. melanogaster, and D. simulans, single males and females from two different inbred strains were independently sequenced.

For D. melanogaster, we chose the reference genome strain y1; cn1 bw1 sp1 (25) (stock center no. 14021-0231.36) and a strain collected in Póvoa de Varzim, Portugal, in 2008 (by P. Orozco-terWengel) that had been inbred for 29 generations. The two D. simulans strains were collected in Madagascar in 1998 (M3 by B. Ballard) and in 2009 (Mad5 by E. Hellmich) and each had been inbred for six generations. The D. mauritiana strains were collected in Mauritius in 2006 (RED7 by M. M. Ramos) and 2007 (Lab2 by C. Schlötterer) and each had been inbred for 12 generations.

Genomic DNA of individual female and male flies was extracted using a standard high-salt extraction protocol (23) and fragmented with a Covaris S2 device (Covaris, Inc.). We prepared paired-end libraries following a modified NEBNext Ultra DNA Library Prep (E7370L) protocol with an insert size of 400 bp and dual-index barcodes (E7600S). Libraries were amplified using nine PCR cycles with Q5 high-fidelity polymerase and sequenced with a 2 × 120 bp protocol on a HiSeq2500.

Reads were mapped against the M252 reference genome for D. simulans (24), the MS17 reference genome for D. mauritiana (26), and Flybase v5 for D. melanogaster. To detect Y-linked transfers, our method required mapping reads derived from sequenced inbred individuals against a feminized genome, whereby all Y-linked and unassigned contigs were removed from the reference sequences before mapping (i.e., any contigs that could contain Y-linked sequence). Mapping was performed using the standard DistMap pipeline (27) (v. 1-2-1) using BWA aln and sampe (18) (v. 0.5.8c; aln flags -o 1 -n 0.01 -l 200 -e 12 -d 12), with initial low-quality trimming performed on the 3-prime read ends. Following the mapping, samtools (28) was used to filter for mapping quality (quality ≥ 20) and proper pairs (v. 0.1.18; flags -q 20 -f 0x0002 -F 0x0004 -F 0x0008). SNPs were called using the PoPoolation2 (29) pipeline, with SNPs being called for the combined individuals from each species and bases having quality <20 being excluded from this process. This resulted in 6,299,308, 5,492,989, and 7,848,332 SNPs being called for D. mauritiana, D. melanogaster, and D. simulans, respectively. Finally, only SNPs with a minimum coverage >10 in all sex and strain combinations were considered for GeTY detection (this criterion was applied separately to each of the four flies in a given species, so only SNPs passing this criterion in all four flies were retained), leaving 5,381,582, 4,861,458, and 6,911,697 SNPs available for the analyses in D. mauritiana, D. melanogaster, and D. simulans, respectively. Illumina read data for all strains and species are available at European Nucleotide Archive (ENA) under accession no. PRJEB22850.
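
The final coverage criterion can be illustrated with a minimal Python sketch. The dictionary layout and sample labels are hypothetical and do not reflect the PoPoolation2 output format; the sketch only shows the rule that a SNP is retained when its coverage exceeds 10 in each of the four flies of a species.

```python
MIN_COVERAGE = 10  # a SNP needs coverage strictly greater than this in every fly


def passes_coverage(cov_by_fly, min_cov=MIN_COVERAGE):
    """Return True if coverage is > min_cov in every sex-by-strain sample."""
    return all(cov > min_cov for cov in cov_by_fly.values())


# Hypothetical coverage values for two SNPs (keys are illustrative sample labels).
snps = {
    ("2L", 1001): {"A_male": 25, "A_female": 31, "B_male": 18, "B_female": 22},
    ("2L", 2002): {"A_male": 25, "A_female": 9, "B_male": 18, "B_female": 22},
}

# Only the first SNP is kept; the second fails in one female.
retained = [site for site, cov in snps.items() if passes_coverage(cov)]
```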

Detection of Y-Linked Transfers.

Two innovative recent studies have demonstrated how Y chromosome sequences can be identified by leveraging sex-specific coverage information from short read alignments to Y-specific contigs (3, 7). Our approach also exploits sex-specific differences in short read alignments but focuses on the detection of autosomal or X-linked sequences that have transferred to the Y chromosome. Accordingly, our approach does not require preassembled Y contigs: for each of the three species D. mauritiana, D. melanogaster, and D. simulans, individuals of both sexes were separately sequenced for two distinct strains and mapped to their species-specific female reference genome (i.e., Y and unassigned contigs removed). This strategy facilitates the mapping of any Y-linked translocations to their parental ortholog, should it be present in the reference, and ongoing divergence between donor and Y-linked copies does not preclude read alignment. Accordingly, read alignments containing polymorphisms that are present in males but absent in females signify the donor region of a Y-linked transfer. We developed a metric that leveraged these expected sex-specific SNP differences, while also controlling for false positives arising from residual heterozygosity that is still segregating within a given strain (such sites should be strain specific). In the following, we outline the steps taken to identify SNPs indicating Y-linked transfers for each species:

  • i)

    We called SNPs by contrasting 120 bp paired-end Illumina reads for males and females from the same isofemale line (Individual Strain Datasets). For each SNP, two sets of CMH tests (30) were performed. One contrasted male and female allele frequencies for each strain (i.e., where the replicates were the strains), and the other contrasted allele frequencies between the pairs of the same sex across both strains (i.e., where the replicates were the sexes). This procedure yielded P values for sex- and strain-specific allele frequency differences.

  • ii)

    These CMH test P values were log transformed to obtain a pair of values for each SNP, −log10(PCMH-sex) and −log10(PCMH-strain). The −log10(PCMH-sex) values were binned according to the −log10(PCMH-strain) value (bins = 0–10 in increments of 1, 12–20 in increments of 2, and 25 and above in increments of 5). For each −log10(PCMH-strain) bin, the −log10(PCMH-sex) values for all SNPs in this bin were standardized to obtain a new metric, PSTD,ij, as follows:

$$P_{\mathrm{STD},ij} = \frac{-\log_{10}(P_{\mathrm{CMH\text{-}sex}})_{ij} - E\left[-\log_{10}(P_{\mathrm{CMH\text{-}sex}})_{j}\right]}{\sigma\left[-\log_{10}(P_{\mathrm{CMH\text{-}sex}})_{j}\right]} \tag{S1}$$

where i indicates the SNP and j indicates the bin. This procedure was designed to remove any strain-specific signal for SNPs that are heterozygous in males, but not females, for a given strain (i.e., residual heterozygosity that is still segregating within inbred lines). The X chromosome and autosomes were standardized separately to account for coverage differences due to male hemizygosity for the X chromosome (thereby controlling for potential power differences between X-linked and autosomal SNPs).

  • iii)

    Following this transformation and binning process, outliers were identified as those SNPs with PSTD,ij ≥ 2, having an allele frequency >10% in both males and ≤0.1% in both females. Large PSTD,ij values indicate SNPs that have large sex-specific differences after having accounted for the strain-specific differentiation and are therefore expected to be enriched with novel Y-linked variants that arose after the transfer. The allele frequency thresholds further enforce our expectation that novel Y-linked variants should be absent in females (allowing for some sequencing and mapping error) and frequent in males. Notably, although we might naively expect the Y-linked allele frequency to be ∼1/3 for transfers arising from autosomal copies (one Y-linked copy vs. two autosomal copies) and ∼1/2 for transfers with X-linked origins (one Y-linked copy vs. one X-linked copy), the real frequency will be influenced by a number of systematic factors including positive mapping bias for reads from the donor copy (decreasing the allele frequency), the existence of additional copies of the Y-linked transfer on the autosomes or X chromosome (extra copies on these chromosomes should be shared between both sexes and therefore should lead to decreased allele frequency), and additional copies of the Y-linked transfer on the Y chromosome (which could either lead to an increased allele frequency if the majority of novel Y-specific variants are shared across all Y-linked copies or a decreased frequency if most Y-specific variants are not shared across all copies). Because many of these factors will lead to reductions in the expected frequency, we decided to use a more permissive male-specific allele frequency threshold of 10%. Applying this criterion to all SNPs resulted in 3,320, 6,320, and 2,643 outlier SNPs in D. mauritiana, D. melanogaster, and D. simulans, respectively.

  • iv)

    All outlier SNPs were assigned to categories based on whether they fell in specific genes or TEs. TEs were determined with RepeatMasker (31) (version open-4.0.5; parameter settings -gccalc -s -cutoff 200 -no_is -nolow -norna -gff -u -pa 4). Gene boundaries were defined according to the gene models for each species [annotation files: D. melanogaster = Drosophila_melanogaster.BDGP5.77.gtf (ftp.ensembl.org/pub/release-77/embl/drosophila_melanogaster/), D. mauritiana = dmau-MS17-popgen-ann-r1.2-exon-CDS.gtf, and D. simulans = JMCE01.2-exon-CDS-utr.gtf]. The less comprehensive annotations of D. mauritiana and D. simulans relative to D. melanogaster meant that genes for the former two species were more likely to not include UTRs.

  • v)

    We identified potential Y transfers by the presence of at least three outlier SNPs in annotated genes or TEs. Because some DNA-based transfers may have included regions flanking the focal gene (which may also include several other genes), we iteratively included all surrounding SNPs that fell within 5 kb of the terminal SNP in the growing region. Thus, the final Y transfer region ended when no further outlier SNP could be identified within 5 kb of the terminal SNPs in the region. As such, the identified regions could potentially comprise more than one annotated gene. Additionally, we identified putative Y-linked transfers whose donor regions did not include either annotated genes or TEs. These transfers were determined as regions containing at least five outlier SNPs, where each pair of neighboring SNPs were within 5 kb of each other and where gene and TE annotations were absent. To check that these regions were truly intergenic in D. mauritiana and D. simulans (whose annotations are less exhaustive than D. melanogaster), reads aligning to the putative nongenic transfers were BLASTed against the D. melanogaster v6 reference on the Flybase website to check for missing annotations. This process resulted in the identification of 35, 38, and 24 putative Y-linked transfers, comprising a total of 885, 2,479, and 588 Y-linked SNPs, for D. mauritiana, D. melanogaster, and D. simulans, respectively.

  • vi)

    It is possible that some of the outlier SNPs were not Y-linked substitutions but rather sites that had initially been polymorphic at the time of the transfer but where the Y-linked allele had since come to differ from the donor allele in our strains. To identify and remove such sites from our analyses, we used high-coverage Pool-Seq data derived from an independent sample of females in each species to determine falsely called Y-linked substitutions (see specific details in Avoiding Biases from Incorrectly Called Y-Linked Substitutions). For each species, all sites where an apparent Y-linked substitution was also found to be segregating in the donor region of the relevant female population data were classified as having arisen in the donor region before the transfer, rather than on the Y chromosome following the transfer. Removing all such sites from our outlier sets reduced the number of outlier SNPs to 836, 2,402, and 446 for D. mauritiana, D. melanogaster, and D. simulans, respectively.

  • vii)

    Y-linked transfers comprising several TEs could result from multiple TE translocations rather than a single transfer involving genic sequences. To exclude such occurrences from our final set of Y-linked transfers, we determined the proportion of SNPs lying within 1 kb of TEs and defined the final set of Y-linked transfers as those containing at least five outlier SNPs—after filtering for false positive outlier SNPs in step vi—where less than 20% of these SNPs were within 1 kb of TEs. After performing the preceding steps in each of the three species, we obtained a total of 66 incipient Y-linked transfers (28, 18, and 20 Y-linked transfers for D. mauritiana, D. melanogaster, and D. simulans, respectively).

  • viii)

    Several of the incipient Y-linked transfers had donor regions that were relatively close together, with many such transfers being proximal to heterochromatic regions, suggesting that they were originally part of a single transfer event. Thus, we considered all Y-linked transfers lying within 200 kb of one another as composing a single transfer. This resulted in 51 consensus Y-linked transfers overall (22, 13, and 16 Y-linked transfers for D. mauritiana, D. melanogaster, and D. simulans, respectively). Six of these transfers were shared between at least two of the three species (i.e., arose in the common ancestor; Identification of Shared Y-Linked Transfers), such that there were 45 unique Y-linked transfers in total, 25 of which contained at least one gene (i.e., were GeTYs). See Dataset S1 for a full list of the incipient and consensus transfers and Dataset S11 for all diagnostic outlier SNPs.
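
The core of steps ii, iii, and the 5 kb extension in step v can be sketched in Python as follows. This is a minimal illustration rather than the pipeline used in the study: the dictionary keys and function names are hypothetical, bin labels are assumed to be precomputed, the population standard deviation is an assumption (the text does not specify the estimator), and "within 5 kb" is read as a gap of at most 5,000 bp. In line with step ii, autosomal and X-linked SNPs would be passed to `standardize_by_bin` separately.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_by_bin(snps):
    """Step ii: add 'p_std' = (x - mean_j) / sd_j within each strain bin j.

    Each SNP is a dict with 'neglog_p_sex' (-log10 of the sex CMH P value)
    and 'bin' (label of the bin of its -log10 strain CMH P value).
    """
    by_bin = defaultdict(list)
    for snp in snps:
        by_bin[snp["bin"]].append(snp["neglog_p_sex"])
    for snp in snps:
        values = by_bin[snp["bin"]]
        sd = pstdev(values)  # assumption: population SD
        snp["p_std"] = (snp["neglog_p_sex"] - mean(values)) / sd if sd > 0 else 0.0
    return snps


def is_outlier(snp, min_p_std=2.0, min_male_af=0.10, max_female_af=0.001):
    """Step iii: P_STD >= 2, allele frequency > 10% in both males
    and <= 0.1% in both females."""
    return (
        snp["p_std"] >= min_p_std
        and all(af > min_male_af for af in snp["male_af"])
        and all(af <= max_female_af for af in snp["female_af"])
    )


def cluster_positions(positions, max_gap=5000):
    """Step v: chain outlier SNPs whose neighbours lie within max_gap bp."""
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= max_gap:
            clusters[-1].append(pos)  # extends the growing region
        else:
            clusters.append([pos])  # starts a new region
    return clusters
```

A region produced by `cluster_positions` would then still need to meet the minimum outlier SNP counts and the TE-proximity filter described in steps v to vii.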

Avoiding Biases from Incorrectly Called Y-Linked Substitutions.

As mentioned in step vi of our pipeline (outlined in Detection of Y-Linked Transfers), a potential problem with the identification of Y-linked outlier SNPs arises for sites that were polymorphic in the donor sequence at the time of the transfer. In some cases, these ancestral polymorphisms will lead to incorrectly ascertaining an outlier SNP, if the allele carried on the Y-linked transfer differs from the allele that is observed in the donor region of the two strains used in this study. To account for this category of false positive outlier SNPs, we surveyed a large population sample of female flies from each species for evidence of outlier SNPs whose Y-linked allele was segregating in the donor region. We defined such false positive outlier SNPs to have frequencies ≥0.02 and coverage ≥10 in species-specific female Pool-Seq data. The data used for D. mauritiana came from a sample of 107 isofemale lines reported in ref. 26. The D. melanogaster data were derived from 113 isofemale lines, originally collected from Portugal in July 2008 (32, 33). The D. simulans data were drawn from a large sample of 426 isofemale lines collected from Kanonkop, South Africa, in 2013 (the data are available at ENA under accession no. PRJEB22850). To avoid introducing bias into our analyses, all outlier SNPs with Y-linked alleles that were found to be segregating in female Pool-Seq data were either ignored or had the Y-linked allele converted to the donor nucleotide.
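
The false positive criterion can be written as a short Python check. The function and argument names are hypothetical, and treating low-coverage sites as "not segregating" is an assumption of this sketch; only the two thresholds (frequency ≥0.02 and coverage ≥10) come from the text.

```python
def segregates_in_females(allele_count, coverage, min_freq=0.02, min_cov=10):
    """Return True if the putative Y-linked allele segregates in female Pool-Seq data.

    allele_count: reads carrying the putative Y-linked allele at the site.
    coverage: total reads covering the site in the female pool.
    """
    if coverage < min_cov:
        return False  # assumption: too little coverage to flag the site
    return allele_count / coverage >= min_freq


# Hypothetical sites: (reads with the Y-linked allele, total coverage).
sites = {"site_a": (3, 100), "site_b": (1, 100), "site_c": (1, 5)}
false_positives = [name for name, (n, cov) in sites.items()
                   if segregates_in_females(n, cov)]
```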

False Discovery Rate Estimation.

To determine the number of falsely called GeTYs in our pipeline, we repeated the transfer detection pipeline having switched the role of the sexes at step iii; that is, outlier SNPs were now identified as those SNPs with PSTD,ij ≥ 2 and an allele frequency >10% in both females and ≤0.1% in both males. Notably, making females the focal sex resulted in a large reduction in the number of detected outlier SNPs compared with when males were the focal sex [percentage of female:male outlier SNPs at the end of step iii: 11.4% (376/3,320) in D. mauritiana, 0.8% (51/6,320) in D. melanogaster, and 7.5% (198/2,643) in D. simulans; Fig. 1B]. By the end of the pipeline, no GeTYs were detected for any of the three species when females were the focal sex. This is because none of the detected regions contained at least five SNPs (maximum = four SNPs). Thus, our method provides robust detection of GeTYs, with a false discovery rate that is indistinguishable from zero in the present study.

Fig. 1.


Schematic overview of the Y-linked transfer detection pipeline. In step 1, two separate Cochran–Mantel–Haenszel (CMH) tests were performed and then combined to identify outlier SNPs; the first CMH test contrasted sexes with the strains (different colored flies) as replicates (A), and the second CMH test contrasted the strains with sexes as replicates. For all three species, more outlier SNPs (StdDiff > 2) were detected for male-specific variants (M) than for female-specific variants (F) (B), indicating that our pipeline was accurately identifying Y transfers. In step 2, male-specific outlier SNPs are grouped into clusters (C). The annotation of genomic regions containing outlier SNPs is indicated by different color codes. Although GeTYs are broadly dispersed across the genome of each species, TE peaks typically cluster within heterochromatic regions. In step 3, reads containing Y-linked variants were used in the de novo assembly of the Y-linked haplotype for each incipient transfer. No anno, no recorded annotation; TE, transposable element. (D) The Integrative Genomics Viewer (IGV) screen shot for the read alignment (Bottom) and subsequent de novo assembled Y-linked haplotype [green (Velvet) and red (Transabyss) bars; Top], relative to the donor gene annotation for GeTY (blue bars; Top). The final step of the pipeline involved the iterative aggregation of incipient transfers lying within 200 kb of one another into a single consensus transfer.

Coverage-Based Validation of Transfers.

As an additional verification of the robustness of our Y transfer detection, we note that the donor regions of Y-linked transfers are expected to have higher mapping coverage in males than females, due to the additional mapping of Y-linked reads in such regions. To this end, we calculated the male:female weighted coverage ratio (WCR) for each SNP. The raw coverage of each SNP was weighted by dividing this value by the median coverage for either the autosomes or X chromosome, depending on the location of the SNP. This weighting accounts for coverage differences across the sexes and strains, and also between autosomal and X-linked loci in males, providing a standardized measure of coverage that is comparable across all loci. The median coverage was preferred over the mean coverage because the mean coverage values were positively skewed due to the presence of a long tail of large coverage values, whereas the median matched expectations (that is, the median X chromosome coverage was approximately half of the median autosomal coverage in males, and the median X chromosome and autosomal coverages were approximately equal in females across all species and strains). A single WCR value was then calculated for each SNP by dividing the weighted coverage for males by the concomitant female value for each strain and then averaging the result across the two strains. Finally, we generated the observed WCR for each incipient transfer by taking the median value across all outlier loci used to identify the region (again the median was preferred over the mean value, to better account for positive skew in the WCR distribution).
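
The WCR calculation can be sketched in Python as follows. The function names and the dictionary layout are hypothetical, and the coverage values in the usage example are idealized numbers chosen for illustration, not data from the study.

```python
from statistics import median


def weighted_coverage(raw_cov, chrom_median):
    """Weight raw SNP coverage by the median coverage of its chromosome class."""
    return raw_cov / chrom_median


def snp_wcr(per_strain):
    """Male:female weighted coverage ratio for one SNP, averaged across strains.

    per_strain: list of dicts holding raw SNP coverage ('male', 'female') and
    the matching median coverage ('male_median', 'female_median') for the
    autosomes or the X chromosome, whichever carries the SNP.
    """
    ratios = [
        weighted_coverage(s["male"], s["male_median"])
        / weighted_coverage(s["female"], s["female_median"])
        for s in per_strain
    ]
    return sum(ratios) / len(ratios)


def transfer_wcr(snp_wcrs):
    """Observed WCR of an incipient transfer: median over its outlier SNPs."""
    return median(snp_wcrs)


# Idealized autosomal donor: males carry two autosomal copies plus one Y-linked copy.
autosomal = [{"male": 60, "female": 40, "male_median": 40, "female_median": 40}] * 2
# Idealized X-linked donor: male X median coverage is half the female value.
x_linked = [{"male": 40, "female": 40, "male_median": 20, "female_median": 40}] * 2
wcr_autosomal = snp_wcr(autosomal)  # 1.5 under these idealized values
wcr_x = snp_wcr(x_linked)           # 2.0 under these idealized values
```

With these idealized inputs the sketch gives 1.5 for an autosomal donor and 2 for an X-linked donor, which are the values expected under the simplest single-copy transfer scenario.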

Naïve expectations suggest that the observed WCR for regions that did not result in a Y-linked transfer should be ∼1 (that is, males and females should exhibit similar coverage levels after correcting for sex-specific coverage differences), but for regions with Y-linked transfers the WCR is expected to be significantly larger than 1. Although under the simplest Y-linked transfer scenario the mean WCR is expected to be 1.5 for autosomal regions with transfers and 2 for X-linked regions, estimation of the WCR is complicated by several factors (which are synonymous with the factors affecting Y-linked allele frequency estimation discussed in step iii of Detection of Y-Linked Transfers). These factors include (i) positive mapping bias for reads from the donor copy (decreasing the WCR), (ii) the existence of additional copies of the Y-linked transfer on the Y chromosome (increasing the WCR) or on the autosomes or X chromosome (extra copies on these chromosomes should be shared between both sexes and therefore should lead to decreased WCR, particularly if there are extra copies on the X chromosome), and (iii) mapping noise (e.g., coverage variation across the sexes not being sufficiently accounted for after dividing by the median coverage, or coverage may differ for the Y-linked copies relative to the autosomal/X donor copies). Finally, we generated an approximate P value for each incipient transfer, by first calculating the quantile for every SNP with a coverage ≥10 across all four individuals of a given species, then taking the median quantile value across all SNPs for each incipient transfer. Because the WCR expectations differ for the autosomes and the X chromosome (1.5 vs. 2, respectively), the quantiles were calculated separately for autosomal and X-linked SNPs. 
By setting the largest WCR values to have the lowest quantile values, the resulting median quantile value for each incipient transfer provides a rough estimate of the probability of obtaining the observed WCR for this region, given the empirical distribution of these values. Because this statistic does not explicitly account for correlations due to linkage disequilibrium between neighboring SNPs, it can only be considered as approximate.
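
A minimal Python sketch of this approximate P value is given here. It assumes that a SNP's quantile is the fraction of SNPs in the empirical distribution with a WCR at least as large, which is one reasonable reading of the description; the function names and example values are hypothetical. Autosomal and X-linked SNPs would each get their own empirical distribution.

```python
from bisect import bisect_left
from statistics import median


def upper_tail_quantiles(all_wcr):
    """Build a function mapping a WCR value to the fraction of SNPs with WCR >= it."""
    ordered = sorted(all_wcr)
    n = len(ordered)

    def quantile(value):
        # bisect_left counts the SNPs with WCR strictly below `value`.
        return (n - bisect_left(ordered, value)) / n

    return quantile


def approximate_p(transfer_wcrs, quantile):
    """Median quantile across the SNPs of one incipient transfer."""
    return median(quantile(w) for w in transfer_wcrs)


# Hypothetical genome-wide WCR values for one chromosome class.
background = [0.8, 0.9, 1.0, 1.0, 1.1, 1.2, 1.5, 1.6, 2.0, 3.0]
quantile = upper_tail_quantiles(background)
p_approx = approximate_p([1.6, 2.0, 3.0], quantile)
```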

Our results reveal that all but one of the putative Y-linked transfers had an observed WCR greater than 1 (Dataset S1; values for individual diagnostic outlier SNPs can be found in Dataset S11), with more than half of the incipient transfers having P ≤ 0.05 (34 out of 66). Further, only one Y-linked transfer had median quantile >0.5, which is also the only transfer with an observed WCR < 1 (incipient name: 3R:2134959–2135737; consensus name: Dmau_3R_2.13; Dataset S1). Because this transfer only comprised six outlier SNPs, the corresponding WCR estimate is likely to be affected by noise. When looking across all SNPs, the WCR values are weakly correlated with the StdDist metric used to identify the GeTYs (Pearson’s correlation coefficient: 0.07, 0.11, and 0.15 for D. mauritiana, D. melanogaster, and D. simulans, respectively). Thus, the WCR appears to capture additional information about Y-linked transfers that is largely independent of the StdDist metric used to identify outlier SNPs and further supports the validity of the detected transfers.

PCR Validation of Y-Linked Transfers.

We used PCR to validate the Y-linked status of the transfers from two species, D. simulans and D. mauritiana, using the two strains that we used to identify the transfers (D. simulans: strain M3 and Mad5, D. mauritiana: strain Lab2 and RED7). For all PCRs, genomic DNA was extracted from individual virgin females and males of each strain, and an equal amount of DNA from each sex was used for PCR. PCR reactions were performed with 0.5 pmol primer in 10- or 20-μL reactions. The polymerase used for each PCR is indicated in Dataset S5. The standard cycling protocol was initial denaturation at 98 °C for 30 s followed by 32 cycles with 98 °C for 10 s, annealing temperature for 30 s, 72 °C for 30 s, and a final extension step at 72 °C for 5 min. Annealing temperature and modifications to the extension time for each fragment are indicated in Dataset S5.

For a handful of Y-linked transfers (Dsim_2R_4.35, Dsim_X_7.6, Dmau_2L_15.71, and Dmau_2R_18.4), we designed primers situated in the flanking regions of introns that were absent in the Y-linked haplotype but present in the autosomal/X-linked donor copy. Notably, these instances also serve as in vitro confirmation of Y-linked retrocopy transfers. Consequently, Y-linked retrocopies were confirmed when two distinct PCR bands were observed in males—a longer DNA segment for the parental ortholog and a shorter (intronless) DNA segment for the Y-linked haplotype—whereas only the longer amplicon was observed in females. For two of the putative Y-linked retrocopies in D. simulans (Dsim_2R_4.35 and Dsim_X_7.6), we extended the PCR to a global sample of 25 D. simulans isofemale lines (Dataset S6), in which the Y chromosomal retrocopy was found to be present across all strains. Genomic DNA was extracted from each isofemale line as described above, and PCR reactions were carried out with FireTaq polymerase (Solis BioDyne) at an annealing temperature of 55 °C.

For all other PCR validations, primers were designed that flanked an indel of at least 20 bp that was specific to either the Y-linked or autosomal copy, and PCR products were run on 2% agarose gels (8 cm in length) to resolve the size difference. The Y chromosomal copy was confirmed when a PCR product with the size of the autosomal/X-linked variant was amplified in females, and two PCR products approximating the size of the autosomal/X-linked and the Y-linked variant were detected in males. For GeTYs that were very similar in size to the autosomal copy and for which the Y chromosome-specific primers did not have perfect specificity—i.e., amplified a fragment of very similar size in females and males—the Y-specificity was inferred from a pronounced difference in amplification strength between females and males of both strains.

For the majority of fragments, we designed allele-specific primers targeting Y chromosome or autosome-specific SNPs. If multiple Y chromosomal and/or X chromosomal/autosomal haplotypes were present, primers were designed based on the sequence of two mates in a paired-end read to ensure that a specific haplotype was amplified. Primers were designed with Primer3 (bioinfo.ut.ee/primer3/) or manually with a subsequent quality control with Primer3 and the Integrated DNA Technologies (IDT) OligoAnalyzer 3.1 Tool (https://eu.idtdna.com/calc/analyzer). See Dataset S5 for all primer sequences. For all PCR assays, the sex status of all individuals was screened in parallel using a Y chromosome-specific kl-5 primer.

Evidence for Duplication of Y-Linked Transfers.

Two lines of evidence suggest that some of the Y-linked transfers had undergone additional bouts of duplication on the Y chromosome following the original transfer.

First, when designing primers for the PCR assays used to validate the Y-linked transfers in D. simulans and D. mauritiana, some regions were found to harbor reads coming from multiple distinct haplotypes (Dataset S5). These multihaplotype reads were typically restricted to males, suggesting that they were duplications of the Y-linked copy of the original transfer. Second, many of these putative multicopy Y-linked transfers also displayed highly significant WCR values that were elevated beyond the expected values (Dataset S1), providing independent evidence that these Y-linked transfers were duplicated. For instance, the incipient transfer shared between D. mauritiana and D. simulans (2R:9503872–9507919 and 2R:9412575–9416269; Dataset S1) has a mean WCR > 5 in both species, suggesting that several additional duplicated Y-linked copies arose following the original transfer of this region.

The putative duplicated Y-linked transfers (list in Table 1) could have been involved in gene conversion events (or even concerted evolution) or led to chimeric haplotypes being reconstructed for these transfers (Y-Linked Transfer Haplotype Reconstruction). In principle, such factors may have affected two downstream analyses that made explicit use of the reconstructed haplotypes, namely, estimation of transfer times (Estimating the Time of Transfer) and detection of purifying selection (Testing GeTY Functionality). Although it is difficult to predict exactly how such Y-linked duplications might affect these analyses, we note that none of the suspected duplicated transfers showed any evidence for purifying selection (Table 1 and Dataset S9), nor did the distribution of Y-linked transfer times change significantly after excluding these regions (Table 1 and Dataset S7). Thus, the findings from these two analyses were likely robust to posttransfer Y-linked duplication events.

Table 1.

Summary of Y-linked transfers

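| Species | Transfer ID | Transfer type | Donor(s) | ω | Expression | Age (Ky) |
| --- | --- | --- | --- | --- | --- | --- |
| D. mauritiana | Dmau_2L_6.65 | Ambig | Hrb27C | 1.02,Filt | NA | 673 [309,1885] |
| D. mauritiana | Dmau_2L_9.1 | Ambig | numb | nORF | NA | 460 [211,1287] |
| D. mauritiana | Dmau_2L_12.3 | DNA | CG5787;Pih1D1 | Filt;Filt,Filt | NA | 609 [280,1706] |
| D. mauritiana | Dmau_2L_15.71 | RNA | CG4455 | 1.38 | NA | 517 [237,1446] |
| D. mauritiana | Dmau_2R_0.15* | DNA | NA | NA | NA | 429 [197,1201] |
| D. mauritiana | Dmau_2R_4.35 | RNA | 14-3-3zeta | nORF | NA | 383 [176,1072] |
| D. mauritiana | Dmau_2R_8.65 | DNA | L;ttv;LamC | nORF;nORF;0.14 | NA | 153(240) [70(110),429(672)] |
| D. mauritiana | Dmau_2R_9.41* | RNA | SRPK | Filt | NA | 429 [197,1201] |
| D. mauritiana | Dmau_2R_13.28 | DNA | CG7229 | Filt | NA | 80 [37,224] |
| D. mauritiana | Dmau_2R_18.4 | RNA | CG3511 | 0.95 | NA | 147 [68,413] |
| D. mauritiana | Dmau_2R_19.04 | DNA | NA | NA | NA | 305 [140,854] |
| D. mauritiana | Dmau_3L_0.16 | RNA | CG13876 | 0.30 | NA | 158 [73,443] |
| D. mauritiana | Dmau_3L_2.15 | DNA | NA | NA | NA | 98 [45,273] |
| D. mauritiana | Dmau_3L_7.55 | DNA | CG7492;Ank2 | nORF;Filt | NA | 213 [98,597] |
| D. mauritiana | Dmau_3L_22.2 | DNA | NA | NA | NA | 957 [439,2679] |
| D. mauritiana | Dmau_3R_0.34* | DNA | NA | NA | NA | 380 [174,1064] |
| D. mauritiana | Dmau_3R_1.23 | DNA | Nmdar1;dmau_PG00479 | nORF;nORF | NA | 282 [130,790] |
| D. mauritiana | Dmau_3R_2.13 | DNA | NA | NA | NA | 28 [13,78] |
| D. mauritiana | Dmau_3R_13.94 | DNA | Tctp | 0.14 | NA | 282 [129,790] |
| D. mauritiana | Dmau_X_3.07 | DNA | CG16781;CG12206 | Filt,Filt;0.11 | NA | 683(778) [314(357),1913(2180)] |
| D. mauritiana | Dmau_X_8.42* | Ambig | His3.3B | 2.83 | NA | 436 [200,1221] |
| D. mauritiana | Dmau_X_20.09* | DNA | NA | NA | NA | 535 [246,1499] |
| D. melanogaster | Dmel_2L_4.46 | DNA | Gs1l;RpL27A | nORF;nORF | NA | 45 [21,125] |
| D. melanogaster | Dmel_2L_12.86 | DNA | NA | NA | NA | 174 [80,487] |
| D. melanogaster | Dmel_2L_19.94 | DNA | sick | nORF | NA | 559 [256,1564] |
| D. melanogaster | Dmel_2L_22.75 | DNA | NA | NA | NA | 697 [320,1951] |
| D. melanogaster | Dmel_2R_0.09 | DNA | NA | NA | NA | 624 [287,1748] |
| D. melanogaster | Dmel_2R_0.57 | DNA | NA | NA | NA | 303 [139,848] |
| D. melanogaster | Dmel_2R_2.32 | DNA | NA | NA | NA | 315 [145,883] |
| D. melanogaster | Dmel_3L_23.41 | DNA | NA | NA | NA | 443 [203,1241] |
| D. melanogaster | Dmel_3L_24.3 | DNA | NA | NA | NA | 343 [158,962] |
| D. melanogaster | Dmel_3R_17.04 | Ambig | CR43975 | nORF | NA | 516 [237,1444] |
| D. melanogaster | Dmel_3R_20.95 | DNA | vig2;Mocs2/CG42503;Clbn;Bili | 0.53;0.45;0.64;Filt | NA | 463(497) [213(391),1297(1391)] |
| D. melanogaster | Dmel_X_12.65 | DNA | ade5;CG12717 | nORF;Filt | NA | 725 [333,2031] |
| D. melanogaster | Dmel_X_12.66 | DNA | NA | NA | NA | 430 [197,1204] |
| D. simulans | Dsim_2L_11.91 | DNA | bru1 | Filt | 2/11 [3] | 169 [77,472] |
| D. simulans | Dsim_2L_12.3 | DNA | CG5787;Pih1D1 | Filt;nORF | 14/17 [4.9];4/6 [1.2] | 813 [373,2275] |
| D. simulans | Dsim_2L_15.71 | RNA | CG4455 | nORF | 22/22 [8.7] | 197 [90,551] |
| D. simulans | Dsim_2L_19.34 | DNA | NA | NA | NA | 351 [161,983] |
| D. simulans | Dsim_2R_0.06 | DNA | NA | NA | NA | 267 [122,747] |
| D. simulans | Dsim_2R_4.35 | RNA | 14-3-3zeta | Filt | 0/2 [18.5] | 90 [41,252] |
| D. simulans | Dsim_2R_9.41* | RNA | SRPK | Filt | 25/29 [575.6] | 204 [94,573] |
| D. simulans | Dsim_3L_7.55 | DNA | CG7492;Ank2 | nORF;0.63,0.62 | 8/9 [11.2];32/36 [6.6] | 383 [176,1072] |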
Dsim_3L_10.87 RNA Sod 0.09 1/8 [10.4] 182(477) [84(219),509(1335)]
Dsim_3L_22.2* DNA NA NA NA 408 [187,1142]
Dsim_3R_4.18 DNA NA NA NA 161 [74,452]
Dsim_3R_12.73 DNA NA NA NA 96 [44,269]
Dsim_3R_13.94* DNA Tctp nORF 0/30 [17.4] 197 [90,551]
Dsim_X_7.6 RNA Sdt Filt 11/30 [2.3] 374 [172,1047]
Dsim_X_15.22 RNA Cyp1 0.78 0/18 [4.5] 222 [102,623]
Dsim_X_20.2 DNA CG17450/CG32819/CG32820 Filt 0/8 [19.5] 268 [123,750]

A total of 45 unique Y-linked transfers were detected, arising either as retrocopies (RNA), as DNA translocations (DNA), or via an undetermined mode (Ambig). Twenty-five of the Y-linked transfers harbored at least one gene—i.e., were GeTYs [donor gene names in donor(s) column]—with six GeTYs being shared between species (italicized rows). In all columns, GeTYs comprising several genes have each gene name separated by semicolons, with those having identical gene models being separated by a forward slash. Purifying selection was detected for three GeTYs (ω column; genes having more than one detected ORF being further delineated by a comma), whereas others showed evidence of degeneration (ω column; nORF = no ORF; Filt = Y-linked or donor CDS lacked either a start or stop codon, contained an inactivating mutation, or >10% Y-linked codons were missing in donor alignment; Dataset S6). Two publicly available testes-specific RNA-Seq datasets (23, 24) revealed weak evidence for Y-linked GeTY expression, with most GeTYs having low relative expression (i.e., the fraction of diagnostic exonic sites where the Y-linked allele contributed >1% of the total coverage for that site; fraction shown in expression column) and low absolute expression (i.e., mean coverage of Y-linked alleles mapping to diagnostic exonic sites; values in square brackets in expression column). Point estimates of transfer times revealed evolutionarily recent origins, with the oldest transfer arising around 1 Mya (see age column; error margins in square brackets, age estimates for putatively functional genes after correcting for purifying selection shown in standard parentheses; Dataset S7). For some Y-linked transfers, multiple haplotypes were detected in male-specific short read data suggesting that these transfers likely underwent subsequent duplication on the Y chromosome, with GeTY Dsim_2R_9.41 also showing signs of an additional autosomal/X-linked duplication. 
Age and expression estimates may be unreliable for these Y-linked transfers.

*Putative duplicated transfers.

Postmultiple testing correction significance (q < 0.05).

Y-Linked Transfer Haplotype Reconstruction.

We performed de novo assembly for each of the 66 incipient transfers to reconstruct the transferred sequence and learn more about the timing and subsequent evolutionary history of each of the Y-linked transfers. First, reads containing these Y-specific SNPs that map to the donor region of each transfer were extracted along with their paired-end mates from the pool of male reads in both strains for each species using a custom python script, which is available on Dryad (doi:10.5061/dryad.8ph59). We used two separate de novo assembly algorithms to reconstruct the Y-specific haplotypes for each Y-linked transfer: trans-ABYSS v1.5.3 (19) and velvet v1.2.10 (34). For trans-ABYSS, assembly was performed at a variety of kmer lengths (28, 36, 44, 52, 60, 68, 76, 84, and 92) and otherwise default parameters. For velvet, we used velvetoptimizer v2.2.5 (https://github.com/tseemann/VelvetOptimiser) for kmer lengths ranging from 31 to 91 in steps of 10 and selection of the best contigs (parameters –x 10 –k ‘max’ –c ‘max’). After the contigs from each assembler had been generated, a single representative contig was chosen for each of the 66 incipient Y-linked transfers as being among the longest assemblies and showing a high-quality GMAP (genomic mapping and alignment program) (35) alignment score to the donor region (i.e., high q value). The resulting set of contigs is listed in Dataset S2.

Testing Quality of Y-Linked Transfer Haplotype Reconstruction.

The youngest functional Y-linked gene in D. melanogaster (FDY; parental ortholog = vig2) originated from the transfer of a single genomic region comprising five contiguous genes, with the other four genes (Mocs2, CG42503, Caliban, and Bili) being pseudogenized (9). Note that Mocs2 and CG42503 share the same gene model but have unique annotations, such that they can be regarded as a single gene for the purposes of the transfer event. Hence, we tested the efficiency of our approach to detect and reconstruct the Y-linked transfers by mapping our de novo assembled Y-linked contigs from this region against a 200-kb PacBio Y-linked contig that bears the full translocation (see ref. 9 for details on the contig). Because we were not able to generate a single contig representing the entire transfer, we took four separate large contigs that comprised independent subregions of the transfer (Dataset S2). All four of these contigs produced near-perfect alignments with the 200-kb PacBio Y-linked contig. The only discrepancy is a two-base pair insertion in our GeTY corresponding to FDY; nonetheless, this is situated in the middle of an intronic region and consequently is unlikely to inhibit function. These results confirm that our haplotype reconstruction method was of high quality and therefore representative of the Y-linked region.

Identification of Shared Y-Linked Transfers.

Having high-quality reconstructions of the Y-linked transfers facilitated the identification of transfers shared between two or more species that came from a single event in the common ancestor. To this end, all of the top contigs from across the three species were aligned against one another using BLAT (BLAST-like alignment tool) (36). Shared Y-linked transfers were then determined as those having subsequent alignments with at least 250 matched base pairs, greater than 90% similarity [matched bps/(matched bps + mismatched bps)] and lying more than 1 Mb from heterochromatic regions. The last step was included to avoid spurious hits caused by the increased levels of repeats and duplications lying adjacent to the heterochromatic regions. After performing these steps, a total of six Y-linked transfers were found to have arisen in a common ancestor of D. mauritiana and D. simulans, with all six being GeTYs (Dataset S1).
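The three criteria for calling a shared transfer can be sketched as a simple filter. This is a minimal illustration rather than the authors' script: the `Alignment` record and its field names are hypothetical, and the distance to the nearest heterochromatic region is assumed to have been computed beforehand.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical summary of one contig-vs-contig alignment."""
    matches: int                     # matched base pairs
    mismatches: int                  # mismatched base pairs
    dist_to_heterochromatin_bp: int  # assumed to be precomputed


def similarity(aln: Alignment) -> float:
    """Similarity as defined in the text: matched / (matched + mismatched)."""
    aligned = aln.matches + aln.mismatches
    return aln.matches / aligned if aligned > 0 else 0.0


def is_shared_transfer(aln: Alignment,
                       min_matches: int = 250,
                       min_similarity: float = 0.90,
                       min_dist_bp: int = 1_000_000) -> bool:
    """Apply the three thresholds: >=250 matched bp, >90% similarity, >1 Mb away."""
    return (aln.matches >= min_matches
            and similarity(aln) > min_similarity
            and aln.dist_to_heterochromatin_bp > min_dist_bp)


candidate = Alignment(matches=300, mismatches=10, dist_to_heterochromatin_bp=2_000_000)
print(is_shared_transfer(candidate))
```

Keeping the thresholds as keyword arguments makes it easy to check how sensitive the set of shared transfers is to each cutoff.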

Donor Gene Enrichment Analyses.

We used the Database for Annotation, Visualization, and Integrated Discovery (DAVID) v6.8 (37, 38) to test if GeTY donor genes were overrepresented in particular gene functions/pathways. To avoid spurious enrichment results, for each instance where two or more donor genes had identical gene models but unique annotations, we chose only one of these genes to represent the region (Mocs2 for Mocs2/CG42503 and CG17450 for CG17450/CG32819/CG32820). This resulted in a set of 36 genes that were used for the DAVID enrichment tests. Separate enrichment tests were performed for biological processes, cellular components, and molecular function, with categories in each being assessed as significant if they had an FDR < 0.05. Only one category, cytoplasm (GO:0005737) in cellular components, was found to be significant (Dataset S3). Unfortunately, the broadness of this category precludes making further conclusions about the nature of the GeTY donor genes.

We also tested whether these 36 GeTY donor genes were specifically enriched for male-biased genes, using a previously published list of male-biased genes (39) derived from the National Human Genome Research Institute Model Organism Encyclopedia of DNA Elements (modENCODE) D. melanogaster dataset (40). This list contains all D. melanogaster genes that show significant male-biased expression for one or more of nine different metrics, which include both whole-body and gonad-specific measures (39). A total of 5,826 genes (42.8% of all D. melanogaster genes) showed evidence for male-biased expression for at least one of the nine metrics (39), compared with 10 of the 36 (25.6%) GeTY donor genes, a difference consistent with chance (χ2 test; P = 0.10; Dataset S4). Similar comparisons were performed for the subset of genes with evidence for at least two of the male bias metrics, at least three male bias metrics, and so on, up to those with evidence for all nine of the male bias metrics. In no case was there a significant excess (nor deficit) of male-biased GeTY donor genes (Dataset S4). Thus, there is no evidence that the Y-linked gene transfers observed in this study are more likely to have male-biased expression than expected by chance.
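A comparison of this kind can be sketched as a χ2 test on a 2 × 2 table. The text states only that a χ2 test was used, so the table layout and the optional Yates continuity correction below are assumptions, as is the function name. The P value for 1 degree of freedom is obtained from the complementary error function, which avoids any dependency beyond the standard library.

```python
import math


def chi2_2x2(a: int, b: int, c: int, d: int, yates: bool = True):
    """Chi-square test (1 df) on the 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value). With yates=True, 0.5 is subtracted from
    each |observed - expected| (floored at zero) before squaring.
    """
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    if min(rows) == 0 or min(cols) == 0:
        raise ValueError("every row and column must have a nonzero total")
    observed = ((a, b), (c, d))
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            diff = abs(observed[i][j] - expected)
            if yates:
                diff = max(0.0, diff - 0.5)
            stat += diff * diff / expected
    # Survival function of a chi-square variable with 1 df.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Example input: 10 male-biased and 26 other donor genes, against 5,826
# male-biased genes and a remainder back-calculated from the 42.8% figure
# (the background total is an assumption, not a value given in the text).
stat, p = chi2_2x2(10, 26, 5826, 7786)
print(f"chi2 = {stat:.2f}, P = {p:.3f}")
```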

In Silico Retrocopy Verification.

To verify the nine suspected Y-linked retrocopies, we repurposed an RNA-mapping algorithm to confirm the missing introns in the Illumina short DNA reads. If the DNA reads truly originate from retrocopies, then we expect to see split read alignments to the parental ortholog that specifically span the spliced-out introns specific to each retrocopy. For each GeTY, we used the read pairs harboring Y-linked mutations described above (Y-Linked Transfer Haplotype Reconstruction) and remapped them to the species-specific reference using GSNAP (Genomic Short-read Nucleotide Alignment Program) (version 2014-05-15, flags -N 1 -t 20 -A sam) (35). The remapped reads were then used to generate Sashimi plots (41) in IGV (42), a graphic display used in RNA expression studies to show isoform frequencies by using gapped read information to quantify how often specific introns are skipped. In this case, the number of gapped reads spanning an intron provides further evidence that the read came from the Y-linked DNA haplotype lacking that intron (that is, it arose from a retrocopy). Seven retrocopies were identified this way (Dmau_2L_15.71/Dsim_2L_15.71, Dmau_2R_4.35/Dsim_2R_4.35, Dmau_2R_9.41/Dsim_2R_9.41, Dmau_2R_18.4, Dsim_3L_10.87, Dsim_X_7.6, and Dsim_X_15.22; Fig. S2). Additionally, for GeTYs coming from single-exon genes (which consequently lack introns), retrocopy status was also conferred to those GeTYs that had reads aligning entirely within the donor exon and ending abruptly at the edge of the annotated sequence (that is, reads did not continue into the neighboring intergenic region, after allowing for mapping artifacts). This resulted in one more GeTY, Dmau_3L_0.16, being called as a retrocopy (Fig. S3A). One other GeTY, Dmel_3R_17.04, had a large number of reads that terminated in an intergenic region only 9 bp from the 3′ end of the annotated parental gene (Fig. S3B). Notably, the parental gene is currently designated as a pseudogene in FlyBase and lacks any UTR annotations. Given the uncertainty about the status of this GeTY, we took a conservative approach and classified it as ambiguous (i.e., insufficient evidence for classification as a retrocopy or DNA-based translocation).
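The gapped-read evidence that a Sashimi plot displays can also be tallied directly from SAM records. The sketch below is an illustration under stated assumptions, not the authors' procedure: it assumes spliced alignments encode skipped introns as `N` operations in the CIGAR string (as in the SAM format), takes reads as hypothetical `(position, CIGAR)` pairs, and uses an arbitrary 2-bp tolerance when matching a gap to an annotated intron.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = {"M", "D", "N", "=", "X"}  # operations that advance the reference


def spliced_gaps(pos: int, cigar: str):
    """Return reference intervals (1-based, inclusive) skipped by N operations."""
    gaps = []
    ref = pos
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "N":
            gaps.append((ref, ref + n - 1))
        if op in REF_CONSUMING:
            ref += n
    return gaps


def count_intron_spanning_reads(reads, intron_start: int, intron_end: int,
                                tolerance: int = 2) -> int:
    """Count reads having a gap whose ends lie within `tolerance` bp of the intron."""
    count = 0
    for pos, cigar in reads:
        for start, end in spliced_gaps(pos, cigar):
            if (abs(start - intron_start) <= tolerance
                    and abs(end - intron_end) <= tolerance):
                count += 1
                break  # count each read at most once
    return count


# Toy input: two reads skip positions 150-349, one read is unspliced.
toy_reads = [(100, "50M200N50M"), (120, "30M200N70M"), (400, "100M")]
print(count_intron_spanning_reads(toy_reads, 150, 349))
```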

Fig. S2.

Retrocopy validation. Y-linked retrocopies were validated using a mixture of in silico and in vitro techniques. Sashimi plots depicting GMAP alignments of reads carrying Y-linked variants from (A) GeTYs Dsim_3L_10.87 and (B) Dsim_X_15.22. Read depth is shown on the y axis, and chromosomal position is shown on the x axis. The donor gene annotation is shown at the base of each panel (blue bars; thin lines indicate introns, internal bars indicate exons, and thick terminal bars indicate UTRs). In both cases a large number of reads are split across an annotated intron in the donor copy (see values positioned within the curved black lines), indicating that the intron is missing in the Y-linked copy that arose from a retrocopy transfer. (C) Results from PCR assays of two GeTYs in D. simulans (lanes 2–5) and one in D. mauritiana (two different introns; lanes 6–9). Products were taken from both males (♂) and females (♀), with males showing an additional unique band that is consistent with an intronless GeTY. A GeneRuler 100 bp Ladder (Fermentas) is shown in lanes 1 and 10.

Fig. S3.

Intronless retrocopy validation. IGV screen capture showing the GMAP alignments of Y-linked reads to their donor genes for two GeTYs. (A) A GeTY in D. mauritiana that has reads that terminate abruptly at the end of the annotated gene region, indicative of a retrocopy. (B) The GeTY has reads that terminate 9 bp downstream of the 3′ end of the annotation, whereby this GeTY was conservatively called as ambiguous (that is, it cannot be confidently assigned as either a DNA or an RNA transfer).

Estimating the Time of Transfer.

The timing of each Y-linked transfer event was estimated using the divergence between the Y-linked and donor copies for the 51 consensus Y-linked transfers (i.e., the de novo assembled incipient transfers after combining those inferred to come from a single transfer). First, divergence was estimated by aligning the incipient Y-linked haplotypes to the respective donor region. The alignment was performed using GMAP (35) (v. 2014-05-15, flags –t 10 –A –f), which explicitly accounts for the possibility of missing introns (particularly important for retrocopies). Divergence was calculated as the total number of mismatches in the alignment divided by the total alignment length. To avoid upwardly biasing age estimates, we ignored all mismatches that were not called as an SNP in the two male strains used for each species to identify the Y-linked transfers. Manual inspection of alignments revealed that such mismatches were likely due to misalignments between the reconstructed Y haplotype and the donor copy. We also ignored mismatches that were found to be polymorphic in species-specific female Pool-Seq data because these represent ancestral polymorphisms and not Y-linked substitutions (Avoiding Biases from Incorrectly Called Y-Linked Substitutions). After masking all such mismatches, a simple point estimate for the time of transfer was calculated as t = (s/L)/(2μ), where t is time in years, s is the adjusted number of substitutions, L is the length of the alignment (i.e., s/L = proportion of diverged sites), and μ denotes the nucleotide mutation rate per site per year. The per-generation mutation rate estimate of 2.8 × 10−9 mutations per site was taken from Keightley et al. (20) and multiplied by 10 (i.e., assuming 10 generations per year) to get the expected number of mutations per site per year. We used the standard errors (1 × 10−9 and 6.1 × 10−9) from the mutation rate estimates to generate the error associated with our dating. For consensus transfers that combined two or more incipient transfers, the divergence (d = s/L) was tallied across the separate incipient transfers. The results for each consensus sequence are shown in Dataset S7.
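The dating calculation can be sketched as follows, assuming the point estimate takes the form t = (s/L)/(2μ) with μ expressed per site per year. The per-generation rate, its bounds, and the number of generations per year are passed in as arguments rather than fixed, and the function names and example inputs are illustrative, not taken from the authors' scripts.

```python
def transfer_age_years(s: float, aln_len: float, mu_per_gen: float,
                       gens_per_year: float = 10.0) -> float:
    """Point estimate t = (s / L) / (2 * mu), with mu converted to a per-year rate."""
    if aln_len <= 0 or mu_per_gen <= 0 or gens_per_year <= 0:
        raise ValueError("alignment length and rates must be positive")
    divergence = s / aln_len                  # proportion of diverged sites
    mu_per_year = mu_per_gen * gens_per_year  # assumed conversion to a yearly rate
    return divergence / (2.0 * mu_per_year)


def transfer_age_range(s: float, aln_len: float, mu: float,
                       mu_low: float, mu_high: float,
                       gens_per_year: float = 10.0):
    """Return (point, youngest, oldest); a higher mutation rate gives a younger age."""
    point = transfer_age_years(s, aln_len, mu, gens_per_year)
    youngest = transfer_age_years(s, aln_len, mu_high, gens_per_year)
    oldest = transfer_age_years(s, aln_len, mu_low, gens_per_year)
    return point, youngest, oldest


# Illustrative inputs only: 20 substitutions over a 2,000-bp alignment.
# Substitute the per-generation rate and bounds cited in the text.
point, youngest, oldest = transfer_age_range(
    s=20, aln_len=2000, mu=2.8e-9, mu_low=1.0e-9, mu_high=6.1e-9)
print(f"{point / 1e3:.0f} Ky [{youngest / 1e3:.0f}, {oldest / 1e3:.0f}]")
```

Because the age scales inversely with μ, the width of the error interval relative to the point estimate is the same for every transfer, whatever its divergence.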

A caveat of our method is that we are unlikely to be able to detect relatively recent transfer events, because very young GeTYs will not have accumulated sufficient differences from the donor sequence to be recognized in our analysis (we require at least five differences between donor and transferred region; Detection of Y-Linked Transfers). Also, some age estimates are likely to be downwardly biased for GeTYs due to the activity of purifying selection in the donor region and possibly the Y-linked region. To account for the downward bias in GeTY age estimation due to purifying selection, we recalculated the age of four GeTYs that had evidence of negative selection (Testing GeTY Functionality). This was done by using PAML (Phylogenetic Analysis by Maximum Likelihood) (22) to estimate the number of potential nonsynonymous sites (N), and the proportion of mutated nonsynonymous and synonymous sites (dN and dS, respectively), for each of the significant GeTYs. Because under neutrality, dN and dS have the same expectation, we calculated the number of mutated sites that were removed by purifying selection as N(dS − dN), given that the number of mutated nonsynonymous sites expected under neutrality equals N·dS and the number of observed mutated nonsynonymous sites equals N·dN, such that the missing number of nonsynonymous sites is the difference between these two values. This number of sites was then added to the total number of divergent sites previously estimated for each GeTY based on the GMAP alignment, and the age estimates were recalculated as above.
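The correction for purifying selection amounts to adding N(dS − dN) substitutions back to the observed count before re-dating. A minimal sketch, with hypothetical function names and example values; flooring the correction at zero when dN exceeds dS is an assumption of this sketch, not something stated in the text.

```python
def substitutions_removed_by_selection(n_nonsyn_sites: float,
                                       dn: float, ds: float) -> float:
    """Nonsynonymous substitutions presumed lost to selection: N * (dS - dN)."""
    return max(0.0, n_nonsyn_sites * (ds - dn))


def corrected_substitutions(observed_subs: float, n_nonsyn_sites: float,
                            dn: float, ds: float) -> float:
    """Observed divergent sites plus those presumed removed by purifying selection."""
    return observed_subs + substitutions_removed_by_selection(n_nonsyn_sites, dn, ds)


# Illustrative values: 1,000 nonsynonymous sites, dN = 0.01, dS = 0.05,
# and 20 divergent sites observed in the alignment.
print(corrected_substitutions(20, 1000, dn=0.01, ds=0.05))
```

The corrected count then replaces s in the dating formula, which is why the corrected ages in Table 1 are always at least as old as the uncorrected ones.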

Finally, we performed a Kolmogorov–Smirnov test to examine if there were any differences in the age distribution of GeTYs and nongenic transfers. The six shared GeTYs were assigned the mean age of both species, except in the case of shared GeTY 2R_9.41 where only the D. mauritiana value was used (the aging of Dsim GeTY 2R_9.41 being unreliable due to a probable back-transfer of the Y-linked region to an autosome/X chromosome in this species; Y-Linked Gene Back-Transfer). Many of the six GeTYs that were shared between D. mauritiana and D. simulans had estimated ages that occurred after the split date between the two species, which suggests that our age estimates for some transfers may have been negatively biased by factors such as gene conversion and purifying selection. Therefore, we performed tests before and after correction for the effects of purifying selection (Avoiding Biases from Incorrectly Called Y-Linked Substitutions) and also with and without GeTYs involved in additional Y-linked duplications (Coverage-Based Validation of Transfers) because the latter may also have affected our estimates [e.g., paralogous Y-linked copies provide additional opportunities for gene conversion and concerted evolution (43)]. None of the resulting tests were significant (all transfers, not adjusting for selection, P = 0.58; all transfers, adjusting for selection, P = 0.80; no duplicate transfers, not adjusting for selection, P = 0.80; no duplicate transfers, adjusting for selection, P = 0.74). The mean age values for these different categorizations of the transfers are shown in Dataset S7.
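The quantity compared by a two-sample Kolmogorov–Smirnov test is the largest gap between the two empirical cumulative distributions. The sketch below computes only that statistic using the standard library; it does not compute a P value, and it is not a reconstruction of the software the authors used. The example ages are the D. melanogaster point estimates from Table 1 and serve only as sample input, since the reported tests pooled transfers across species.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b) -> float:
    """Two-sample KS statistic: maximum distance between the empirical CDFs."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a = sorted(sample_a)
    b = sorted(sample_b)
    d_max = 0.0
    for value in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, value) / len(a)  # fraction of a <= value
        cdf_b = bisect_right(b, value) / len(b)  # fraction of b <= value
        d_max = max(d_max, abs(cdf_a - cdf_b))
    return d_max


# Example input (Ky): D. melanogaster GeTYs versus nongenic transfers.
genic_ages = [45, 559, 516, 463, 725]
nongenic_ages = [174, 697, 624, 303, 315, 443, 343, 430]
print(round(ks_statistic(genic_ages, nongenic_ages), 3))
```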

Y-Linked Gene Back-Transfer.

A notable issue with the point estimation of the gene transfer events was that the estimated ages of the six transfers shared between D. mauritiana and D. simulans often differ by several hundred thousand years, and several postdate the reported split date between the two species (∼0.24 Mya; ref. 5). Although many of these differences may be accounted for by the activity of purifying selection for some period in the past, we noted some additional abnormalities in the case of Dsim_2R_9.41 and Dmau_2R_9.41: many of the putative divergent sites for this GeTY were found to be polymorphic in the female Pool-Seq data for D. simulans (74 out of 115 divergent sites), but this was not the case for D. mauritiana (only 1 out of 100) (Dataset S7), suggesting that another factor may have led to these disparate ages. Two scenarios stand out: (i) the Y-linked copy was subsequently transferred elsewhere in the genome or (ii) there was a gene conversion event from the Y-linked retrocopy to the donor gene in D. simulans (SRPK gene on 2R). Either of these scenarios would mean that our age estimate for the GeTY in D. simulans was downwardly biased (in addition to any role that purifying selection may have had) as a result of resetting all Y-linked substitutions to the allele on the donor copy where these were found to be segregating in a large female population sample.

To investigate this further, we used GSNAP to align short read data from 35 different D. simulans females to the M252 reference genome. Each female was derived from a cross between an inbred Florida line and the M252 reference genome strain (44, 45). Two of these D. simulans females (I027 and I211) showed evidence for scenario 1, i.e., that GeTY Dsim_2R_9.41 had transferred back to the autosome or X chromosome (Fig. S1). First, many of the diagnostic Y-linked SNPs for Dsim_2R_9.41 were found segregating in the two female crosses, but this was not the case for other GeTYs (ruling out the possibility that these two females were actually genetically male). Second, several reads mapping to the donor gene SRPK in both female crosses spanned the same intron that is also missing in GeTY Dsim_2R_9.41 (and also Dmau_2R_9.41), indicating that the retrocopy underlying Dsim_2R_9.41 must have been transferred to another chromosome in these females (Fig. S1). Finally, the mapping coverage was higher for regions of SRPK that are orthologous with GeTY Dsim_2R_9.41 than in the flanking regions in each of the female crosses, suggesting that the additional reads arose from a transposed copy of Dsim_2R_9.41 rather than a Y-to-autosome gene conversion event (the latter event is not expected to change the coverage relative to the background). Several Y-linked alleles specific to GeTY Dsim_2R_9.41 were also found segregating in both female crosses (data). This confirms that the donor gene SRPK first transferred to the Y chromosome in the common ancestor of D. simulans and D. mauritiana, before transferring to another chromosome in D. simulans later, rather than occurring first as an autosomal/X transfer before transferring to the Y chromosome in the common ancestor.

Fig. S1.

Autosomal/X-linked GeTY transfer. IGV screenshot showing evidence for putative Y-to-autosome/X chromosome gene transfer involving GeTY Dsim_2R_9.41. The first panel shows the annotations for the donor regions and the alignment of the GeTY Dsim_2R_9.41. The second through fourth panels show DNA read alignment and splicing patterns for two D. simulans female crosses (second and third panels) and the combined data for the two strains used to detect the GeTYs (fourth panel). The splicing patterns show that the intron that was absent in the GeTY Dsim_2R_9.41, a retrocopy, is also absent in some of the reads from the two D. simulans females. The higher coverage of the exonic regions in these females implies that GeTY Dsim_2R_9.41 was involved in a Y-to-autosome transfer that is present in these lines.

Testing GeTY Functionality.

Each of the incipient GeTYs (Dataset S1) was tested for selection by determining if ω (dN/dS) was significantly different from neutrality (i.e., ω ≠ 1). AUGUSTUS v3.1 (21) was used to search for ORFs in all Y-linked haplotypes (settings: –species = fly –strand = both –singlestrand = false –genemodel = partial –codingseq = on –sample = 100 –keep_viterbi = true –alternatives-from-sampling = true –minexonintronprob = 0.2 –minmeanexonintronprob = 0.5 –maxtracks = 2 /data/www/augustus/tmp/AUG-1655008088/input.fa–exonnames = on). ORFs were identified for 24 of the 34 incipient GeTYs, with some GeTYs having more than one ORF detected (Dataset S8). For each predicted ORF, we determined the paralogous coding sequence (CDS) in the focal species and also D. yakuba (with the latter being used to determine the ancestral donor sequence that was used in the subsequent selection tests; see below), by using Exonerate (46) to align the predicted ORF to the reference genome for each species (v. 2.2.0; flags: –model protein2genome –bestn 1). To correct for incorrectly called Y-linked variants arising from ancestral donor copy polymorphisms (Avoiding Biases from Incorrectly Called Y-Linked Substitutions), all Y-linked substitutions that were found to be polymorphic in female Pool-Seq data were converted to the allele carried by the donor copy. After making these corrections, all three coding sequences were realigned using PRANK (Probabilistic Alignment Kit) (47, 48) (v. 140603; default settings).

To ensure reliable ω estimates, we discarded any predicted Y-linked ORFs that lacked a complete coding sequence (that is, either the start codon or the stop codon was not part of the AUGUSTUS-predicted Y-linked ORF). Further, we only retained ORFs that had at least 90% of their codons aligning to each of the donor and D. yakuba gene copies and lacked any frameshift mutations or premature stop codons in these alignments. The GeTY Dsim_2R_9.41 was also excluded on the basis that any signal of selection would most likely be diluted by an additional back-transfer to another chromosome (Y-Linked Gene Back-Transfer). After these filtering steps, 15 ORFs were retained, which came from eight GeTYs from D. mauritiana, three from D. simulans, and one from D. melanogaster (some GeTYs having more than one predicted ORF; Table 1 and Dataset S8).
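The ORF filters can be sketched as a single predicate. This is a simplified illustration under stated assumptions: it uses the standard genetic code, treats a CDS length that is not a multiple of three as a proxy for a frameshift, and takes the fraction of aligned codons as a precomputed input, whereas the actual checks were made on the PRANK alignments. The function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def passes_orf_filters(cds: str, aligned_codon_fraction: float,
                       min_aligned: float = 0.90) -> bool:
    """Keep an ORF only if it is complete, in frame, and well aligned."""
    cds = cds.upper()
    if len(cds) < 6 or len(cds) % 3 != 0:
        return False  # too short or out of frame (frameshift proxy)
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    if codons[0] != "ATG" or codons[-1] not in STOP_CODONS:
        return False  # missing start or stop codon
    if any(codon in STOP_CODONS for codon in codons[1:-1]):
        return False  # premature stop codon
    return aligned_codon_fraction >= min_aligned


print(passes_orf_filters("ATGAAAGGCTAA", aligned_codon_fraction=0.95))
```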

For each of the remaining GeTYs, we used CODEML (PAML v. 4.8a; ref. 22) to determine the ancestral CDS occurring at the node immediately before the Y-linked transfer, by using a phylogeny including the Y-linked and donor sequences, along with the orthologous sequence from D. yakuba, and setting RateAncestor = 1 to estimate the ancestral sequence. We then used the pairwise comparison option of CODEML (i.e., runmode = −2) to estimate ω by contrasting the Y-linked and ancestral coding sequences. Comparing the Y-linked sequence against the putative ancestral sequence facilitates estimation of the synonymous and nonsynonymous mutation rates specific to the Y-linked lineage following the transfer. Purifying selection was tested by contrasting two models, one model (M0) where ω was fixed at 1 (i.e., enforcing neutral evolution) and a second model (M1) where ω was free to vary (M0 parameters: fix_omega = 1, omega = 1; M1 parameters: fix_omega = 0, omega = 0.5; shared parameters: model = 0, noisy = 9, runmode = −2, seqtype = 1, clock = 0, NSsites = 0, icode = 0, fix_kappa = 0, kappa = 1, fix_alpha = 1, alpha = 0, Malpha = 0, CodonFreq = 2, getSE = 0, RateAncestor = 0, method = 0). Likelihood ratio tests for selection were subsequently performed for each GeTY, by assuming that twice the difference between the log likelihoods of the two models is χ2 distributed with 1 degree of freedom. To account for multiple testing, the resulting P values were transformed to q values using the qvalue R package (49). The results for the ω estimation and likelihood tests are provided in Dataset S9. We note that none of the GeTYs that were significant in these tests show any evidence of Y-linked gene duplication [all with mean WCR ratios ≤ 1.610 (Dataset S1) and none had multiple male-specific Y-linked haplotypes (Dataset S6)], suggesting that these results were not an artifact of Y-linked gene conversion or concerted evolution.
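The likelihood ratio test described here can be sketched in a few lines once the two log likelihoods have been read from the CODEML output. The function name and the example log likelihoods are hypothetical; the P value uses the closed form for a χ2 distribution with 1 degree of freedom, and the conversion to q values is not reproduced.

```python
import math


def lrt_pvalue(lnl_null: float, lnl_alt: float):
    """Likelihood ratio test with 1 df.

    lnl_null: log likelihood with omega fixed at 1 (M0 in the text).
    lnl_alt:  log likelihood with omega free to vary (M1 in the text).
    Returns (statistic, p_value).
    """
    stat = max(0.0, 2.0 * (lnl_alt - lnl_null))  # floor guards against rounding
    p_value = math.erfc(math.sqrt(stat / 2.0))   # chi-square survival function, 1 df
    return stat, p_value


# Hypothetical log likelihoods for one GeTY.
stat, p = lrt_pvalue(lnl_null=-1250.40, lnl_alt=-1245.10)
print(f"2*delta(lnL) = {stat:.2f}, P = {p:.4f}")
```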

GeTY Expression.

We quantified the level of expression of the D. simulans GeTYs using two publicly available testes-specific RNA-Seq datasets. Data for the D. simulans w501 strain were taken from a study by Rogers et al. (50) (Sequence Read Archive (SRA) accession numbers SRR1520537, SRR1548740, and SRR1548741). The second data set comprised three replicates of RNA-Seq data from the testes and reproductive tracts of a D. simulans strain from Nanyuki, Kenya (stock number 14021-0251.199), that have been generated and made available by the modENCODE Consortium (51, 52) (SRA accession numbers SRR330571, SRR330572, and SRR330573). To measure the level of expression of the Y-linked alleles, all reads from the testes datasets were combined and aligned to the D. simulans M252 reference genome (24) without the U contigs using GSNAP with the option -N 1 to infer novel splicing events (35). SNPs were called using the PoPoolation2 (29) pipeline, with reads having a mapping quality less than 20 being discarded. For each GeTY, we discarded positions other than those used in the haplotype reconstruction (i.e., SNPs that were polymorphic in female population sample were discarded). Additionally, SNPs that lay outside of exons were also discarded. This resulted in a final set of high-quality exonic SNPs for each GeTY.

To determine whether a GeTY was expressed, we quantified both the absolute and relative number of reads carrying Y-linked alleles mapping to each variant position (for a given variant, the relative count equals the number of reads carrying Y-linked alleles divided by the total number of reads mapping to that variant). Although the absolute number of reads provides a direct expression assay, it does not take into account sequencing errors, which will become more problematic as the coverage for the variant increases. However, relying solely on the relative count will miss lowly expressed genes because these may fall under the sequencing error threshold. The testes expression data reveal that most of the Y-linked genes have low levels of absolute expression, typically on the order of tens of reads (relative to hundreds to thousands of reads for the donor copies; Table 1, Fig. S4, and Dataset S10), such that most Y-linked genes also show low relative expression (i.e., the fraction of diagnostic sites at which the Y-linked allele contributes >1% of the total coverage; here we assume a 1% error rate for base calls in Illumina sequencing, which thus provides a threshold for distinguishing between sequencing artifacts and true expression), providing weak evidence for Y-linked expression overall. Notably, there is no evidence for expression in the putatively functional Y-linked transfer in D. simulans, GeTY Dsim_3L_10.87 (Table 1 and Dataset S10). The absence of evidence for Y-linked expression in these lines does not then necessarily imply that the GeTY is nonfunctional: this Y-linked gene (and others) may be expressed in tissues other than the testes or may only be expressed in the testes under certain environmental conditions.
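The two expression summaries reported in Table 1 (a fraction of diagnostic sites and a mean coverage in square brackets) can be sketched as below. The input layout, the function name, and the choice to average Y-linked coverage over all covered diagnostic sites (including those with zero Y-linked reads) are assumptions of this sketch.

```python
def expression_summary(sites, error_rate: float = 0.01):
    """Summarize Y-linked expression over diagnostic exonic sites.

    sites: iterable of (y_reads, total_reads) per diagnostic site.
    Returns (n_sites_above_threshold, n_sites, mean_y_coverage).
    """
    covered = [(y, total) for y, total in sites if total > 0]
    if not covered:
        return 0, 0, 0.0
    # Relative expression: sites where the Y-linked allele exceeds the error rate.
    n_expressed = sum(1 for y, total in covered if y / total > error_rate)
    # Absolute expression: mean number of reads carrying the Y-linked allele.
    mean_y_coverage = sum(y for y, _ in covered) / len(covered)
    return n_expressed, len(covered), mean_y_coverage


# Toy input: four diagnostic sites as (Y-linked reads, total reads).
toy_sites = [(5, 100), (0, 200), (1, 300), (30, 60)]
n_expressed, n_sites, mean_cov = expression_summary(toy_sites)
print(f"{n_expressed}/{n_sites} [{mean_cov:.1f}]")
```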

Fig. S4.

GeTY expression patterns. Box plot showing the log10 transformed absolute expression (i.e., coverage depth) for donor and Y-linked copies of the GeTYs detected in D. simulans. For each GeTY, expression was quantified for each exonic diagnostic outlier SNP.

Results

Y-Linked Transfer Properties and Pipeline Validation.

Our method detected numerous putative Y-specific sequences that mapped to feminized reference genomes from three Drosophila species (Fig. 1). Consistent with the high repeat content of the Y chromosome, many clusters of SNPs were located in or near heterochromatic regions that overlapped transposable elements (TEs) (Fig. 1). Restricting our analysis to regions that lacked TEs and contained at least five Y-specific SNPs resulted in 66 incipient Y-linked transfers. After combining incipient transfers from closely neighboring regions, we obtained 45 unique Y-linked consensus transfers across the three species (Table 1 and Datasets S1 and S2). Twenty-five of these consensus transfers were GeTYs, including six that were shared between D. simulans and D. mauritiana (Fig. 2). The set of donor genes underlying the Y-linked transfers was broadly dispersed over the genome and was not enriched with respect to functional category or male-biased expression (Datasets S3 and S4).

Fig. 2.

Origin of Y-linked gene transfers. Retrocopies (circles), DNA translocations (diamonds), and ambiguous transfers (squares) are indicated on the inferred branch of origin in the D. melanogaster clade. Divergence times are shown at the red nodes. Shared GeTYs are only found in the D. simulans clade. The D. simulans clade also contains a significant excess of GeTYs relative to D. melanogaster, which appears to be primarily driven by a surplus of retrocopy transfers. Note that the branch lengths are not to scale; both the D. melanogaster and D. simulans clade branches are truncated (depicted by the break points).

To validate our analytical pipeline, we performed a combination of in vitro and in silico tests. First, we estimated the false discovery rate (FDR) of our pipeline by rerunning it in full after reversing the role of the two sexes (SI Methods). Because no transfers were detected in this sex-reversed pipeline, the estimated FDR in the present study is indistinguishable from zero. Second, we reasoned that the donor regions of the Y-linked transfers should have significantly elevated coverage in males relative to females of the same strain, after weighting the coverage to account for variation across samples and chromosomes (SI Methods). Indeed, the male:female weighted coverage ratio (WCR) was consistently higher in the detected transfer regions than expected according to the empirical WCR distribution (and this was significant for more than half of the incipient transfers; Dataset S1), suggesting our pipeline accurately identified Y-linked transfers. Finally, we confirmed the existence of all Y-linked transfers from D. mauritiana and D. simulans by determining that the associated Y-linked sequences generated PCR amplicons in males only (Dataset S5 and SI Methods). Notably, our results suggested that a handful of Y-linked transfers had been involved in additional bouts of duplication on the Y chromosome (Table 1 and SI Methods). One of these GeTYs (Dsim_2R_9.41) also showed evidence for subsequent transfer onto the autosome or X chromosome (Fig. S1 and SI Methods), supporting a previous report that the Y chromosome is also an occasional source for gene transfers to other chromosomes in Drosophila (8).

Y-Linked Transfer Haplotype Assembly.

To facilitate additional analyses of the Y-linked transfers, we reconstructed the Y-linked haplotype for each of the incipient transfers by extracting all reads mapping to the donor region that contained putative Y-linked alleles, then using these reads to de novo assemble the translocated sequence (SI Methods). We checked the quality of our de novo assemblies by looking in more detail at the reconstructed haplotype for GeTY Dmel_3R_20.35, for which a 200-kb contig bearing the full translocation is publicly available (9) (SI Methods). GeTY Dmel_3R_20.35 was previously reported to be a DNA translocation that contains the youngest functional Y-linked gene described for D. melanogaster to date (FDY; parental ortholog: vig2) (9). The original DNA translocation also included additional genes that show evidence of pseudogenization (Moc2/CG42503, Caliban, and Bili) (9). Our four Y-linked haplotypes from this region produced near-perfect alignments with the published 200-kb Y-linked contig bearing the full translocation (9) (SI Methods), confirming the high quality of our Y-linked haplotype reconstructions.

Divergent Modes of Gene Transfer onto the Drosophila Y Chromosome.

Y-linked transfers can be generated by two distinct mechanisms: either via a translocation of a genomic region (i.e., DNA translocations) or through the integration of reverse transcribed genes (i.e., retrocopies) (10, 11). Although all nongenic transfers are necessarily DNA translocations, GeTYs may arise from either mechanism. For intron-bearing genes, the distinction between the two mechanisms is straightforward: in the case of retrocopies, alignment of the Y-linked haplotypes to the parental ortholog will show evidence for splicing (should the donor gene contain introns) and will lie within donor gene boundaries. In contrast, DNA translocations will include intronic sequences, and the translocated regions need not coincide with parental gene boundaries. Based on these characteristics, 8 of the 25 GeTYs were Y-linked retrocopy insertions (5 in D. mauritiana, 6 in D. simulans, 3 shared; 0 in D. melanogaster) and 13 were DNA translocations (7 in D. mauritiana, 5 in D. simulans, 3 shared; 4 in D. melanogaster) (Table 1, Figs. S2 and S3, and Dataset S5). The transfer mechanism of the remaining four GeTYs (three in D. mauritiana and one in D. melanogaster) could not be unambiguously determined (SI Methods). Three lines of evidence suggest that the observed Y-linked transfers are probably fixed within each species: (i) the effective population size of the Y chromosome is relatively small (25% that of autosomes), (ii) two D. simulans Y-linked retrocopies analyzed in a PCR assay were fixed in a global sample of 25 males (Dataset S6 and SI Methods), and (iii) some gene transfers are shared between species.
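
The decision logic for separating the two mechanisms can be summarized in a short Python sketch. This is a simplification written for illustration only: the class, its field names, and the handling of conflicting evidence are assumptions, and the criteria actually applied in the study are those described in SI Methods.

```python
from dataclasses import dataclass


@dataclass
class TransferAlignment:
    """Hypothetical summary of a Y-linked haplotype aligned to its donor gene."""
    donor_has_introns: bool             # donor gene model contains introns
    introns_present_in_y_copy: bool     # intronic sequence found in the Y copy
    within_donor_gene_boundaries: bool  # Y copy does not extend past the gene


def classify_transfer(aln: TransferAlignment) -> str:
    """Classify a GeTY as 'retrocopy', 'DNA translocation', or 'ambiguous'."""
    if aln.donor_has_introns and aln.introns_present_in_y_copy:
        # Intronic sequence was carried along: not derived from a spliced mRNA.
        return "DNA translocation"
    if not aln.within_donor_gene_boundaries:
        if aln.donor_has_introns:
            # Spliced copy that extends past the gene: conflicting evidence.
            return "ambiguous"
        return "DNA translocation"
    if aln.donor_has_introns:
        # Introns missing and copy confined to the gene: signature of splicing.
        return "retrocopy"
    # Intronless donor confined to gene boundaries: splicing cannot be assessed.
    return "ambiguous"


example = classify_transfer(TransferAlignment(True, False, True))
```

The final branch reflects why intronless donor genes are hard to assign: without introns there is no splicing signature to look for.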

Although the number of nongenic transfers is similar across the three species (8 in D. melanogaster, 7 in D. mauritiana, and 5 in D. simulans), the 5 GeTYs in D. melanogaster are significantly fewer than the 20 independent transfers observed in the D. simulans clade (P = 0.011, Poisson test; Methods). This discrepancy appears to be largely driven by a significantly elevated Y-linked retrocopy insertion rate in the D. simulans clade (P = 0.004, Poisson test), a result that is even more remarkable given the absence of evidence for Y-linked retrocopies in Drosophila to date. Notably, D. melanogaster has the most complete gene annotation of the three species in this study, implying that interspecies differences in the quality and quantity of gene annotations did not bias our Y-linked retrocopy detection. These findings suggest that Y-linked gene transfer rates, and the underlying molecular mechanisms driving the translocations, can undergo significant divergence over relatively brief evolutionary time spans in Drosophila.

Y-Linked Gene Transfers Are Recent but Show Limited Evidence of Purifying Selection.

The general lack of shared Y-linked transfers across all three species suggests that the observed Y-linked transfers were relatively recent. We estimated the age of each transfer using a method that minimizes the influence of ancestral polymorphisms by ignoring Y-linked substitutions that are still segregating in the donor copy in large female samples (Methods and SI Methods). Although coarse, our estimates indicate that the Y-linked transfers are evolutionarily recent, with all arising within the past 1 My (Table 1 and Dataset S7), and that the age distributions of GeTYs and nongenic transfers were not significantly different (P = 0.58, two-sided Kolmogorov–Smirnov test; P = 0.74 after GeTYs adjusted for the effects of purifying selection and putative duplicated Y-linked transfers removed; SI Methods). Although the latter result implies that GeTYs were effectively behaving like neutral loci, the fact that the only shared Y-linked transfers between species were GeTYs suggests they were subject to stronger Y-linked purifying selection than nongenic transfers in general. Further, many of these shared transfers had estimated ages that were younger than the recorded split between the two species, which may have resulted from purifying selection removing new mutations in the Y-linked copies. Thus, we performed two additional analyses to determine if any of the GeTYs were functional.

First, we measured ω, the ratio of nonsynonymous to synonymous substitutions that had accumulated in each Y-linked copy following the transfer, and tested whether this ratio differed from neutral expectations (i.e., ω significantly less than 1; Methods and SI Methods). To avoid a bias toward high ω values due to incorrect gene models, we performed de novo gene predictions for each GeTY and only retained instances where the predicted Y-linked ORF included the start and stop codons and produced a largely complete (i.e., >90% of the predicted codons could be aligned) and consistent (i.e., contained no frameshifts or stop codons) alignment with the donor copy from the focal species and Drosophila yakuba. Our results revealed that many of the GeTYs contained incomplete ORFs or inactivating mutations (Table 1 and Dataset S8), with only three being maintained by purifying selection after transferring to the Y chromosome: two GeTYs in D. mauritiana (LamC on Dmau_2R_8.73 and CG12206 on Dmau_X_3.07) and one in D. simulans (Sod on Dsim_3L_10.87) (all ω ≤ 0.14 and q ≤ 0.03; Table 1 and Dataset S9). Notably, several genes had low to moderate ω values but were not significant (Table 1)—including FDY, the recently discovered young Y-linked gene in D. melanogaster (9)—suggesting that our selection tests were probably conservative.
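
For readers unfamiliar with likelihood ratio tests of ω, the following Python sketch shows the generic calculation. It assumes a comparison between a null model with ω fixed at 1 and an alternative with ω estimated freely, with the test statistic referred to a chi-square distribution with one degree of freedom; that setup and the log-likelihood values are illustrative assumptions, and the multiple-testing correction that yields the reported q values is not shown.

```python
import math


def lrt_p_value(lnl_null: float, lnl_alt: float) -> float:
    """P value of a likelihood ratio test with one degree of freedom.

    lnl_null: log-likelihood with omega fixed at 1 (neutral expectation).
    lnl_alt:  log-likelihood with omega estimated from the data.
    """
    statistic = max(2.0 * (lnl_alt - lnl_null), 0.0)
    # Survival function of a chi-square variable with 1 df: erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(statistic / 2.0))


# Hypothetical log-likelihoods for one Y-linked ORF.
p_value = lrt_p_value(lnl_null=-1004.2, lnl_alt=-1000.0)
```

A small P value indicates only that ω differs from 1; purifying selection is inferred when, in addition, the estimated ω is below 1.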

As a second test of GeTY functionality, we used testis-specific RNA-Seq data from D. simulans to quantify the expression of the GeTYs in this species (SI Methods). Several GeTYs showed weak evidence for low levels of expression; however, this did not include the functional GeTY identified in the ω-based tests (Fig. S4 and Dataset S10). Although limitations in our tests may have precluded detection of some functional Y-linked genes, the evidence indicates that purifying selection has played a minor role in maintaining recent gene transfers onto the Drosophila Y chromosome.

Discussion

High Rates of Y-Linked Gene Traffic.

Our unbiased approach to detect Y-linked gene transfers has uncovered several fundamental aspects of Y chromosome evolution in Drosophila. We observe a high transfer rate of primary genetic material from the rest of the genome onto the Y chromosome (1.67 per My; Methods), which exceeds the slow accumulation of functional Y-linked genes inferred for the Drosophila genus over longer evolutionary times by an order of magnitude [0.12 per My (12)]. Despite being much higher than previous estimates, the primary Y-linked gene acquisition rate inferred here appears to be up to an order of magnitude lower than for the rest of the genome for Drosophila (13–15), although the inclusion of different transfer categories in previous studies (e.g., de novo genes and intrachromosomal transfers) complicates direct comparisons. Further, a handful of GeTYs show evidence for functionality [4/25 = 16%, including FDY in D. melanogaster (9)], which leads to a functional Y-linked gene acquisition rate that is approximately double the previous estimate (12) (four functional GeTYs/5.4 My/three species = 0.25 new genes/species/My). Nevertheless, many of the GeTYs did not have complete ORFs, contained frameshift mutations, or showed no evidence of expression. This implies that the majority of the Y-linked transfers reported here have become pseudogenes and that the Y chromosome is a less hospitable genetic environment for new gene evolution than the rest of the genome in Drosophila.
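
The rate comparisons quoted in this paragraph can be checked with a few lines of Python. All numbers are taken from the text; the variable names are illustrative.

```python
gety_rate = 1.67          # GeTYs per lineage per My (Methods)
long_term_rate = 0.12     # functional Y-linked gene gains per My (ref. 12)

functional_getys = 4      # three detected here plus FDY in D. melanogaster
divergence_my = 5.4       # divergence time in My
n_species = 3

# Functional acquisition rate per species per My.
functional_rate = functional_getys / divergence_my / n_species

# Fold differences relative to the long-term estimate.
transfer_fold = gety_rate / long_term_rate
functional_fold = functional_rate / long_term_rate

print(f"functional rate: {functional_rate:.2f} per species per My")
print(f"transfer rate is {transfer_fold:.1f}x the long-term estimate")
print(f"functional rate is {functional_fold:.1f}x the long-term estimate")
```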

The Dynamic and Challenging Genetic Environment of the Drosophila Y Chromosome.

A recent study revealed that D. melanogaster has more Y-linked genes than Drosophila virilis, primarily due to the higher number of gene gains in the former since the two species last shared a common ancestor (3). This result suggests that the elevated rate of functional Y-linked gene acquisition reported here may reflect a general acceleration in this rate across the Drosophila subgroup relative to their sister taxa. The mechanistic driver behind this putative lineage-specific change remains unknown, but possible factors include increased accessibility of the Y chromosome to transfers (e.g., due to more relaxed chromatin conformation) or improved efficacy of Y-linked selection (e.g., due to increased effective population size for the Y chromosome), among others. Alternatively, the rate of functional GeTY acquisition reported here could be a transient phenomenon, whereby the short-term rate (over ∼1 My) eventually converges with the slower long-term rates (over ∼60 My). The efficacy of selection on weakly deleterious mutations is reduced on the Drosophila Y chromosome relative to other chromosomes (16), whereby many of the newly transferred genes, including those currently under selection, could become pseudogenized over longer time periods. Consistent with this idea, many of the GeTYs displaying evidence for low levels of expression in D. simulans were also present in D. mauritiana; however, none of these shared GeTYs displayed significant purifying selection in either species. Additional testing on more Drosophila species is ultimately required to determine the temporal stability of the Y-linked gene transfer rate, although the acquisition rate of duplicated genes on the autosomes and X chromosome appears to be relatively stable over long periods in Drosophila (15), particularly for retrocopies (14).
Regardless of the underlying cause of the temporal disparity in Y-linked gene gains reported here, when combined with the significant differences in Y-linked retrocopy traffic across the D. melanogaster subgroup, our study reveals that the Drosophila Y chromosome is an even more dynamic genetic environment than previously appreciated and is capable of undergoing significant changes over relatively short evolutionary time scales.

Conclusion

In contrast to many other species, the Drosophila Y chromosome is a highly dynamic genetic environment. For example, in D. pseudoobscura the Y chromosome is not homologous to the ancestral Drosophila Y chromosome but has arisen de novo (17). We have shown that Y-linked gene acquisition over the past million years is a highly dynamic feature of the Drosophila Y chromosome, with 10 times more gene traffic and twice the number of functional gene gains than are expected given the Y-linked gene acquisition rate recorded over the past 63 My (12). In addition to heterogeneous Y-linked gene acquisition dynamics, our method has revealed previously unknown properties of the Y chromosome, i.e., frequent retrocopy traffic onto the Y chromosome, which appears to have lineage-specific dynamics. Further research is required to determine whether this pattern reflects an actual elevation in the functional gene acquisition rate or represents the short-term evolutionary dynamics of the Y chromosome, which will eventually converge to the slower long-term rate. Similarly, we still do not know what the ancestral Y-linked retrocopy transfer rate was and how this is evolving in general across the Drosophila complex. These questions and many others can be empirically tested by applying the present method to the growing number of Drosophila species with reference genomes. Moreover, because our method can determine Y-linked gene transfers using inexpensive short reads and does not depend on a preassembled Y chromosome or associated contigs, it holds the potential to reveal fundamental details of Y chromosome evolution in many other species at a hitherto unmatched level of resolution.

Methods

Y-Linked Transfer Identification and Haplotype Assembly.

Using Illumina paired-end reads, we sequenced males and females of two strains from all three species and used the Burrows-Wheeler Aligner (BWA) (18) to map reads to reference genomes lacking Y chromosomes. SNPs with large allele frequency differences between the two sexes from the same strain were identified (see Dataset S11 for a list of all diagnostic SNPs) and then grouped into larger regions according to inter-SNP distances and gene boundaries. Haplotypes for the resulting regions were assembled de novo with trans-ABySS (Assembly by Short Sequences) (19), combining the subset of reads containing Y-linked alleles from both strains. A detailed explanation of the analytical pipeline and haplotype assembly is provided in SI Methods.
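
The grouping of diagnostic SNPs into candidate regions can be sketched in Python as follows. This is a simplified illustration, not the published pipeline: it clusters by inter-SNP distance only (gene boundaries are ignored), and the `max_gap` value is a hypothetical parameter, since the actual distance criteria are given in SI Methods. The minimum of five SNPs per region follows the filter described in Results.

```python
from typing import List, Tuple


def cluster_snps(positions: List[int], max_gap: int,
                 min_snps: int = 5) -> List[Tuple[int, int, int]]:
    """Group SNP positions on one chromosome into (start, end, n_snps) regions.

    Consecutive SNPs separated by more than max_gap start a new cluster;
    clusters with fewer than min_snps SNPs are discarded.
    """
    clusters: List[Tuple[int, int, int]] = []
    current: List[int] = []
    for pos in sorted(positions):
        if current and pos - current[-1] > max_gap:
            if len(current) >= min_snps:
                clusters.append((current[0], current[-1], len(current)))
            current = []
        current.append(pos)
    if len(current) >= min_snps:
        clusters.append((current[0], current[-1], len(current)))
    return clusters


# Hypothetical positions of SNPs with large male-female frequency differences.
snps = [100, 150, 220, 300, 340, 5000, 5050,
        9000, 9100, 9150, 9230, 9300, 9390]
regions = cluster_snps(snps, max_gap=500)
```

With these toy positions the two isolated SNPs near position 5,000 are dropped, leaving two candidate regions.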

Estimating GeTY Age and Function.

The age of each GeTY was estimated using the Y-specific nucleotide divergence from the parental ortholog scaled by the D. melanogaster base substitution rate (2.8 × 10⁻⁹; ref. 20). Tests for purifying selection were performed by using AUGUSTUS (21) to predict Y-linked ORFs for each GeTY, then using codon-based phylogenetic analyses implemented in PAML (Phylogenetic Analysis by Maximum Likelihood) (22) to estimate ω for each ORF relative to the reconstructed ancestral donor sequence and using likelihood ratio tests to determine whether these ω estimates significantly differed from 1. More details on the aging and function tests are provided in SI Methods.
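
As a rough illustration of the scaling step, the Python sketch below converts Y-specific divergence into an age. It rests on several assumptions that are not stated in this section: the rate is treated as per site per generation (as in the cited mutation rate study), substitutions are assumed to accumulate on the Y-linked copy only, and the number of generations per year is a hypothetical value. The exact procedure, including the exclusion of sites still segregating in the donor copy, is described in SI Methods.

```python
MUTATION_RATE = 2.8e-9  # value quoted in the text; assumed per site per generation


def transfer_age_generations(y_specific_subs: int, aligned_sites: int,
                             rate: float = MUTATION_RATE) -> float:
    """Age in generations from Y-specific divergence scaled by the rate."""
    if aligned_sites <= 0:
        raise ValueError("aligned_sites must be positive")
    divergence = y_specific_subs / aligned_sites
    return divergence / rate


def transfer_age_years(y_specific_subs: int, aligned_sites: int,
                       generations_per_year: float = 10.0) -> float:
    """Age in years; generations_per_year is an assumed, hypothetical value."""
    generations = transfer_age_generations(y_specific_subs, aligned_sites)
    return generations / generations_per_year


# Hypothetical transfer: 14 Y-specific substitutions across 10,000 aligned sites.
age_gen = transfer_age_generations(14, 10_000)
age_years = transfer_age_years(14, 10_000)
```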

Y-Linked Gene Transfer Rate.

The transfer rate was calculated as the average number of GeTYs observed across the three lineages divided by the estimated time of divergence between D. melanogaster and the D. simulans clade. To ensure phylogenetic independence in the D. simulans clade, we counted the transfers shared between species only once and added this number to the average of the remaining species-specific transfers in this clade [i.e., 6 + (5 + 9)/2 = 13]. Thus, the effective number of transfers, Neff, is equal to Nmel + Nsim_clade = 5 + 13 = 18. To get the average number of Y-linked transfers per lineage, we divided Neff by the number of distinct lineages, L, and divided this value by the divergence time, d, to derive the average transfer rate: (Neff/L)/d = (18/2)/5.4 My = 1.67 novel GeTYs per lineage per My. Note that this serves as a lower bound to the true GeTY rate because transfers will be unobserved if they have degraded sufficiently to prevent read alignment or because they lack the required number of diagnostic divergent SNPs to be determined in our pipeline (i.e., they carry fewer than five SNPs; SI Methods).
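
The arithmetic above can be reproduced with a short Python sketch. The counts and divergence time come from the text; the function and variable names are illustrative.

```python
def clade_transfers(shared: int, specific_a: int, specific_b: int) -> float:
    """Shared transfers counted once plus the mean of species-specific ones."""
    return shared + (specific_a + specific_b) / 2


n_mel = 5                                   # GeTYs in D. melanogaster
n_sim_clade = clade_transfers(6, 5, 9)      # 6 shared, 5 and 9 species-specific
n_eff = n_mel + n_sim_clade                 # effective number of transfers

n_lineages = 2        # D. melanogaster and the D. simulans clade
divergence_my = 5.4   # divergence time in My

rate = (n_eff / n_lineages) / divergence_my
print(f"N_eff = {n_eff:g}, rate = {rate:.2f} GeTYs per lineage per My")
```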

Lineage-Specific Rate Tests.

We modeled gene transfers on the Y chromosome as a Poisson process where λ is the Y-linked gene transfer rate. Differences in the Y-linked gene transfer rate between the D. melanogaster and D. simulans lineages were tested by determining the probability of observing at most the number of D. melanogaster-specific transfers given the average number of transfers specific to the two species in the D. simulans clade. Phylogenetic independence was accounted for in the same way as for the estimation of the gene transfer rate (see above). This method was applied to all gene transfers, and DNA translocations and retrocopies separately, resulting in the following probabilities: Poisson(X ≤ 5 | λ = 13) = 0.011 for all gene transfers, Poisson(X ≤ 4 | λ = 6) = 0.285 for DNA translocations, and Poisson(X = 0 | λ = 5.5) = 0.004 for retrocopies. Note that the latter test remained significant when treating the ambiguous GeTY in D. melanogaster as a retrocopy: Poisson(X ≤ 1 | λ = 5.5) = 0.027. No significant differences were observed between D. melanogaster and the D. simulans clade for the detected functional GeTYs: Poisson(X ≤ 1 | λ = 1.5) = 0.56.
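
The probabilities listed above follow directly from the Poisson cumulative distribution function, as the Python sketch below shows. The λ values and observed counts are those given in the text; only the helper function name is an addition.

```python
import math


def poisson_cdf(k: int, lam: float) -> float:
    """P(X <= k) for X ~ Poisson(lam)."""
    return sum(math.exp(-lam) * lam ** i / math.factorial(i)
               for i in range(k + 1))


# (observed count in D. melanogaster, lambda from the D. simulans clade)
tests = {
    "all gene transfers": (5, 13.0),
    "DNA translocations": (4, 6.0),
    "retrocopies": (0, 5.5),
    "retrocopies, ambiguous GeTY included": (1, 5.5),
    "functional GeTYs": (1, 1.5),
}

probabilities = {name: poisson_cdf(k, lam) for name, (k, lam) in tests.items()}
for name, p in probabilities.items():
    print(f"{name}: P = {p:.3f}")
```

Each λ is obtained in the same way as the effective transfer count for the D. simulans clade, i.e., shared transfers counted once plus the mean of the species-specific transfers (for example, 3 + (2 + 3)/2 = 5.5 for retrocopies).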


Acknowledgments

We thank E. Hellmich, C. Pegueroles-Queralt, R. Sommer, B. Ballard, A. Paaby, B. Sebnem Onder, J. F. Garcia, M. Ofner, M. Puchinger, T. Little, M. Imhof, and C. Niessinger for collecting or providing the fly stocks used in this study. We are grateful to Nicola Palmieri for providing updated versions of the D. simulans and D. mauritiana reference genomes and the team at the Vienna BioCenter Core Facilities (VBCF) Next Generation Sequencing (NGS) Unit (www.vbcf.ac.at/home/) for performing part of the Illumina sequencing for this study. We thank Andy Clark for helpful discussions on data interpretation and acknowledge the insightful comments from two anonymous reviewers that led to further improvements in the manuscript. We thank the National Human Genome Research Institute (NHGRI)-funded Encyclopedia of DNA Elements Consortium for providing some of the testis data used in this study and the Brian Oliver laboratory for generating these data. This work was supported by a PhD fellowship (to R.T.) from the Vetmeduni Vienna, the Austrian Science Fund (W1225-B20), and the European Research Council grant ArchAdapt (to C.S.).

Footnotes

The authors declare no conflict of interest.

This article is a PNAS Direct Submission.

Data deposition: Sequence data are available from the European Nucleotide Archive (ENA) (accession no. PRJEB22850). Custom scripts used for analyses are available on Dryad (doi:10.5061/dryad.8ph59).

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1706502114/-/DCSupplemental.

References

1. Bernardo Carvalho A, Koerich LB, Clark AG. Origin and evolution of Y chromosomes: Drosophila tales. Trends Genet. 2009;25:270–277. doi: 10.1016/j.tig.2009.04.002.
2. Carvalho AB. Origin and evolution of the Drosophila Y chromosome. Curr Opin Genet Dev. 2002;12:664–668. doi: 10.1016/s0959-437x(02)00356-8.
3. Carvalho AB, Clark AG. Efficient identification of Y chromosome sequences in the human and Drosophila genomes. Genome Res. 2013;23:1894–1907. doi: 10.1101/gr.156034.113.
4. Bachtrog D. Y-chromosome evolution: Emerging insights into processes of Y-chromosome degeneration. Nat Rev Genet. 2013;14:113–124. doi: 10.1038/nrg3366.
5. Garrigan D, et al. Genome sequencing reveals complex speciation in the Drosophila simulans clade. Genome Res. 2012;22:1499–1511. doi: 10.1101/gr.130922.111.
6. Tamura K, Subramanian S, Kumar S. Temporal patterns of fruit fly (Drosophila) evolution revealed by mutation clocks. Mol Biol Evol. 2004;21:36–44. doi: 10.1093/molbev/msg236.
7. Hall AB, et al. Six novel Y chromosome genes in Anopheles mosquitoes discovered by independently sequencing males and females. BMC Genomics. 2013;14:273. doi: 10.1186/1471-2164-14-273.
8. Dyer KA, White BE, Bray MJ, Piqué DG, Betancourt AJ. Molecular evolution of a Y chromosome to autosome gene duplication in Drosophila. Mol Biol Evol. 2011;28:1293–1306. doi: 10.1093/molbev/msq334.
9. Carvalho AB, Vicoso B, Russo CAM, Swenor B, Clark AG. Birth of a new gene on the Y chromosome of Drosophila melanogaster. Proc Natl Acad Sci USA. 2015;112:12450–12455. doi: 10.1073/pnas.1516543112.
10. Long M, VanKuren NW, Chen S, Vibranovski MD. New gene evolution: Little did we know. Annu Rev Genet. 2013;47:307–333. doi: 10.1146/annurev-genet-111212-133301.
11. Chen S, Krinsky BH, Long M. New genes as drivers of phenotypic evolution. Nat Rev Genet. 2013;14:645–660. doi: 10.1038/nrg3521.
12. Koerich LB, Wang X, Clark AG, Carvalho AB. Low conservation of gene content in the Drosophila Y chromosome. Nature. 2008;456:949–951. doi: 10.1038/nature07463.
13. Zhou Q, et al. On the origin of new genes in Drosophila. Genome Res. 2008;18:1446–1455. doi: 10.1101/gr.076588.108.
14. Bai Y, Casola C, Feschotte C, Betrán E. Comparative genomics reveals a constant rate of origination and convergent acquisition of functional retrogenes in Drosophila. Genome Biol. 2007;8:R11. doi: 10.1186/gb-2007-8-1-r11.
15. Zhang YE, Vibranovski MD, Krinsky BH, Long M. Age-dependent chromosomal distribution of male-biased genes in Drosophila. Genome Res. 2010;20:1526–1533. doi: 10.1101/gr.107334.110.
16. Singh ND, Koerich LB, Carvalho AB, Clark AG. Positive and purifying selection on the Drosophila Y chromosome. Mol Biol Evol. 2014;31:2612–2623. doi: 10.1093/molbev/msu203.
17. Carvalho AB, Clark AG. Y chromosome of D. pseudoobscura is not homologous to the ancestral Drosophila Y. Science. 2005;307:108–110. doi: 10.1126/science.1101675.
18. Li H, Durbin R. Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009;25:1754–1760. doi: 10.1093/bioinformatics/btp324.
19. Robertson G, et al. De novo assembly and analysis of RNA-seq data. Nat Methods. 2010;7:909–912. doi: 10.1038/nmeth.1517.
20. Keightley PD, Ness RW, Halligan DL, Haddrill PR. Estimation of the spontaneous mutation rate per nucleotide site in a Drosophila melanogaster full-sib family. Genetics. 2014;196:313–320. doi: 10.1534/genetics.113.158758.
21. Stanke M, Diekhans M, Baertsch R, Haussler D. Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics. 2008;24:637–644. doi: 10.1093/bioinformatics/btn013.
22. Yang Z. PAML 4: Phylogenetic analysis by maximum likelihood. Mol Biol Evol. 2007;24:1586–1591. doi: 10.1093/molbev/msm088.
23. Miller SA, Dykes DD, Polesky HF. A simple salting out procedure for extracting DNA from human nucleated cells. Nucleic Acids Res. 1988;16:1215. doi: 10.1093/nar/16.3.1215.
24. Palmieri N, Nolte V, Chen J, Schlötterer C. Genome assembly and annotation of a Drosophila simulans strain from Madagascar. Mol Ecol Resour. 2014;15:372–381. doi: 10.1111/1755-0998.12297.
25. Brizuela BJ, Elfring L, Ballard J, Tamkun JW, Kennison JA. Genetic analysis of the brahma gene of Drosophila melanogaster and polytene chromosome subdivisions 72AB. Genetics. 1994;137:803–813. doi: 10.1093/genetics/137.3.803.
26. Nolte V, Pandey RV, Kofler R, Schlötterer C. Genome-wide patterns of natural variation reveal strong selective sweeps and ongoing genomic conflict in Drosophila mauritiana. Genome Res. 2013;23:99–110. doi: 10.1101/gr.139873.112.
27. Pandey RV, Schlötterer C. DistMap: A toolkit for distributed short read mapping on a Hadoop cluster. PLoS One. 2013;8:e72614. doi: 10.1371/journal.pone.0072614.
28. Li H, et al.; 1000 Genome Project Data Processing Subgroup. The sequence alignment/map format and SAMtools. Bioinformatics. 2009;25:2078–2079. doi: 10.1093/bioinformatics/btp352.
29. Kofler R, Pandey RV, Schlötterer C. PoPoolation2: Identifying differentiation between populations using sequencing of pooled DNA samples (Pool-Seq). Bioinformatics. 2011;27:3435–3436. doi: 10.1093/bioinformatics/btr589.
30. Agresti A. Categorical Data Analysis. John Wiley & Sons; Hoboken, NJ: 2002.
31. Smit A, Hubley R, Green P. 2013 RepeatMasker Open-4.0. 2013–2015. Available at repeatmasker.org. Accessed May 2, 2015.
32. Tobler R, et al. Massive habitat-specific genomic response in D. melanogaster populations during experimental evolution in hot and cold environments. Mol Biol Evol. 2014;31:364–375. doi: 10.1093/molbev/mst205.
33. Orozco-terWengel P, et al. Adaptation of Drosophila to a novel laboratory environment reveals temporally heterogeneous trajectories of selected alleles. Mol Ecol. 2012;21:4931–4941. doi: 10.1111/j.1365-294X.2012.05673.x.
34. Zerbino DR, Birney E. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008;18:821–829. doi: 10.1101/gr.074492.107.
35. Wu TD, Nacu S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics. 2010;26:873–881. doi: 10.1093/bioinformatics/btq057.
36. Kent WJ. BLAT—The BLAST-like alignment tool. Genome Res. 2002;12:656–664. doi: 10.1101/gr.229202.
37. Dennis G, Jr, et al. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol. 2003;4:P3.
38. Huang W, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009;4:44–57. doi: 10.1038/nprot.2008.211.
39. Assis R, Zhou Q, Bachtrog D. Sex-biased transcriptome evolution in Drosophila. Genome Biol Evol. 2012;4:1189–1200. doi: 10.1093/gbe/evs093.
40. Roy S, et al.; modENCODE Consortium. Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science. 2010;330:1787–1797. doi: 10.1126/science.1198374.
41. Katz Y, et al. Quantitative visualization of alternative exon expression from RNA-seq data. Bioinformatics. 2015;31:2400–2402. doi: 10.1093/bioinformatics/btv034.
42. Thorvaldsdóttir H, Robinson JT, Mesirov JP. Integrative Genomics Viewer (IGV): High-performance genomics data visualization and exploration. Brief Bioinform. 2013;14:178–192. doi: 10.1093/bib/bbs017.
43. Innan H, Kondrashov F. The evolution of gene duplications: Classifying and distinguishing between models. Nat Rev Genet. 2010;11:97–108. doi: 10.1038/nrg2689.
44. Hill T, Schlötterer C, Betancourt AJ. Hybrid dysgenesis in Drosophila simulans associated with a rapid invasion of the P-element. PLoS Genet. 2016;12:e1005920. doi: 10.1371/journal.pgen.1005920.
45. Kofler R, Nolte V, Schlötterer C. Tempo and mode of transposable element activity in Drosophila. PLoS Genet. 2015;11:e1005406. doi: 10.1371/journal.pgen.1005406.
46. Slater GSC, Birney E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics. 2005;6:31. doi: 10.1186/1471-2105-6-31.
47. Loytynoja A. Phylogeny-aware alignment with PRANK. Methods Mol Biol. 2014;1079:155–170. doi: 10.1007/978-1-62703-646-7_10.
48. Löytynoja A, Goldman N. An algorithm for progressive multiple alignment of sequences with insertions. Proc Natl Acad Sci USA. 2005;102:10557–10562. doi: 10.1073/pnas.0409137102.
49. Storey JD. A direct approach to false discovery rates. J R Stat Soc Series B Stat Methodol. 2002;64:479–498.
50. Rogers RL, Shao L, Sanjak JS, Andolfatto P, Thornton KR. Revised annotations, sex-biased expression, and lineage-specific genes in the Drosophila melanogaster group. G3 (Bethesda). 2014;4:2345–2351. doi: 10.1534/g3.114.013532.
51. Chen ZX, et al. Comparative validation of the D. melanogaster modENCODE transcriptome annotation. Genome Res. 2014;24:1209–1223. doi: 10.1101/gr.159384.113.
52. ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74. doi: 10.1038/nature11247.

