From De Novo to “De Nono”: The Majority of Novel Protein-Coding Genes Identified with Phylostratigraphy Are Old Genes or Recent Duplicates

Claudio Casola

2018 Oct 22;10(11):2906–2918. doi: 10.1093/gbe/evy231

Editor: George Zhang

PMCID: PMC6239577 PMID: 30346517

Abstract

The evolution of novel protein-coding genes from noncoding regions of the genome is one of the most compelling pieces of evidence for genetic innovations in nature. One popular approach to identify de novo genes is phylostratigraphy, which consists of determining the approximate time of origin (age) of a gene based on its distribution along a species phylogeny. Several studies have revealed significant flaws in determining the age of genes, including de novo genes, using phylostratigraphy alone. However, the rate of false positives in de novo gene surveys, based on phylostratigraphy, remains unknown. Here, I reanalyze the findings from three studies, two of which identified tens to hundreds of rodent-specific de novo genes adopting a phylostratigraphy-centered approach. Most putative de novo genes discovered in these investigations are no longer included in recently updated mouse gene sets. Using a combination of synteny information and sequence similarity searches, I show that ∼60% of the remaining 381 putative de novo genes share homology with genes from other vertebrates, originated through gene duplication, and/or share no synteny information with nonrodent mammals. These results led to an estimated rate of ∼12 de novo genes per million years in mouse. Contrary to a previous study (Wilson BA, Foy SG, Neme R, Masel J. 2017. Young genes are highly disordered as predicted by the preadaptation hypothesis of de novo gene birth. Nat Ecol Evol. 1:0146), I found no evidence supporting the preadaptation hypothesis of de novo gene formation. Nearly half of the de novo genes confirmed in this study are within older genes, indicating that co-option of preexisting regulatory regions and a higher GC content may facilitate the origin of novel genes.

Keywords: de novo genes, synteny, gene age

Introduction

Protein-coding genes can emerge through mechanisms varying from gene duplication to horizontal transfer and the “domestication” of transposable elements, all of which involve pre-existing coding regions. Conversely, the process of de novo gene formation consists of the evolution of novel coding sequences from previously noncoding regions, thus generating entirely novel proteins. The discovery of de novo genes is facilitated by extensive comparative genomic data of closely related species and their accurate gene annotation. Because of these requirements, de novo genes have been mainly characterized in model organisms such as Saccharomyces cerevisiae (Carvunis et al. 2012; Lu et al. 2017; Vakirlis et al. 2017), Drosophila (Begun et al. 2007; Reinhardt et al. 2013; Zhao et al. 2014), and mammals (Heinen et al. 2009; Knowles and McLysaght 2009; Li et al. 2010; Murphy and McLysaght 2012; Neme and Tautz 2013, 2016; Ruiz-Orera et al. 2015; Guerzoni and McLysaght 2016).

Even among model organisms, the identification of de novo genes remains challenging. One major caveat in de novo gene discovery is the need to associate candidates with signatures of biological function. Because de novo genes described thus far tend to be taxonomically restricted to a narrow set of species, their functionality is not as obvious as for older genes that are conserved across multiple taxa. Transcription and translation are considered strong evidence of de novo gene functionality, although the detection of peptides encoded from putative coding sequences is not always an indication of protein activity (Xu and Zhang 2016). Given their limited taxonomic distribution, testing sequence conservation and selective regimes on de novo gene coding sequences is often unachievable. Population genomic data can provide evidence of purifying selection on de novo genes but are thus far limited to a relatively small number of species (Zhao et al. 2014; Chen et al. 2015).

An important signature of de novo gene evolution is the presence of enabler substitutions, which are nucleotide changes that alter a “proto-genic” DNA region to facilitate its transcription and/or ability to encode for a protein (Vakirlis et al. 2017). Enabler substitutions can be recognized only by comparing de novo genes with their ancestral noncoding state, which can be identified by analyzing syntenic regions in the genome of closely related species that do not share such substitutions (Guerzoni and McLysaght 2016). Putative de novo genes with no detectable synteny in other species have likely originated through other processes, including duplication of preexisting genes, horizontal gene transfer, or transposon insertion, and subsequent domestication. Loss of synteny may also arise through deletion of orthologous genes in other species; however, this scenario appears improbable when comparing a large number of genomes. Genomic regions with complex rearrangements or the rapid sequence evolution of noncoding sequences could also erode synteny information to the point that orthologous regions of de novo genes might not be identifiable in sister genomes. However, these mechanisms are unlikely to play an important role in closely related species.

A third fundamental hallmark of de novo genes is the lack of homology of their proteins with proteins from other organisms. This feature is often the first step in comparative genome-wide surveys aimed at identifying putative de novo genes in a given species or group of species. Similarly, de novo proteins should be devoid of functional domains that occur in older proteins.

Previous studies on de novo gene evolution show a range of complexity in the strategies used to characterize these genes. A common approach to assess the evolutionary age of genes is known as phylostratigraphy, and it has often been applied to retrieve a set of candidate de novo genes. This method relies on homology searches, usually consisting of BLAST surveys of a species proteome, allowing the inference of a putative “age” of each gene along the phylogeny of a group of species (Domazet-Loso et al. 2007). For example, mouse proteins that share homology with sequences in the rat proteome, but are absent in other mammals, would be categorized as “rodent-specific.” Notably, genes that appear to be lineage-specific may represent de novo genes, but could also be derived from any of the other processes mentioned above.
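
The age-assignment logic of phylostratigraphy can be sketched in a few lines of Python. This is a minimal illustration and not the pipeline used in any of the cited studies: the clade names, their ordering, and the `assign_phylostratum` helper are hypothetical, and the homology calls are assumed to have been made beforehand (for example, with BLAST at a fixed e-value cutoff).

```python
# Hypothetical nested clades for a mouse-centred analysis, youngest first.
PHYLOSTRATA = ["mouse", "rodents", "mammals", "vertebrates"]


def assign_phylostratum(clades_with_hits):
    """Return the oldest (most inclusive) clade in which a homolog was detected.

    clades_with_hits: set of clade names in which a significant hit was found
    for the focal mouse protein. With no hits at all, the gene is assigned to
    the youngest stratum (the focal species itself).
    """
    stratum = PHYLOSTRATA[0]
    for clade in PHYLOSTRATA:
        if clade in clades_with_hits:
            stratum = clade  # keep walking outward; later entries are older
    return stratum


# A protein with a rat homolog but no hits elsewhere is "rodent-specific".
example = assign_phylostratum({"mouse", "rodents"})
```

In this scheme a missed distant homolog simply shifts the returned stratum toward the young end, which is how a rapidly diverging old gene can end up labeled lineage-specific.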

Phylostratigraphy is theoretically simple and effective, yet it is known to contain several methodological flaws (Elhaik et al. 2006; Moyers and Zhang 2015, 2016, 2017; Smith and Pease 2017). For instance, both rapid sequence evolution and short coding sequences lead to underestimating gene age (Moyers and Zhang 2015). This increases the likelihood that rapidly evolving genes are erroneously recognized as novel species-specific genes when phylostratigraphy-only approaches are used. It is known that after gene duplication one or both of the two copies may experience accelerated sequence evolution, which may result in an underestimate of their age. In agreement with this observation, a recent study in primates has shown that genes that evolve faster also tend to duplicate more (O'Toole et al. 2018). Importantly, phylostratigraphic studies that ignore synteny data will be unable to provide evidence of enabler substitutions and thus cannot conclusively prove de novo emergence.

It is perhaps not surprising that studies chiefly based on phylostratigraphy have led to estimates of de novo gene formation rates that exceed or are comparable to those of gene duplication. For instance, it has been suggested that the S. cerevisiae genome contains hundreds of de novo genes that emerged during Ascomycota evolution and that at least 19 genes are S. cerevisiae-specific, compared with a handful of gene duplicates found only in S. cerevisiae (Carvunis et al. 2012). Similarly, a relatively recent study reported that 780 novel genes emerged in mouse since its separation from the Brown Norway rat around 12 Ma, at a rate of 65 genes/Myr (Neme and Tautz 2013). Notably, the authors of this study do not directly distinguish between de novo genes and recent gene duplicates. However, in several sentences throughout their paper they appear to endorse the view that mouse-specific genes represent de novo genes by stating, for example, that “since the times considered for these youngest lineages are too short for the duplication-divergence model to apply” (Neme and Tautz 2013). According to these estimates, de novo genes represent about half of all mouse young genes, the other half being formed by gene duplicates. Because the overall gene number did not appear to have increased significantly during mammal and yeast evolution, such a high pace of de novo gene formation must be accompanied by rampant levels of gene loss. If true, this would represent a “gene turnover paradox,” given that most genes are maintained across mammals. For instance, according to the Mouse Genome Database, 17,093/22,909 (∼75%) protein-coding mouse genes share homology with human genes (Blake et al. 2017).

There are two possible resolutions to the “gene turnover paradox”: Either most de novo genes are rapidly lost after their origin or the rate of de novo gene birth is overestimated. The first scenario is supported by recent work based on mouse transcriptomic and ribosome profiling data (Neme and Tautz 2016; Schmitz et al. 2018). Other studies lend credence to the idea that estimates of the rate of de novo gene formation may be biased upward. For example, Moyers and Zhang used simulations to show that gene age is underestimated in a significant proportion of cases based on phylostratigraphy (Moyers and Zhang 2015, 2016, 2017). These authors pointed out that most putative S. cerevisiae-specific de novo genes overlap with older genes and show no signature of selection operating on their coding sequence (Moyers and Zhang 2016). These studies have proved critical in addressing major pitfalls of phylostratigraphy. However, the exact proportion of false positives in de novo gene studies remains unknown and it is unclear how many putative de novo genes should instead be considered fast evolving genes. A correct assessment of de novo genes is critical to establish their evolutionary history and more broadly to identify genomic features, if any, that may facilitate the emergence of novel genes.

Here, I address these issues by reanalyzing putative mouse de novo genes from three articles (Murphy and McLysaght 2012; Neme and Tautz 2013; Wilson et al. 2017) using a combination of sequence similarity searches and synteny information. I show that more than half of the 874 putative de novo genes previously described in mouse are absent in current versions of three major mouse gene annotation databases, an indication of how gene annotation volatility can affect de novo gene studies even among model organisms. Of the remaining putative de novo genes, only ∼40% could be validated. The dismissed putative de novo genes shared homology with genes found in multiple nonrodent vertebrates, derived from duplication of pre-existing mouse genes, and/or lacked synteny information with nonrodent mammals. I collectively refer to the putative de novo genes that failed to pass the validation criteria as the “de nono” genes. These findings also indicate that false positives in phylostratigraphy studies of de novo genes exceed previous estimates of type I error rates based on simulations in S. cerevisiae (Moyers and Zhang 2016). Contrary to what was suggested in a recent study (Wilson et al. 2017), I found no evidence of preadaptation in the validated mouse de novo genes. Instead, I observed that the trend reported by Wilson and collaborators, an inverse correlation between intrinsic structural disorder (ISD) of proteins and gene age suggestive of a lower tendency toward aggregation in proteins encoded by younger genes, is primarily due to high ISD levels in de novo genes whose coding regions overlap exons of older genes.

Materials and Methods

Putative De Novo Genes (PDNGs)

Murphy and McLysaght (2012): Mouse de novo gene IDs were retrieved from table 1 of the Murphy and McLysaght study (Murphy and McLysaght 2012). These genes were found using the Ensembl version 56. Protein-coding genes from the two closest available Ensembl versions, v54 and v67, were downloaded from the Ensembl archives (https://www.ensembl.org/info/website/archives/index.html). Out of the 69 putative mouse de novo genes, only 26 were still annotated as protein-coding genes in v67.

Table 1.

Original PDNG Sets, Currently Annotated PDNGs, and De Novo Genes Assessed in This Study

	M2012	N2013	W2017	Total
Mouse PDNG	69	773	84	874
Annotated in mm10	9	331	72	381
De novo genes	3 (7)	74 (139)	13 (18)	82 (152)

Note.—Numbers in parentheses refer to de novo genes remaining when genes from automatic annotation pipelines are included.

Neme and Tautz (2013): Neme and Tautz identified de novo genes using the mouse Ensembl version 66. We retrieved all the 80,007 mouse transcript and protein IDs and sequences annotated in the closest available data set, the archived Ensembl version 67. We found a match for 779 out of 780 mouse putative de novo genes in Ensembl v67 and selected the longest protein isoform of each gene for subsequent analyses. Six PDNGs were removed from the 779 gene set after applying a minimum protein length threshold of 30 amino acids, leading to a total of 773 analyzed PDNGs in the N2013 data set. The presence of N2013 PDNGs in the mouse Ensembl v54 proteome was determined using a tBLASTn search with an e-value threshold of 0.001. Hits that shared at least 90% sequence identity over at least half of the query were considered orthologous sequences.
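
The isoform selection and length filter described above can be illustrated with a short Python sketch. It is not the script used in this study: the function name, the tuple layout, and the toy identifiers are hypothetical, and treating the threshold as inclusive (30 amino acids or longer is kept) is an assumption.

```python
MIN_PROTEIN_LENGTH = 30  # amino acids, the threshold stated in the text


def longest_isoform_per_gene(isoforms):
    """Keep the longest protein isoform of each gene, then apply the length filter.

    isoforms: iterable of (gene_id, protein_id, protein_sequence) tuples.
    Returns {gene_id: (protein_id, sequence)}.
    """
    best = {}
    for gene_id, protein_id, seq in isoforms:
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (protein_id, seq)
    # Assumption: proteins of exactly 30 residues pass the filter.
    return {g: p for g, p in best.items() if len(p[1]) >= MIN_PROTEIN_LENGTH}


# Toy records with made-up identifiers and sequences.
records = [
    ("geneA", "protA1", "M" * 45),
    ("geneA", "protA2", "M" * 80),
    ("geneB", "protB1", "M" * 12),
]
selected = longest_isoform_per_gene(records)
```

Here `geneA` is represented by its 80-residue isoform, while `geneB` is dropped because its only protein is shorter than the threshold.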

Wilson et al. (2017): Ensembl gene IDs and sequences of the 84 mouse young genes were obtained from supplementary table 2 of Wilson et al. (2017). Transcript and protein IDs and sequences were retrieved from the archived Ensembl version 75 (https://www.ensembl.org/info/website/archives/index.html), the same version used in the W2017 paper.

Updated Annotation of PDNGs

The UCSC Genome Browser and Table Browser have been used to retrieve genome coordinates of PDNGs from the three papers’ data sets (http://genome.ucsc.edu/cgi-bin/hgTables). However, the Ensembl gene track from the most recent mouse genome assembly (GRCm38/mm10, December 2011) does not contain all Ensembl IDs corresponding to PDNGs. Genome coordinates of PDNGs were therefore retrieved using the previous mouse genome assembly (NCBI37/mm9, July 2007). These coordinates were then transformed into coordinates of the mouse mm10 assembly using the LiftOver tool (http://genome.ucsc.edu/cgi-bin/hgLiftOver). The genome coordinates of most PDNG coding exons were successfully lifted to the mm10 assembly and uploaded to the Galaxy portal (https://usegalaxy.org). Genome coordinates of the coding exons of mouse RefSeq, GENCODE M16 (same gene set as Ensembl v91) and UCSC “known” genes were also uploaded to Galaxy. PDNG coding exons were then joined using the Galaxy tool “Join” in the Menu “Operate on Genomic Intervals” applying “All records of first data set” to return their overlap with coding exons of each of the three gene sets with a minimum of 50 bp overlap (all PDNGs had at least one coding exon longer than 50 bp). Coding exons of 381 PDNGs overlapped with coding exons of at least one gene from the three used gene sets. Note that overlapping genes on opposite strands of PDNGs were not excluded. PDNGs whose IDs were not available in the mm9 Ensembl track were reannotated by querying their coding sequences against the mm10 genome assembly using blat (http://genome.ucsc.edu/cgi-bin/hgBlat) to visually find matches with annotated GENCODE M16, RefSeq, and UCSC known genes. The M2012 and W2017 PDNGs were visually inspected on the UCSC Genome Browser for overlap with annotated genes. Annotation was confirmed for 9/69 M2012 PDNGs (table 1).
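
The overlap criterion applied in Galaxy (at least 50 bp shared between a PDNG coding exon and a coding exon of an annotated gene, regardless of strand) amounts to a simple interval comparison. The Python sketch below is only an illustration of that criterion, not a reimplementation of the Galaxy “Join” tool: the function names are hypothetical, and coordinates are assumed to be BED-style (0-based start, exclusive end).

```python
MIN_OVERLAP_BP = 50  # minimum overlap stated in the text


def overlap_bp(a, b):
    """Number of shared bases between two (chrom, start, end) intervals."""
    if a[0] != b[0]:
        return 0
    return max(0, min(a[2], b[2]) - max(a[1], b[1]))


def is_annotated(pdng_exons, reference_exons, min_overlap=MIN_OVERLAP_BP):
    """True if any PDNG coding exon overlaps any reference coding exon enough.

    Strand is deliberately ignored, mirroring the note that overlapping genes
    on the opposite strand were not excluded.
    """
    return any(
        overlap_bp(p, r) >= min_overlap
        for p in pdng_exons
        for r in reference_exons
    )
```

For example, an exon at `("chr1", 100, 200)` and a reference exon at `("chr1", 150, 400)` share exactly 50 bases and would count as a match, whereas shifting the reference start to 160 would not.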

Sequence Similarity Analyses to Identify Paralogs

Several types of BLAST searches on multiple mouse databases were carried out using a consistent e-value threshold of 0.001. The mouse genome assembly mm10 was searched locally using tBLASTn (Camacho et al. 2009). In searches against the mouse genome only hits against the coding sequence of known genes were considered valid paralogous genes of putative de novo genes. The combined mouse proteomes from GENCODE M16 genes (Harrow et al. 2012), the RefSeq genes (O'Leary et al. 2016), and the UCSC Genome Browser “known genes” (http://genome.ucsc.edu/cgi-bin/hgTables) databases were searched locally using BLASTP. BLAST results were parsed and filtered using perl scripts and Unix commands. Matches over <50% of the query sequence were removed to increase stringency. Matches of PDNGs with multiple proteins were carefully inspected to ensure that they corresponded to multiple loci rather than alternative transcripts of the same gene. Alignments of PDNGs with a single match were also inspected to determine if these hits represented paralogous genes rather than self-hits.
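
The automated part of this filtering can be sketched as follows. The sketch assumes BLAST results in the standard 12-column tabular format (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score) and a separately supplied table of query lengths; the function name is hypothetical, and the perl scripts actually used are not reproduced here. Discarding hits whose subject ID equals the query ID is a simplification of the manual inspection for self-hits and alternative transcripts described above.

```python
E_VALUE_MAX = 0.001        # e-value threshold stated in the text
MIN_QUERY_COVERAGE = 0.5   # matches over <50% of the query are removed


def candidate_paralogs(blast_lines, query_lengths):
    """Return {query_id: set(subject_ids)} for hits passing both filters."""
    hits = {}
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        qstart, qend = int(fields[6]), int(fields[7])
        evalue = float(fields[10])
        if query == subject:
            continue  # simplification: identical IDs are treated as self-hits
        coverage = (abs(qend - qstart) + 1) / query_lengths[query]
        if evalue <= E_VALUE_MAX and coverage >= MIN_QUERY_COVERAGE:
            hits.setdefault(query, set()).add(subject)
    return hits


# Toy input: one self-hit, one good hit, one short hit, one weak hit.
rows = [
    "p1\tp1\t100.0\t120\t0\t0\t1\t120\t1\t120\t1e-80\t250",
    "p1\tp2\t62.5\t80\t30\t0\t10\t89\t5\t84\t2e-20\t95",
    "p1\tp3\t40.0\t30\t18\t0\t1\t30\t1\t30\t1e-5\t40",
    "p1\tp4\t35.0\t90\t58\t0\t1\t90\t1\t90\t0.5\t25",
]
paralogs = candidate_paralogs(rows, {"p1": 120})
```

Only `p2` survives in this toy case: `p3` covers a quarter of the query and `p4` exceeds the e-value threshold.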

Similar searches were performed locally using the algorithm phmmer in the HMMER suite (http://hmmer.org/). Each PDNG protein set was queried against the three combined GENCODE, RefSeq, and UCSC proteomes using default parameters. Results were visually inspected to identify matches between protein sets. Hits with c-Evalue and i-Evalue below 0.001 were considered positive matches.

Sequence Similarity Analyses to Identify Homologs in Vertebrates

The vertebrate (taxid:7742) NCBI nr protein database was interrogated between September 2017 and January 2018 in the NCBI BLAST portal (https://blast.ncbi.nlm.nih.gov/Blast.cgi) using default settings except an e-value threshold of 0.001, excluding Rodents (taxid:9989). The reference proteomes database (https://www.ebi.ac.uk/reference_proteomes) was interrogated between September 2017 and January 2018 using phmmer with a higher than default stringency e-value of 1e⁻⁰⁵ and excluding rodents from the search (https://www.ebi.ac.uk/Tools/hmmer/search/phmmer). Only PDNG proteins with significant hits with proteins from at least two vertebrates were considered positive matches in both BLAST and phmmer searches.
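
The “at least two vertebrates” rule is easy to state in code. The following Python sketch assumes that significant nonrodent hits have already been reduced to (PDNG, species) pairs; the function name and the toy identifiers are hypothetical.

```python
MIN_SPECIES = 2  # a PDNG needs significant hits in at least two species


def pdngs_with_vertebrate_homologs(hits, min_species=MIN_SPECIES):
    """Return the set of PDNG IDs with hits in enough distinct species.

    hits: iterable of (pdng_id, species_name) pairs, one per significant hit.
    """
    species_by_pdng = {}
    for pdng_id, species in hits:
        species_by_pdng.setdefault(pdng_id, set()).add(species)
    return {p for p, s in species_by_pdng.items() if len(s) >= min_species}


# Toy hits: pdngX matches two species, pdngY matches one species twice.
toy_hits = [
    ("pdngX", "Homo sapiens"),
    ("pdngX", "Bos taurus"),
    ("pdngY", "Homo sapiens"),
    ("pdngY", "Homo sapiens"),
]
positives = pdngs_with_vertebrate_homologs(toy_hits)
```

Counting distinct species rather than hits prevents several matches within a single genome from being mistaken for broad taxonomic support.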

Synteny Analyses

Mouse genome coordinates in BED format of the PDNGs coding exon were retrieved from the UCSC Genome Browser table browser tool (http://genome.ucsc.edu/cgi-bin/hgTables) using either Ensembl identifiers or novel RefSeq/UCSC identifiers from the reannotation of the three data sets (supplementary tables S1 and S2, Supplementary Material online). The BED coordinates were then sent to the Galaxy portal (https://usegalaxy.org).

I generated a workflow on Galaxy (https://usegalaxy.org/u/claudiocasola/w/maf-blocks-for-mouse-mm10-sequences) to obtain MAF blocks (Multiple Alignment Format blocks) from aligned sequences in the mouse genome assembly mm10 (Blankenberg et al. 2011). Briefly, the workflow utilizes genome coordinates to extract MAF blocks from the 100-way multiZ alignment based on the human genome assembly hg19. Overlapping MAF blocks were merged, filtered to retain only mouse blocks, and joined to the coordinates of each coding exon of the putative de novo genes. A few remaining overlapping MAF blocks were manually removed from the MAF data sets.

Protein Domain Analyses

The NCBI Conserved Domain repository (Marchler-Bauer et al. 2017) was interrogated with proteins encoded by PDNGs between September 2017 and January 2018 using default parameters except inclusion of retired sequences in the batch search portal (https://www.ncbi.nlm.nih.gov/Structure/bwrpsb/bwrpsb.cgi). Conserved domains of PDNGs with no evidence of paralogy and lack of synteny were also searched through the InterPro server (https://www.ebi.ac.uk/interpro/) in March 2018.

Gene Structure

Gene length, coding length, intron length, UTRs length, and exon number of mouse genes were obtained from the GENCODE M16 data set through the UCSC Table Browser. The 64,506 transcripts were filtered to remove noncoding sequences and genes with only automatic annotation. Transcripts matching PDNGs were also removed. All except the shortest transcripts of the remaining genes were removed, leaving 20,470 genes.

Quality of Gene Annotation

The annotation quality of validated de novo genes was assessed by retrieving data from the GENCODE M16 Basic gene set and the UCSC known genes from the UCSC Table Browser. GENCODE transcript support levels range from 1 (all splice junctions of the transcript are supported by at least one nonsuspect mRNA) to NA (the transcript was not analyzed). Additionally, high-quality GENCODE genes are manually annotated in HAVANA (https://www.gencodegenes.org/gencodeformat.html). Forty-two de novo genes with either no HAVANA ID or a transcript support level equal to NA were considered low-quality genes. Similarly, 31 UCSC known transcripts that had not been reviewed or validated by the RefSeq, SwissProt, or CCDS staff were considered low quality.

Protein Disorder and Protein Aggregation Analyses

The software PASTA 2.0 (Walsh et al. 2014) with default settings was used to estimate ISD in proteins encoded by the three PDNG sets and 20,391 non-de novo Ensembl v91 proteins, including overlapping non-de novo proteins (see below).

Estimates of Gene Duplication Rates

Gene duplication events estimated to have occurred in mouse since its divergence from the Brown Norway rat were obtained from Worley et al. (2014). Gene duplications and losses were modeled using the maximum-likelihood framework implemented in the CAFE package (Han et al. 2013). A total of 1,052 mouse-specific gene duplications were calculated based on gene family data totaling 18,215 genes (supplementary fig. 7 in Worley et al. 2014). Assuming ∼22,000 genes in the mouse genome, the overall number of gene duplicates is estimated at 1,275, leading to a duplication rate of ∼106/Myr.
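
The extrapolation can be retraced with a few lines of arithmetic. In the sketch below the 12 Myr divergence time is the mouse–rat estimate quoted elsewhere in this article, and the round figure of 22,000 genes is an assumption taken from the “∼22,000” in the text; with that round figure the scaled count comes out slightly below the reported 1,275, presumably because a more precise gene count was used. The rate is therefore computed from the reported total.

```python
DUPLICATIONS_IN_FAMILIES = 1052   # mouse-specific duplications (Worley et al. 2014)
GENES_IN_FAMILIES = 18215         # genes covered by the gene family data
GENOME_GENES = 22000              # assumption: round value for "~22,000"
DIVERGENCE_MYR = 12               # mouse-rat divergence time used in this article
REPORTED_DUPLICATES = 1275        # genome-wide total reported in the text

# Scale the per-family count up to the whole gene set.
scaled_duplicates = DUPLICATIONS_IN_FAMILIES * GENOME_GENES / GENES_IN_FAMILIES

# Duplications per million years, using the reported genome-wide total.
duplication_rate = REPORTED_DUPLICATES / DIVERGENCE_MYR
```

The same division applied to the 780 novel genes reported by Neme and Tautz (2013) over 12 Myr gives the 65 genes/Myr cited in the Introduction.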

Overlap Between Genes

Each validated de novo gene was visually inspected through the UCSC Genome Browser to identify possible overlap with other genes. To find the genome-wide proportion of overlapping genes, transcripts from all genes in the mouse GENCODE M16 basic gene set were downloaded from the UCSC Table Browser. Transcripts from the mitochondrial genome, nonprotein-coding transcripts, and transcripts from de novo genes were removed. For each remaining gene, only the longest transcript was retained, leaving a total of 22,396 genes, of which 1,876 (∼8.4%) overlapped other genes.
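
One way to obtain such a genome-wide count is to sort genes by start coordinate and sweep along each chromosome, as in the Python sketch below. This is an assumed approach offered for illustration, not the procedure actually used; the function name and tuple layout are hypothetical, coordinates are taken to be 0-based with an exclusive end, and strand is ignored.

```python
def genes_with_overlap(genes):
    """Return IDs of genes that overlap at least one other gene.

    genes: iterable of (gene_id, chrom, start, end) tuples.
    """
    by_chrom = {}
    for gene in genes:
        by_chrom.setdefault(gene[1], []).append(gene)

    overlapping = set()
    for chrom_genes in by_chrom.values():
        chrom_genes.sort(key=lambda g: g[2])
        # Track the already-seen gene that extends furthest to the right.
        furthest_id, furthest_end = None, -1
        for gene_id, _, start, end in chrom_genes:
            if start < furthest_end:
                # The current gene begins before an earlier gene has ended.
                overlapping.add(gene_id)
                overlapping.add(furthest_id)
            if end > furthest_end:
                furthest_id, furthest_end = gene_id, end
    return overlapping


# Toy example: A, B, and C overlap on chr1; D and E stand alone.
toy_genes = [
    ("A", "chr1", 0, 10),
    ("B", "chr1", 5, 100),
    ("C", "chr1", 8, 9),
    ("D", "chr1", 200, 300),
    ("E", "chr2", 0, 50),
]
flagged = genes_with_overlap(toy_genes)
proportion = len(flagged) / len(toy_genes)
```

Applied to the figures reported above, the corresponding proportion is 1,876/22,396, or roughly 8.4%.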

All overlapping regions between genes in the mouse GENCODE M16 gene set were retrieved from the GTF file downloaded from the UCSC Table Browser. After removing all entries except coding exons, overlap between genes was determined by sorting and comparing genome coordinates. The 555 overlapping exons were then inspected by their Ensembl names and their genome coordinates to remove the de nono and de novo candidates listed in supplementary tables S1 and S2, Supplementary Material online. Overlapping instances shorter than 50 bp were also removed. The DNA sequences of the remaining unique 206 exons were downloaded using the Custom Track, sequence output format from the UCSC Table Browser. Each of the six frames of these sequences was translated using the EMBOSS Transeq tool available at https://www.ebi.ac.uk/Tools/st/emboss_transeq/. These sequences were then matched with the full-length protein sequences of the 206 genes with overlap using BLASTP with default parameters except for an e-value threshold of 0.1. After removing translated overlapping exons with <100% identity with the full-length proteins, a total of 121 unique protein sequences encoded by overlapping exons remained.
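
The six-frame translation step can be illustrated with a self-contained Python sketch that uses the standard genetic code. It stands in for EMBOSS Transeq only conceptually, and the exact-substring check at the end is a simplified substitute for the BLASTP comparison followed by the 100% identity filter; all function names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Frames 0-2 are on the forward strand, frames 3-5 on the reverse strand."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [
        translate(strand[offset:])
        for strand in (dna, reverse)
        for offset in range(3)
    ]


def frames_matching_protein(dna, protein):
    """Indices of frames whose translation occurs verbatim in the protein."""
    return [
        i for i, peptide in enumerate(six_frame_translations(dna))
        if peptide and peptide in protein
    ]


matches = frames_matching_protein("ATGGCCAAG", "MSMAKLL")
```

In this toy case only the first forward frame, which translates to `MAK`, is found within the protein.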

Rat De Novo Gene Orthologs

The genome coordinates of the 152 validated de novo genes from the mouse assembly mm10 were used to retrieve syntenic regions in the rat rn6 assembly with the LiftOver tool in the UCSC Genome Browser. A total of 29,107 Ensembl and transcript coding regions were downloaded using the UCSC Table Browser (data last updated: June 9, 2017). Proteins and CDS were searched against the retrieved rat genomic regions syntenic with mouse de novo genes using tBLASTn with an e-value threshold of 0.001. The BLAST results were parsed using an in-house Perl script and filtered to retain hits longer than 30 bp and with at least 97% DNA sequence identity between CDS and genome. This step left 67 proteins putatively orthologous to mouse de novo genes. Some of these proteins were orthologous to proteins that in mouse overlap with the validated de novo genes. Thus, I manually inspected these proteins against the mouse mm10 assembly using BLAT in the UCSC Genome Browser and ran a BLASTP search between the 152 mouse de novo proteins and the 67 candidate rat orthologs. The same approach was used to retrieve 17,619 rat RefSeq proteins and CDS. I obtained 168 candidates that were screened against the 133 mouse validated de novo genes. Additionally, I searched the combined 235 candidate Ensembl and RefSeq proteins against the 152 mouse de novo proteins using phmmer locally with default settings.

Results and Discussion

Putative De Novo Gene Annotation Status

In this study, I reanalyzed putative de novo genes (hereafter: PDNGs) from three articles focused on rodent genomes (Murphy and McLysaght 2012; Neme and Tautz 2013; Wilson et al. 2017). Hereafter, I will refer to these works as M2012 (Murphy and McLysaght 2012), N2013 (Neme and Tautz 2013), and W2017 (Wilson et al. 2017). I specifically focused on mouse-specific genes from the M2012 and N2013 studies, and on the rodent-specific genes from the W2017 study. Similarly to the N2013 paper, in the W2017 article “young genes” are not clearly separated into gene duplicates or de novo genes. However, in several sentences throughout this article Wilson et al. suggest that these young genes are mostly or entirely the result of novel gene birth, as also indicated by the title of their article “Young genes are highly disordered as predicted by the preadaptation hypothesis of de novo gene birth.” Throughout this paper, I will consider “young genes” from both the N2013 and the W2017 articles as PDNGs.

A total of 491 previously reported rodent PDNGs, particularly those from the M2012 and N2013 studies, are not annotated as protein-coding genes in the updated versions of three major mouse gene annotation databases: GENCODE M16, RefSeq, and UCSC Genome Browser “known” genes (table 1). This is expected given that the three studies were based on less-well curated gene sets. Surprisingly, the N2013 and W2017 data sets contained several PDNGs lacking a start codon. These genes were excluded from further analysis in this work. The final count of PDNGs with confirmed annotation as protein-coding genes was 9, 331, and 72 from the M2012, N2013, and W2017 studies, respectively (table 1). After excluding overlap between the three gene sets, I retrieved 381 PDNGs, which represent 44% of the 874 genes originally reported as de novo genes. One could argue that some true de novo genes are missing from the three gene databases analyzed in this study. For example, some authors have pointed out that a fraction of de novo genes may not be annotated by standard pipelines because they fall below the applied gene length threshold or are “invisible” to annotation algorithms that rely on sequence similarity with genes and proteins from other species (Tautz and Domazet-Loso 2011). However, the databases used here contain several genes with very short coding regions; for example, the gene Rpl41 encodes a 25 amino acid long protein and is annotated in GENCODE M16, RefSeq, and UCSC Genome Browser “known” genes set. Sequence similarity alone is not critical to gene annotation, except for the prediction of gene function; evidence of expression and intact coding sequences are essential to gene identification. Therefore, it is unlikely that these factors affect significantly the accuracy of de novo gene surveys. 
Below, I describe the four criteria I used to assess the proportion of PDNGs that represent “de nono” genes: presence of paralogous genes in mouse (inparalogs); homology with genes found in multiple nonrodent vertebrates; lack of synteny information with nonrodent mammals; and presence of conserved domains found in nonrodent proteins (figs. 1 and 2).
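
The way these four criteria combine can be summarized in a short Python sketch. The class and function names are hypothetical, and the sketch assumes that each criterion has already been reduced to a yes/no call for a given PDNG; a single positive call is enough to label the gene “de nono.”

```python
from dataclasses import dataclass


@dataclass
class PdngEvidence:
    """One boolean per criterion; True means the criterion flags the gene."""
    has_inparalog: bool
    has_vertebrate_homolog: bool
    lacks_synteny: bool
    has_conserved_domain: bool


def classify(evidence):
    """Return (label, list of criteria that flagged the gene)."""
    flagged = [name for name, value in vars(evidence).items() if value]
    label = "de nono" if flagged else "de novo"
    return label, flagged


validated = classify(PdngEvidence(False, False, False, False))
dismissed = classify(PdngEvidence(True, False, True, False))
```

Because the criteria are not mutually exclusive, a gene can be flagged by several of them at once, which is why the per-criterion counts in table 2 add up to more than the overall total.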

Fig. 1.—Features distinguishing de novo and “de nono” genes. Rectangles, solid lines, and dashed lines represent genes, nongenic syntenic regions, and nonsyntenic regions, respectively. Presence of enabler substitutions (lightning bolts), absence of inparalogs and homologs in other species, conserved synteny and lack of conserved domains characterize de novo genes. Putative de novo genes that fail to conform to one or more of these criteria represent “de nono” genes. Myr, million years.

Fig. 2.—Examples of de novo and “de nono” rodent genes visualized through the UCSC Genome Browser. (a) An intergenic de novo gene with relatively low synteny conservation across several nonrodent mammals. (b) A de novo gene (blue asterisk) overlapping with the 3′UTR of an older gene. Notice that the gene symbol is the same for the two genes, which share no coding or protein similarity. (c) Summary of “de nono” gene features. (d) A “de nono” gene (*Cd52*, red asterisk) with conserved flanking genes in mouse (top) and human (bottom). (e) A tandem array of keratin-associated genes including three “de nono” genes (red asterisks). (f) A “de nono” gene with no synteny conservation beyond rat. Coding exons, UTRs and introns are shown as thick blue bars, thin blue bars and lines with arrows, respectively. When annotated, alternative transcripts are shown. The conservation track (green bars and single or double lines) represents the level of sequence identity between the highlighted mouse genomic region and its orthologous regions in other mammals, based on MultiZ alignments of 60 vertebrate genomes. The height of the green bars is proportional to the level of nucleotide sequence identity. The single line indicates no bases in the aligned species due to indels, whereas the double line shows regions with one or more unalignable bases. The pale yellow coloring indicates regions with Ns. The lack of any feature in the conservation track designates a region with no alignable bases and thus a complete lack of synteny conservation.

Sequence Similarity Analyses to Identify Homologous Genes in Nonrodent Genomes

According to the phylostratigraphic approach, proteins that are found in a given species/lineage and share no significant sequence similarity with proteins from other taxa must be encoded by “novel” genes, either de novo genes or duplicated copies of older genes (fig. 1). These sequences have been defined as “orphan genes” or “taxonomically restricted genes” depending on the authors (Khalturin et al. 2009; Tautz and Domazet-Loso 2011). The assumption of “novelty” in orphan genes can be violated under two scenarios, which I discuss below with regard to PDNGs. First, novel proteins are routinely added to existing sequence databases, thus expanding the sequence space available to search for possible homologous sequences of PDNGs. To explore this possibility I carried out tBLASTn searches against the NCBI vertebrate nucleotide nonredundant database using mouse de novo proteins. Second, sequence similarity search algorithms other than those used in the original studies may reveal yet unrecognized homologs of PDNGs. For instance, profile-based approaches such as phmmer can be more accurate than nonprofile methods, including BLAST, in sequence homology searches (Saripella et al. 2016). A combination of these methods has recently been applied to detect de novo genes in 15 species from 2 yeast phyla (Vakirlis et al. 2017). I therefore interrogated a reference proteome database available through the EMBL-EBI phmmer server to identify PDNG homologs that are not recognized using BLASTP (see Materials and Methods). Finally, I visually inspected all PDNGs with synteny information to find possible orthologs in the human genome using the UCSC Genome Browser net-alignment track (Schwartz et al. 2003). Combining the results of both analyses I identified 98 PDNGs (26% of all PDNGs) with homologs in 2 or more vertebrate species (table 2), including 14 PDNGs with orthologous genes in human (fig. 2d; supplementary table S1, Supplementary Material online). As expected, alignments of proteins from some of these orthologs showed short regions of sequence conservation (supplementary fig. S1, Supplementary Material online).

Table 2.

Summary of Homology Searches and Synteny Analysis for the Three PDNG Sets

	M2012	N2013	W2017	Combined PDNGs
Inparalogs (BLAST)^a	0	63	27	81
Inparalogs (phmmer)^a	0	80	31	102
PDNGs w/ inparalogs	0	88	32	110
Homology in Vertebrates (BLAST)^b	1	60	5	62
Homology in Vertebrates (phmmer)^b	0	33	15	43
PDNGs w/ homologs in vertebrates	1	86	20	98
Presence of protein domain	1	23	19	39
Lack of synteny^c	1	110	30	131
Overall total	3	192	55	229

^a Significant similarity (BLAST: e-value ≤0.001; phmmer: i-Evalue ≤0.001) with GENCODE M16, RefSeq and/or UCSC Genome Browser mouse genes.

^b Homologous sequences (BLAST: e-value ≤0.001; phmmer: i-Evalue ≤0.001) found in at least two nonrodent vertebrate species.

^c No synteny conservation of PDNG coding regions across mammals.

For five of these PDNGs with human orthologs, no inparalogs or vertebrate homologs were detected; however, genes with the same name were present in human and a visual inspection of their syntenic region, including nearby genes, confirmed that they were in synteny with the mouse PDNGs (supplementary fig. S2, Supplementary Material online).

Orthology relationships of PDNGs that were part of tandem arrays were not established given that some tandem arrays tend to experience high rates of gene turnover. Thus, the number of PDNGs with orthologs in human could be higher. In approximately one-third of these PDNGs (36/98) homology to nonrodent genes was uniquely detected through the phmmer search, indicating that a significant proportion of false positives in de novo surveys will be undetected using BLAST-only approaches.
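The split between the two search methods can be recovered from the combined column of table 2 by inclusion-exclusion. The following Python sketch reproduces that arithmetic; the function name `partition_counts` is illustrative and not part of the original analysis.

```python
def partition_counts(n_blast: int, n_phmmer: int, n_union: int) -> dict:
    """Split the union of two hit sets into shared and method-specific counts."""
    shared = n_blast + n_phmmer - n_union  # inclusion-exclusion
    return {
        "shared": shared,
        "blast_only": n_blast - shared,
        "phmmer_only": n_phmmer - shared,
    }

# Combined PDNG column of table 2, homology in vertebrates:
# 62 found by BLAST, 43 found by phmmer, 98 found by either method.
vertebrate = partition_counts(n_blast=62, n_phmmer=43, n_union=98)
phmmer_only_fraction = vertebrate["phmmer_only"] / 98

print(vertebrate)
print(round(phmmer_only_fraction, 3))
```

The phmmer-only count obtained this way matches the 36/98 quoted in the text, which is the share of homologs that a BLAST-only survey would miss.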

Similarity Analyses to Identify Paralogous Genes in Mouse

Studies aiming at identifying de novo genes should be able to discriminate between them and the second possible type of orphan genes, young gene duplicates. This might be difficult to do because de novo genes, like any other type of gene, can undergo duplication and form novel gene families in a genome. However, it is reasonable to consider the formation of large gene families of de novo genes unlikely in the relatively short evolutionary time period since mouse–rat divergence, which occurred as late as ∼12 Ma (Kimura et al. 2015). Perhaps more noteworthy, a conservative approach should be applied to the study of de novo genes by discarding candidates with paralogous genes in the same genome, or inparalogs. To assess the frequency of PDNGs with inparalogs, I performed similarity searches based on tBLASTn, BLASTn, and phmmer. Using the same threshold commonly applied in phylostratigraphy studies with BLAST, namely a maximum e-value of 0.001, and stringent criteria for phmmer results (see Materials and Methods), I found that 110 genes (29% of all PDNGs) have at least one paralogous gene in GENCODE M16, RefSeq, or UCSC known gene murine data sets (fig. 2d and table 2). Notably, only 29/110 of these PDNGs formed gene families exclusively with other PDNGs (supplementary table S1, Supplementary Material online, last column). Most of these genes belong to large families and are active primarily or exclusively in the testis, an expression pattern often associated with low sequence conservation due to high substitution rates. Examples include the three genes encoding seminal vesicle secretory proteins Svs4-6 and the four cysteine-rich perinuclear theca genes Cypt1-4. Two more genes, Gm19684 and Gm6034, encode Class I histocompatibility antigens, a group of genes also known to undergo rapid evolution. Thus, it appears that most “de nono” genes with inparalogs derive from duplication events of ancient genes. Nongenic homologous sequences to PDNGs were also found using BLASTn searches against the mm10 mouse genome assembly, but were not included in the validation of PDNGs.

Fifty-four “de nono” genes clustered in 20 tandem arrays, defined as groups of “de nono” genes <100 kb apart (fig. 2e; supplementary table S1, Supplementary Material online). At least one gene pair from each array was found using a 10 kb distance cutoff. Several lines of evidence suggested that these arrays were not entirely formed by de novo genes. Many arrays contained other paralogs that were not annotated as PDNGs in the first place (fig. 2e). Moreover, PDNGs in some arrays belonged to known gene families present in other mammals. For instance, two arrays contained keratin-associated genes and another one was found within a cluster of defensin genes. Finally, most “de nono” genes in arrays (37/54) showed no synteny conservation with other mammals, as expected in the presence of lineage-specific duplications rather than de novo gene formation. This finding underscores the importance of implementing more rigorous homology searches within the focal genome in de novo gene studies in order to remove false positives due to young gene duplicates. At the very least, clustering PDNGs into gene families allows the number of novel gene birth events to be estimated correctly. For example, using this approach and extensive sequence similarity searches, Vakirlis et al. (2016) identified dozens of taxonomically restricted (i.e., orphan) gene families in the yeast genus Lachancea with multiple inparalogs in one or multiple species.
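The tandem-array definition used here (genes <100 kb apart) can be illustrated with a short Python sketch. It is a minimal example under stated assumptions, not the procedure used in this study: distance is taken as the gap between the end of one gene and the start of the next on the same chromosome, and the gene names and coordinates are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple

class Gene(NamedTuple):
    name: str
    chrom: str
    start: int
    end: int

def tandem_arrays(genes, max_gap=100_000):
    """Group genes on the same chromosome whose neighbours lie < max_gap bp apart."""
    by_chrom = defaultdict(list)
    for g in genes:
        by_chrom[g.chrom].append(g)
    arrays = []
    for chrom_genes in by_chrom.values():
        chrom_genes.sort(key=lambda g: g.start)
        current, reach = [chrom_genes[0]], chrom_genes[0].end
        for g in chrom_genes[1:]:
            if g.start - reach < max_gap:
                current.append(g)            # extends the current array
                reach = max(reach, g.end)
            else:
                if len(current) > 1:         # singletons are not arrays
                    arrays.append(current)
                current, reach = [g], g.end
        if len(current) > 1:
            arrays.append(current)
    return arrays

# Hypothetical coordinates, for illustration only.
genes = [
    Gene("geneA", "chr1", 1_000_000, 1_005_000),
    Gene("geneB", "chr1", 1_050_000, 1_060_000),
    Gene("geneC", "chr1", 1_500_000, 1_510_000),
    Gene("geneD", "chr2", 200_000, 210_000),
    Gene("geneE", "chr2", 215_000, 220_000),
    Gene("geneF", "chr2", 290_000, 300_000),
]
array_names = [[g.name for g in arr] for arr in tandem_arrays(genes)]
print(array_names)
```

Calling `tandem_arrays(genes, max_gap=10_000)` applies the stricter 10 kb cutoff mentioned above to the same input.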

Synteny Analyses

Synteny information in de novo gene investigations is crucial to detect enabler substitutions by comparing putative novel coding sequences with noncoding orthologous regions in sister taxa (McLysaght and Guerzoni 2015; Vakirlis et al. 2017). To assess synteny conservation of PDNGs, I used genome-wide alignment data available through the Galaxy portal. Approximately 34% of PDNGs (131/381) exhibited no synteny conservation in a 60-vertebrate alignment, which includes 40 mammalian genomes (fig. 2f; supplementary table S1, Supplementary Material online). It is arguable that some of these “de nono” genes with no synteny information may represent true de novo genes that evolved in genomic regions that have been lost through deletions in nonrodent mammals. However, this is unlikely given that the synteny information relies on alignments of a large number of mammalian genomes (Blankenberg et al. 2011). Given the phylogenetic distribution of the species present in the multialignment, loss of syntenic regions would have had to occur independently in no fewer than three mammalian lineages. This also does not account for possible further losses of synteny within Glires (lagomorphs and rodents), which were not assessed in this study.

Many “de nono” genes with no apparent orthologs outside rodents could represent lineage-specific copies of older parent genes. Indeed, 76/131 “de nono” genes with no synteny conservation shared homology with at least one inparalog. Some of the remaining 55 “de nono” genes with no synteny conservation may constitute rapidly evolving young gene duplicates with no detectable sequence similarity with their parent genes; for example, four of them belong to tandem arrays. As expected, the majority of “de nono” genes with conserved synteny across mammals (76/98) shared homology with genes found outside rodents.

Conserved Domains in PDNGs

Conserved domains were found in protein sequences encoded by 39 PDNGs (supplementary table S1, Supplementary Material online). As expected, some of these domains belong to gene families identified in tandem arrays, such as defensin and keratin. Some conserved domains were not functionally characterized; for instance, two PDNGs encoded peptides with domains of unknown function and two other proteins contained a proline-rich domain. Arguably, these domains might belong to novel proteins present in multiple rodents. However, all PDNGs encoding proteins with less well-characterized domains failed to pass one or multiple other criteria to be considered valid “de novo” genes (supplementary table S1, Supplementary Material online).

New Estimates of Rodents De Novo Genes

The results of homology searches and synteny analysis showed that only 152 of the 381 PDNGs (∼40%) annotated as protein-coding genes represent de novo genes (fig. 2a and b). Notably, 70/152 (46%) of these PDNGs were automatically annotated and might represent pseudogenes (table 1). Excluding these genes from the annotation list brings down the number of validated PDNGs to 85/314 (∼27%). Moreover, this set of 152 de novo genes likely includes several “de nono” cases for three reasons. First, the criteria applied in the homology searches were particularly stringent. Nongenic paralogous sequences were excluded from inparalog searches and I required at least two nonrodent species to show significant similarity with PDNGs to identify “de nono” genes. A stringent threshold was also applied in the phmmer searches (see Materials and Methods). Second, several PDNGs showed synteny with nonrodent genomes for as little as 10% of their coding regions. Third, enabler substitutions have not been searched for in the N2013 and W2017 PDNG sets (Neme and Tautz 2013; Wilson et al. 2017). Some de novo genes are also likely absent from the three studies analyzed here. Accordingly, analyses in human and mouse have linked lncRNAs to the origin of de novo genes, including many de novo genes not reported before (Xie et al. 2012; Chen et al. 2015; Ruiz-Orera et al. 2015; Neme and Tautz 2016).

Overall, errors in de novo gene detection depended on lack of synteny (131 genes, 34% of PDNGs), presence of mouse paralogous genes (110 genes, 29%), and/or homology with genes from at least two nonrodent vertebrates (98 genes, 26%), as summarized in table 2. In line with results from the reanalysis of budding yeast PDNGs (Carvunis et al. 2012; Moyers and Zhang 2016), these findings call for implementing more rigorous strategies to validate PDNGs identified through phylostratigraphy using a combination of synteny analysis and extensive sequence similarity searches. Both strategies are readily implemented using existing databases and software, particularly in model taxa with established synteny data. In nonmodel organisms, validation of PDNGs is more problematic given the paucity of multiple closely related genomes and genome-wide alignments necessary to retrieve synteny information.
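The three sources of error listed above can be combined into a simple per-candidate check. The Python sketch below is a hedged illustration rather than the pipeline used in this study: the `Candidate` fields, the example e-values, and the species labels are hypothetical, and the conserved-domain criterion reported in table 2 is left out for brevity. Only the e-value cutoff (0.001) and the requirement of at least two nonrodent species come from the text.

```python
from dataclasses import dataclass

EVALUE_MAX = 0.001          # cutoff reported for BLAST e-values and phmmer i-Evalues
MIN_NONRODENT_SPECIES = 2   # homologs required in at least two nonrodent vertebrates

@dataclass
class Candidate:
    name: str
    inparalog_evalues: list   # e-values of hits to other mouse genes (hypothetical field)
    nonrodent_hits: dict      # species -> best e-value (hypothetical field)
    has_synteny: bool         # coding region aligned in nonrodent mammals

def rejection_reasons(c: Candidate) -> list:
    """Return the reasons a candidate fails validation; an empty list means it passes."""
    reasons = []
    if any(e <= EVALUE_MAX for e in c.inparalog_evalues):
        reasons.append("inparalog")
    species = [s for s, e in c.nonrodent_hits.items() if e <= EVALUE_MAX]
    if len(species) >= MIN_NONRODENT_SPECIES:
        reasons.append("vertebrate homolog")
    if not c.has_synteny:
        reasons.append("lack of synteny")
    return reasons

# Hypothetical candidates, for illustration only.
cand1 = Candidate("cand1", [], {}, True)
cand2 = Candidate("cand2", [1e-10], {"human": 1e-5, "dog": 0.5}, False)
cand3 = Candidate("cand3", [0.5], {"human": 1e-5, "chicken": 1e-4}, True)

for c in (cand1, cand2, cand3):
    print(c.name, rejection_reasons(c) or "validated")
```

Because a candidate can fail more than one check, the per-criterion counts in table 2 sum to more than the overall total of rejected PDNGs.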

The number of confirmed de novo genes varies significantly across the three studies. Seven of the ten PDNGs still annotated from the M2012 paper were validated in this reanalysis, compared with 139/331 (42%) and 18/72 (25%) PDNGs reported in the N2013 and the W2017 studies, respectively. The two latter works were based on a phylostratigraphy-only approach, whereas Murphy and McLysaght integrated phylostratigraphy with an analysis of synteny and enabler substitutions (Murphy and McLysaght 2012). The ten annotated PDNGs all showed enabler substitutions (Murphy and McLysaght 2012). In spite of a large number of detected false positives, that is, “de nono” genes, a significant difference remains in the number of validated de novo genes identified in the three studies. This is especially striking in the M2012 and N2013 studies, which focused on mouse-specific genes. Two factors seem to have contributed to the observed discrepancy between these works. First, the two analyses relied on different versions of the mouse Ensembl gene set, v56 and v66. Although both versions have been discontinued, the closest available data sets from v54 and v67 differ significantly in the number of annotated protein sequences (v54: 40,341; v67: 80,007). However, 108/117 validated de novo genes from the N2013 data set were already present in the mouse Ensembl v54 proteome. More importantly, Murphy and McLysaght developed a pipeline incorporating several stringent filtering steps that appear to be absent in the Neme and Tautz work (Murphy and McLysaght 2012; Neme and Tautz 2013). Specifically, Murphy and McLysaght analyzed only mouse PDNGs with orthologous noncoding regions in rat and showed experimental evidence of both transcription and translation. None of these criteria were applied in the Neme and Tautz study. Therefore, the N2013 PDNG set contains genes with weak annotation support and genes lacking synteny data. Thus, the 142 validated de novo genes from the N2013 study represent an upper boundary of the number of potential mouse de novo genes, with the caveat that some of them might also be present in the rat genome but could not be detected in BLAST searches due to high levels of divergence. More than three-fourths of PDNGs from the W2017 study have been reclassified as “de nono” genes in this study. The majority of these genes (43/54) showed significant homology with other vertebrates. Seven of them correspond to human functional orthologs and the pseudogene Snhg11 (fig. 2d; supplementary fig. S2, Supplementary Material online).

Rate of De Novo Gene Formation in Mouse

In their 2013 paper, Neme and Tautz identified 780 mouse-specific PDNGs that, given their apparent absence in rat, would have emerged in the past ∼12 Myr (Kimura et al. 2015). This corresponds to a rate of ∼65 genes/Myr, which is similar to mouse-specific gene duplication estimates of 63 genes/Myr obtained comparing mouse and Brown Norway rat (Gibbs et al. 2004) and 106 genes/Myr assessed using a phylogeny of 10 complete mammalian genomes (Worley et al. 2014) (see Materials and Methods). To recalculate the rate of de novo gene formation in mouse according to my analysis, I first determined the orthology of the 152 validated de novo genes in the rat genome. Only thirteen de novo genes shared similarity to rat proteins (see Materials and Methods). Thus, 139/152 validated de novo genes appear to be mouse-specific. This result implies that the maximum rate of de novo gene formation during mouse lineage evolution corresponds to ∼11.6 genes/Myr, assuming a mouse–rat divergence time of ∼12 Myr. This indicates that de novo genes in mouse emerged at a pace that is at least ∼5.4–9.1 times slower than that of gene duplicates. Notably, a recent study has shown that de novo genes originated at only ∼2.1 genes/Myr in the great apes (Guerzoni and McLysaght 2016). Lower numbers of de novo genes were identified in mouse by one of these authors in the M2012 paper, suggesting that methodological differences might be largely responsible for the discrepancy in the estimates of de novo gene formation between rodents and primates.
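The rate estimates in this section are simple ratios, reproduced in the Python sketch below. All input numbers are taken from the text; the variable names are illustrative. The published 5.4–9.1 range is recovered when the revised rate is first rounded to one decimal place, which is how the sketch computes it.

```python
DIVERGENCE_MYR = 12.0   # mouse-rat divergence time assumed in the text

def rate_per_myr(n_genes: int, myr: float = DIVERGENCE_MYR) -> float:
    """Genes gained per million years along the mouse lineage."""
    return n_genes / myr

original_rate = rate_per_myr(780)   # PDNGs reported by Neme and Tautz (2013)
revised_rate = rate_per_myr(139)    # validated de novo genes without rat homologs

# Mouse-specific duplication rates (genes/Myr) quoted from
# Gibbs et al. (2004) and Worley et al. (2014).
duplication_rates = (63.0, 106.0)
fold_slower = [d / round(revised_rate, 1) for d in duplication_rates]

print(f"original: {original_rate:.1f} genes/Myr")
print(f"revised:  {revised_rate:.1f} genes/Myr")
print("duplication / de novo:", [round(f, 1) for f in fold_slower])
```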

Characteristics of Mouse De Novo Genes

The 152 validated de novo genes can be divided into two groups according to the quality of their annotation. The manually annotated group contained 82 de novo genes, compared with 70 automatically annotated genes (supplementary table S2, Supplementary Material online). I will refer to these two groups as MA and AA, respectively. Gene length and number of exons both increased significantly from the AA group to the MA group and for both of them in comparison to 20,391 other mouse transcripts (supplementary table S3, Supplementary Material online). These features have been observed in previous de novo gene studies, including some of the data re-examined here (Murphy and McLysaght 2012; Neme and Tautz 2013; Guerzoni and McLysaght 2016).

Seventy-one de novo genes overlapped with coding exons (20), 5′UTRs (19), 3′UTRs (9), or introns (27) of older genes (supplementary table S4, Supplementary Material online). In this paper, I will refer to de novo genes that overlap with the coding region of older genes but use any of the five alternative frames as “overprinted,” following the operational definition of overprinting used in virus genomics (Pavesi et al. 2013). Except for the 3′UTR cases, most de novo genes overlapped on the opposite strand of the older gene (supplementary table S4, Supplementary Material online). Given that coding exons occupy only ∼1% of mammalian genomes, the occurrence of 20 de novo genes in overlap with coding exons is especially notable. The emergence of novel ORFs on the complementary strand of preexisting genes is relatively common in viruses but is considered rare among eukaryotes (Chung et al. 2007; Pavesi et al. 2013). In bacteria, long ORFs tend to be present on the opposite strand of genes (Yomo and Urabe 1994). The presence of widespread long ORFs on the opposite strand of mammalian coding exons could thus accelerate the origin of de novo genes.

Overall, de novo genes tend to overlap with other genes almost six times more often than older genes (46.7% vs. 8.4%; P <0.00001, Fisher exact test). This tendency has been documented in rodents and primates (Knowles and McLysaght 2009; Murphy and McLysaght 2012; Xie et al. 2012; Neme and Tautz 2013; Ruiz-Orera et al. 2015; Guerzoni and McLysaght 2016) and implies that the evolution of de novo genes may be facilitated near older genes due to the high density of regulatory motifs, open chromatin and elevated GC content (McLysaght and Hurst 2016). The evolution of de novo genes should be facilitated in genomic regions with elevated GC content because they tend to harbor fewer AT-rich stop codons (Oliver and Marin 1996). Some of these features are also associated with de novo gene formation in intergenic regions (Vakirlis et al. 2017). Long noncoding RNAs (lncRNAs) also appear to represent another source of de novo genes, possibly because they are associated with transcriptionally active regions (Xie et al. 2012; Chen et al. 2015; Ruiz-Orera et al. 2015; Guerzoni and McLysaght 2016; Neme and Tautz 2016).
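The link between GC content and ORF length can be made concrete with a toy model. The Python sketch below assumes, purely for illustration, that bases are drawn independently with equal A and T frequencies and equal G and C frequencies; real genomes depart from this, so the numbers indicate a trend only. Under that assumption, the probability that a random codon is TAA, TAG, or TGA falls as GC content rises, and the mean distance to the first stop codon grows accordingly.

```python
def stop_codon_probability(gc: float) -> float:
    """P(random codon is TAA, TAG or TGA) when bases are drawn independently."""
    at_base = (1.0 - gc) / 2.0   # frequency of A (and of T)
    gc_base = gc / 2.0           # frequency of G (and of C)
    taa = at_base ** 3
    tag = at_base ** 2 * gc_base
    tga = at_base ** 2 * gc_base
    return taa + tag + tga

def expected_codons_to_stop(gc: float) -> float:
    """Mean number of codons read up to and including the first stop (geometric mean)."""
    return 1.0 / stop_codon_probability(gc)

for gc in (0.40, 0.50, 0.60):
    print(f"GC={gc:.2f}  P(stop)={stop_codon_probability(gc):.4f}  "
          f"mean codons to stop={expected_codons_to_stop(gc):.1f}")
```

At 50% GC the model gives the familiar 3/64 chance of a stop codon per position; raising GC to 60% lowers it to about 0.032, roughly doubling the expected open stretch relative to 40% GC.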

Levels of Intrinsic Disorder in De Novo Genes and Older Genes

It has been argued that de novo genes encoding proteins that show low propensity to form aggregates, and thus are less prone to induce cytotoxicity, should be more likely to be fixed (Wilson et al. 2017). Wilson et al. calculated the ISD, a proxy for protein solubility (Monsellier and Chiti 2007; Pallares and Ventura 2016), in all mouse proteins and found that: 1) de novo genes showed the highest level of ISD, which suggested they were preadapted to become novel genes because they encode proteins with low tendency toward aggregation; 2) ISD levels increased throughout mouse gene phylostrata.

Here, I calculated ISD levels for the 152 validated rodent de novo genes and 20,391 older mouse genes using the algorithm implemented in the software PASTA (Walsh et al. 2014). Validated de novo genes showed a significantly higher proportion of ISD regions than older genes, including genes with comparable length (P <0.0001, Mann–Whitney U test; fig. 3a; supplementary table S3, Supplementary Material online). However, this derives from the particularly high levels of disorder in proteins encoded by the 20 overprinted de novo genes. ISD levels are significantly higher in these proteins compared with any other group of de novo or older proteins (all P <0.002, Mann–Whitney U test). On the contrary, I found no significant difference in ISD levels between proteins from short older genes and proteins encoded by either intergenic or overlapping but not overprinted de novo genes (P =0.051 and 0.105, respectively, Mann–Whitney U test). Interestingly, proteins encoded by older, non-de novo overprinted genes show no difference in ISD levels compared with all older genes, possibly because overprinting is significantly older among these genes compared with de novo genes and ISD levels change through time (fig. 3).

Fig. 3. Comparison of intrinsic structural disorder percentage (a) and GC content (b) between de novo genes and older genes. Gray boxes show values between the first and third quartiles. Medians are shown as black lines. Whiskers: minimum and maximum values excluding outliers. OP, overprinting; OL, overlapping with noncoding regions of older genes; INT, intergenic.

Recent works have shown that high disorder levels in orphan and de novo proteins are associated with the elevated GC content of their genes (Basile et al. 2017; Vakirlis et al. 2017). In agreement with these findings, overprinted de novo genes showed significantly higher %GC compared with any other group of genes (all P <0.0004, Mann–Whitney U test; fig. 3b; supplementary table S3, Supplementary Material online), contrary to intergenic or overlapping but not overprinted de novo genes. Furthermore, the GC content and ISD levels were highly correlated in the complete set of analyzed mouse genes (r =0.92, Pearson correlation). The particularly elevated GC content in overprinted de novo genes can be explained by several factors. As already mentioned, the diminished frequency of stop codons in GC-rich regions allows longer ORFs to form. Additionally, the GC content is positively correlated with the transcriptional activity in mammalian cells (Kudla et al. 2006), which could increase the likelihood of proto-genes to be spuriously expressed and eventually evolve into functional genes.

Conclusions

The discovery of de novo genes in eukaryotes has revealed how evolutionary tinkering of noncoding regions can lead to novel protein sequences from scratch. Previous analyses relying uniquely on phylostratigraphic methods suggested that de novo genes are fixed at rates comparable to those of gene duplicates (Carvunis et al. 2012; Neme and Tautz 2013). This conclusion cannot be reconciled with the observed levels of interspecific gene homology and gene loss rates, what I referred to as the “gene turnover paradox.” Here, I used data from three previous studies to show that the majority of PDNGs thus far detected in rodents and still annotated in mouse represent either lineage-specific gene duplicates or rapidly evolving genes shared across mammals. The improved estimate of mouse-specific de novo genes points to a rate of novel gene formation that is several times lower than the gene duplication rate, a possible resolution of the “gene turnover paradox.” Importantly, these results also imply that the known homology detection error in phylostratigraphy is not minimized by focusing on the youngest genes in a given species, as previously suggested (Wilson et al. 2017). However, as shown in this and other studies (Murphy and McLysaght 2012; Vakirlis et al. 2017), false positive rates in de novo gene surveys can be significantly reduced by utilizing a combination of more sensitive homology search approaches and synteny analyses.

In one of the re-examined studies, Wilson et al. (2017) found that putative de novo proteins have the highest levels of ISD, a measure that negatively correlates with protein toxicity, among mouse proteins. This would suggest that de novo genes evolve more frequently from proto-genes that are preadapted because they encode peptides with low levels of toxicity. Mouse de novo proteins validated in my study also show higher ISD levels than older genes; however, I found that this is due to a subset of de novo genes that share high GC content and overlap with coding exons of older genes. In agreement with recent observations (Basile et al. 2017; Vakirlis et al. 2017), this shows that the elevated disorder of mouse de novo proteins represents a mere consequence of the high %GC of some de novo genes, rather than supporting the preadaptation hypothesis.

Supplementary Material

Supplementary Data

Acknowledgments

I am grateful to Aaron Quinlan and Ryan Layer for allowing me to use the term “de nono.” I thank Michelle Lawing for help with statistical analyses and for comments on the manuscript and three anonymous reviewers for insightful comments that greatly improved the manuscript. This work has been supported by the National Institute of Food and Agriculture, U.S. Department of Agriculture, under award number TEX0-1-9599, the Texas A&M AgriLife Research, and the Texas A&M Forest Service.

Literature Cited

Basile W, Sachenkova O, Light S, Elofsson A. 2017. High GC content causes orphan proteins to be intrinsically disordered. PLoS Comput Biol. 13:e1005375.
Begun DJ, Lindfors HA, Kern AD, Jones CD. 2007. Evidence for de novo evolution of testis-expressed genes in the Drosophila yakuba/Drosophila erecta clade. Genetics 176:1131–1137.
Blake JA, et al. 2017. Mouse Genome Database (MGD)-2017: community knowledge resource for the laboratory mouse. Nucleic Acids Res. 45:D723–D729.
Blankenberg D, Taylor J, Nekrutenko A, Galaxy T. 2011. Making whole genome multiple alignments usable for biologists. Bioinformatics 27:2426–2428.
Camacho C, et al. 2009. BLAST+: architecture and applications. BMC Bioinformatics 10:421.
Carvunis AR, et al. 2012. Proto-genes and de novo gene birth. Nature 487:370–374.
Chen JY, et al. 2015. Emergence, retention and selection: a trilogy of origination for functional de novo proteins from ancestral LncRNAs in primates. PLoS Genet. 11:e1005391.
Chung WY, Wadhawan S, Szklarczyk R, Pond SK, Nekrutenko A. 2007. A first look at ARFome: dual-coding genes in mammalian Genomes. PLoS Comput Biol. 3:855–861.
Domazet-Loso T, Brajkovic J, Tautz D. 2007. A phylostratigraphy approach to uncover the genomic history of major adaptations in metazoan lineages. Trends Genet. 23:533–539.
Elhaik E, Sabath N, Graur D. 2006. The “inverse relationship between evolutionary rate and age of mammalian genes” is an artifact of increased genetic distance with rate of evolution and time of divergence. Mol Biol Evol. 23:1–3.
Gibbs RA, et al. 2004. Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 428:493–521.
Guerzoni D, McLysaght A. 2016. De novo genes arise at a slow but steady rate along the primate lineage and have been subject to incomplete lineage sorting. Genome Biol Evol. 8:1222–1232.
Han MV, Thomas GW, Lugo-Martinez J, Hahn MW. 2013. Estimating gene gain and loss rates in the presence of error in genome assembly and annotation using CAFE 3. Mol Biol Evol. 30:1987–1997.
Harrow J, et al. 2012. GENCODE: the reference human genome annotation for the ENCODE Project. Genome Res. 22:1760–1774.
Heinen TJ, Staubach F, Haming D, Tautz D. 2009. Emergence of a new gene from an intergenic region. Curr Biol. 19:1527–1531.
Khalturin K, Hemmrich G, Fraune S, Augustin R, Bosch TCG. 2009. More than just orphans: are taxonomically-restricted genes important in evolution? Trends Genet. 25:404–413.
Kimura Y, Hawkins MT, McDonough MM, Jacobs LL, Flynn LJ. 2015. Corrected placement of Mus-Rattus fossil calibration forces precision in the molecular tree of rodents. Sci Rep. 5:14444.
Knowles DG, McLysaght A. 2009. Recent de novo origin of human protein-coding genes. Genome Res. 19:1752–1759.
Kudla G, Lipinski L, Caffin F, Helwak A, Zylicz M. 2006. High guanine and cytosine content increases mRNA levels in mammalian cells. PLoS Biol. 4:e180.
Li CY, et al. 2010. A human-specific de novo protein-coding gene associated with human brain functions. PLoS Comput Biol. 6:e1000734.
Lu TC, Leu JY, Lin WC. 2017. A comprehensive analysis of transcript-supported de novo genes in Saccharomyces sensu stricto Yeasts. Mol Biol Evol.
Marchler-Bauer A, et al. 2017. CDD/SPARCLE: functional classification of proteins via subfamily domain architectures. Nucleic Acids Res. 45:D200–D203.
McLysaght A, Guerzoni D. 2015. New genes from non-coding sequence: the role of de novo protein-coding genes in eukaryotic evolutionary innovation. Philos Trans R Soc Lond B Biol Sci. 370:20140332.
McLysaght A, Hurst LD. 2016. Open questions in the study of de novo genes: what, how and why. Nat Rev Genet. 17:567–578.
Monsellier E, Chiti F. 2007. Prevention of amyloid-like aggregation as a driving force of protein evolution. EMBO Rep. 8:737–742.
Moyers BA, Zhang J. 2015. Phylostratigraphic bias creates spurious patterns of genome evolution. Mol Biol Evol. 32:258–267.
Moyers BA, Zhang J. 2016. Evaluating phylostratigraphic evidence for widespread de novo gene birth in genome evolution. Mol Biol Evol. 33:1245–1256.
Moyers BA, Zhang JZ. 2017. Further simulations and analyses demonstrate open problems of phylostratigraphy. Genome Biol Evol. 9:1519–1527.
Murphy DN, McLysaght A. 2012. De novo origin of protein-coding genes in murine rodents. PLoS One 7:e48650.
Neme R, Tautz D. 2013. Phylogenetic patterns of emergence of new genes support a model of frequent de novo evolution. BMC Genomics 14:117.
Neme R, Tautz D. 2016. Fast turnover of genome transcription across evolutionary time exposes entire non-coding DNA to de novo gene emergence. Elife 5:e09977.
O'Leary NA, et al. 2016. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 44:D733–D745.
O'Toole AN, Hurst LD, McLysaght A. 2018. Faster evolving primate genes are more likely to duplicate. Mol Biol Evol. 35:107–118.
Oliver JL, Marin A. 1996. A relationship between GC content and coding-sequence length. J Mol Evol. 43:216–223.
Pallares I, Ventura S. 2016. Understanding and predicting protein misfolding and aggregation: insights from proteomics. Proteomics 16:2570–2581.
Pavesi A, Magiorkinis G, Karlin DG. 2013. Viral proteins originated de novo by overprinting can be identified by codon usage: application to the “gene nursery” of Deltaretroviruses. PLoS Comput Biol. 9:e1003162.
Reinhardt JA, et al. 2013. De novo ORFs in Drosophila are important to organismal fitness and evolved rapidly from previously non-coding sequences. PLoS Genet. 9:e1003860.
Ruiz-Orera J, et al. 2015. Origins of de novo genes in human and chimpanzee. PLoS Genet. 11:e1005721.
Saripella GV, Sonnhammer EL, Forslund K. 2016. Benchmarking the next generation of homology inference tools. Bioinformatics 32:2636–2641.
Schmitz JF, Ullrich KK, Bornberg-Bauer E. 2018. Incipient de novo genes can evolve from frozen accidents that escaped rapid transcript turnover. Nat Ecol Evol. 2:1626–1632.
Schwartz S, et al. 2003. Human-mouse alignments with BLASTZ. Genome Res. 13:103–107.
Smith SA, Pease JB. 2017. Heterogeneous molecular processes among the causes of how sequence similarity scores can fail to recapitulate phylogeny. Brief Bioinform. 18:451–457.
Tautz D, Domazet-Loso T. 2011. The evolutionary origin of orphan genes. Nat Rev Genet. 12:692–702.
Vakirlis N, et al. 2016. Reconstruction of ancestral chromosome architecture and gene repertoire reveals principles of genome evolution in a model yeast genus. Genome Res. 26:918–932.
Vakirlis NN, et al. 2017. A molecular portrait of de novo genes in yeasts. Mol Biol Evol. 35:631–645.
Walsh I, Seno F, Tosatto SC, Trovato A. 2014. PASTA 2.0: an improved server for protein aggregation prediction. Nucleic Acids Res. 42:W301–W307.
Wilson BA, Foy SG, Neme R, Masel J. 2017. Young genes are highly disordered as predicted by the preadaptation hypothesis of de novo gene birth. Nat Ecol Evol. 1:0146.
Worley KC, et al. 2014. The common marmoset genome provides insight into primate biology and evolution. Nat Genet. 46:850–857.
Xie C, et al. 2012. Hominoid-specific de novo protein-coding genes originating from long non-coding RNAs. PLoS Genet. 8:e1002942.
Xu J, Zhang J. 2016. Are human translated pseudogenes functional? Mol Biol Evol. 33:755–760.
Yomo T, Urabe I. 1994. A frame-specific symmetry of complementary strands of DNA suggests the existence of genes on the antisense strand. J Mol Evol. 38:113–120.
Zhao L, Saelao P, Jones CD, Begun DJ. 2014. Origin and spread of de novo genes in Drosophila melanogaster populations. Science 343:769–772.
