Abstract
A novel mechanism of de novo gene origination from nongenic sequences was first proposed in the early 2000s. Subsequent studies have since provided evidence of de novo gene emergence across all domains of life, revealing its occurrence to be more frequent than initially anticipated. While studies mainly agree on the general concept of de novo emergence from nongenic DNA, the exact methods and definitions for detecting de novo genes differ significantly. Here, we provide a comprehensive step-by-step description of the most commonly used methods for de novo gene detection. In addition, we address the limitations of nomenclature and detection methods and clarify some complex concepts that are sometimes misused. This review is accompanied by the publication of a de novo gene annotation format to standardize the reporting of methodology, enable reproducibility and improve the comparability of datasets.
Keywords: de novo gene emergence, annotation format, comparative genomics, comparative transcriptomics
Significance.
While it is now widely accepted that new genes can emerge from previously noncoding DNA, researchers still lack consistent methods and definitions for identifying these “de novo” genes. This review lays out the most common techniques for detecting de novo genes, highlights their differences and limitations, and introduces a standard annotation format to make future studies more comparable. By establishing clearer methodological guidelines, this work helps unify a rapidly growing field and improves our ability to study how entirely new genes arise.
Introduction
Throughout evolution, genes can arise by “recycling the old”, emerging from pre-existing genetic material through mechanisms (Fig. 1) such as duplication (Ohno 2013), exon shuffling (Gilbert 1978), horizontal gene transfer (Freeman 1951), retrotransposition (Baltimore 1970; Temin et al. 1970; Coffin and Fan 2016) and gene fusion (Nowell and Hungerford 1960; Mitelman et al. 2007).
Fig. 1.
Main mechanisms of gene emergence. The figure is a simplified representation of the concepts discussed.
However, it is now well documented that new genes can also emerge de novo, through a series of mutations in the noncoding genome (Begun et al. 2006, 2007; Heinen et al. 2009; Rancurel et al. 2009; Toll-Riera et al. 2009; Tautz and Domazet-Lošo 2011; Neme and Tautz 2013; Zhao et al. 2014; Xia et al. 2025). Several mechanisms of de novo gene emergence have been identified (Fig. 1):
Overprinting. Overprinting refers to the emergence of an alternative ORF in the +1 or +2 frame on the same strand as an existing gene, leading to the translation of a protein entirely different from the canonical protein (Keese and Gibbs 1992; Rogozin et al. 2002; Pavesi 2006; Delaye et al. 2008; Carter et al. 2013). A new ORF can also emerge upstream of the ancestral open reading frame (ORF), for example in the 5’ UTR (Renz et al. 2020), or downstream of the ancestral ORF, in the 3’ UTR (Wu et al. 2020).
Exonization. Exonization refers to mutations inside a gene that lead to the gain of a new exon, and potentially to a new ORF (Makałowski et al. 1994; Sorek 2007; Cai et al. 2008; Schmitz and Brosius 2011). It can occur, for example, through the loss of a splice site, which converts an original intron into an exon (Roy 2004; Moldón and Query 2010; Koralewski and Krutovsky 2011).
Antisense. De novo genes can also emerge overlapping existing genes, but on the opposite strand (Merino et al. 1994; Barman et al. 2019; Iyengar et al. 2024). Transcribed antisense ORFs that emerged de novo have been reported to generate functional proteins (Ardern et al. 2020; Thomas et al. 2023) or regulate translation efficiency (Liang et al. 2016).
From scratch in intergenic region. To emerge in an intergenic region, the future gene needs to acquire everything: transcription, an ORF, the ability to be translated, a certain stability in the untranslated regions (UTRs), possibly introns, etc. Nevertheless, such challenging mechanisms of new gene emergence have been widely reported (McLysaght and Guerzoni 2015; Schlötterer 2015; Heames et al. 2020; Papadopoulos et al. 2021; Iyengar and Bornberg-Bauer 2023; Lombardo et al. 2023).
De novo gene birth driven by Transposable Elements (TEs). Accumulating evidence suggests that de novo gene birth can also be linked to the insertion of a TE into an intergenic sequence (Cordaux et al. 2006; Jin et al. 2021; Delihas 2022; Lee et al. 2022; Lebherz et al. 2024a; Uz-Zaman et al. 2024).
Despite their different origins, these mechanisms share a common feature: the de novo gene or its encoded protein lacks detectable similarity to any other known gene or protein (McLysaght and Hurst 2016). However, other mechanisms, such as duplication followed by divergence, can also generate genes without detectable similarity (Tautz and Domazet-Lošo 2011).
Therefore, one of the major challenges in de novo gene research is to accurately determine whether a gene truly emerged de novo or has arisen through other mechanisms (Tautz and Domazet-Lošo 2011; Casola 2018). For example, after a duplication event, the duplicated gene copy can evolve rapidly and its sequence can undergo significant rearrangement (Innan and Kondrashov 2010), so that it may be misidentified as originating de novo. The work of Casola (2018) shed light on inaccuracies in the validation of de novo gene emergence, and was followed by significant advances in the precision of detection and the design of pipelines for confirming de novo origins. As methods for de novo gene detection and validation have become more sophisticated, proper annotation of the methodology has become essential (Moyers and Zhang 2016, 2017; Weisman et al. 2022).
In the field of de novo gene research, the mechanisms and definitions of de novo emergence remain a pivotal yet variable factor in identifying such genes. Across studies, authors have incorporated diverse evolutionary stages and criteria (Keeling et al. 2019; Weisman 2022), such as varying thresholds for how much of a gene must have originated de novo (McLysaght and Hurst 2016), and differing standards to establish the absence of homology (Casola 2018; Vakirlis et al. 2020; Weisman et al. 2022). Although this conceptual diversity has enriched the field, it has also introduced ambiguities that challenge the consistency and comparability of results (Schmitz et al. 2018; Dohmen et al. 2025).
This diversity arises from several factors. First, the field of de novo gene birth is still relatively new, which has led researchers to explore a variety of methods to investigate different mechanisms and address distinct questions. For example, studying the emergence of de novo transcription requires different approaches than examining the early fixation of genes across multiple taxa. Additionally, the data used vary between studies, depending on the species of interest. For instance, nonmodel organism genomes are less well-characterized and annotated than those of humans (Vakirlis and Kupczok 2024), and they typically lack extensive transcriptomic data. At this stage, maintaining an openness to exploring various methodologies remains crucial, but addressing these semantic and conceptual divergences is equally important to advance the field and improve the integration of findings across studies.
In this review, we outline the key steps that currently allow for accurate discrimination between de novo genes and genes arising from other mechanisms. We also highlight the main methodological differences between studies and address the challenges and controversies that remain with current approaches. In this article, we define a de novo gene as one that emerges from a previously noncoding region of the genome through mutations. Notably, the detection methods we describe assume the presence of a transcribed ORF, although the definition of a gene does not always require it to be protein-coding (Orgogozo et al. 2016; Li and Liu 2019). Any deviations from this definition in specific cases are explicitly noted in the text.
The main goal of this review is to provide a comprehensive overview of current methodologies and their strengths and weaknesses, to allow for informed decisions about the most suitable approach for a given research question in the field of de novo gene research. It is not meant to recommend one method or tool over another, but rather to help readers identify the best approach for a given context and input data. As a consequence of the differences in methods and approaches identified here, and as guidance for choices in de novo gene identification pipelines, we have developed an annotation format that standardizes the reporting of the methodology used and allows for easy comparison between datasets (Dohmen et al. 2025).
Tools and Techniques in the Computational Detection of De Novo Genes
Choice of Candidate Genes
The initial step in the identification of de novo genes or proto-genes is the selection of candidate genes from a given species, population or individual. Unless a subset of genes has already been identified as candidate de novo genes, the entire genome or transcriptome is usually screened to distinguish de novo genes from others. Two distinct approaches are commonly employed in the identification of de novo genes: the first involves the assessment of annotated genes within an annotated genome, while the second entails the evaluation of ORFs extracted from a transcriptome, sometimes accompanied by the validation of translation.
Candidate Genes from Annotated Genomes
The identification of potential de novo genes in an annotated genome consists of determining which annotated genes have potentially emerged de novo in a specified taxonomic group. Therefore, following the annotation, all identified genes are considered candidates for a de novo origin analysis. Annotated genomes can be obtained from public databases, such as NCBI (Schoch et al. 2020). Alternatively, a genome can be assembled from DNA-seq data and then annotated with a gene annotation pipeline. In the specific context of de novo gene detection, a combination of homology-based approaches (Söding 2005; Eddy 2009) with ab initio approaches (Wang et al. 2004; Scalzitti et al. 2020; Baker et al. 2023) is encouraged, given that the latter rely on algorithms that recognize various genic properties within a genome even without gene homology (Fig. 2a, Table 1).
Fig. 2.
Considerations for general approaches and standards in de novo gene research. Related literature can be found in Table 1.
Table 1.
Considerations and related literature for general approaches and standards in de novo gene research.
Candidate Genes from Transcriptomes
Another option for the detection of candidate de novo genes is to analyze transcripts from one or multiple transcriptomes. Starting from a transcriptome implies access to an annotated genome. However, in this case, annotated genes are regarded as canonical genes rather than putative de novo genes. Instead, the transcriptome is screened for transcribed ORFs that do not overlap with annotated genes. These unannotated transcribed ORFs are then considered as putative de novo genes. The scenario in which no reference genome is available is also discussed below, but it requires particular caution. This approach involves more initial steps described below, but it likely allows for the detection of de novo genes in their early stages of emergence, such as proto-genes (Carvunis et al. 2012) or de novo ORFs (Grandchamp et al. 2023b). The steps described in the following assume that the transcriptome has already been assembled based on a reference genome, using reference-based algorithms (Kovaka et al. 2019; Raghavan et al. 2022). If a transcriptome has been assembled de novo, the primary deviation from the described method resides in the identification of genomic locations of the ORFs. If no reference genome exists for the query species, the genomic location of ORFs may lack precision, which may lead to misleading interpretations in subsequent steps of the pipeline. We encourage authors pursuing such an approach to draw conclusions carefully. Finally, another possibility would be to search for ORFs in a de novo assembled transcriptome and directly look for homologous sequences in target databases, without relying on genomic information. To our knowledge, no study has implemented such a pipeline, although it may be the only feasible option when genome data is extremely limited. However, using the term de novo genes to describe genes identified through such a pipeline is misleading. 
Instead, we suggest using more cautious terminology—such as specifying their putative status—and clearly stating the limitations of the approach.
Selection of transcripts based on genomic location
De novo genes can be located in various genomic regions, including intergenic spaces, introns, overlapping existing genes in a different frame or antisense orientation, within UTRs, or other nongenic locations. Depending on the investigated de novo emergence mechanism(s), certain transcripts (or ORFs) may be excluded from the analysis. Tools such as BEDTools (Quinlan and Hall 2010) can be used to determine the genomic overlap of the transcripts and to choose which transcripts are retained as candidates for further analyses. This step can also be conducted using ORFs instead of transcripts, after ORF detection in transcripts. The appropriate use of BEDTools depends on the type of RNA-seq library. If the library is strand-specific, BEDTools should be used with strand-aware operations. If the library is unstranded, strand-aware operations are not applicable, and it is not possible to determine the strand on which an ORF is located, which can significantly affect the conclusions. For example, an ORF overlapping an existing gene would be considered antisense if it originates from the opposite strand, whereas it would be classified as an overprinted gene if it arises on the same strand.
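The strand-dependent classification described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a replacement for BEDTools: the `Interval` type, the function names and the category labels are hypothetical, coordinates are assumed to follow the 0-based, half-open BED convention, and a same-strand overlap is only flagged, since deciding whether it represents overprinting would additionally require a comparison of reading frames.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """Hypothetical container for a BED-like feature (0-based, half-open)."""
    chrom: str
    start: int
    end: int
    strand: str  # "+", "-", or "." when the strand is unknown


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two features share at least one base on the same sequence."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def classify_candidate(candidate: Interval, genes: list, stranded: bool = True) -> str:
    """Classify a candidate transcript or ORF relative to annotated genes.

    The labels are illustrative only (assumption, not a published standard).
    """
    hits = [g for g in genes if overlaps(candidate, g)]
    if not hits:
        return "intergenic"
    # Without strand information, sense and antisense cannot be told apart.
    if not stranded or candidate.strand == ".":
        return "genic_overlap_strand_unknown"
    if any(g.strand == candidate.strand for g in hits):
        return "same_strand_overlap"
    return "antisense"
```

With an unstranded library (`stranded=False`), every overlapping candidate ends up in the ambiguous category, which mirrors the limitation discussed in the text.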
Detection of ORFs in a transcriptome
After filtering transcripts based on their genomic location, the selected spliced transcripts are scanned for ORFs. Various software tools are available for extracting ORFs from a transcriptome, with one notable example being EMBOSS getorf (Rice et al. 2000). This tool conveniently provides information on the position of the ORF in the spliced transcript and its direction (forward or reverse). However, it also retrieves ORFs that extend to the end of a transcript without ending with a stop codon; these may be considered erroneous and should be removed.
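To make the ORF scan concrete, the following Python sketch extracts ATG-to-stop ORFs from the three forward frames of a spliced transcript and drops any ORF that runs to the end of the transcript without a stop codon. It is a simplified illustration, not a reimplementation of getorf: the function name, the `min_len` parameter and the restriction to the forward strand and to ATG start codons are assumptions made for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_len: int = 30) -> list:
    """Return (start, end, frame) tuples for forward-strand ORFs.

    start is 0-based inclusive, end is exclusive and includes the stop codon.
    Only ORFs that begin with ATG and end with a stop codon are reported.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
        # An ORF still open at this point has no stop codon and is discarded.
    return orfs
```

For a stranded library, scanning only the forward strand of each transcript is usually sufficient; for unstranded data, the same function could be applied to the reverse complement as well.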
In order to extract the ORFs relevant to a given biological question, a number of steps must be followed:
If the RNA-seq library is stranded, ORFs detected in the antisense direction of a transcript may be erroneous and should be regarded with caution.
Multiple transcripts may correspond to the spliced product of a single gene, and some might overlap (Lebherz et al. 2024b). In such cases, removing duplicated ORFs shared among transcripts spliced from the same genomic location may be necessary.
The majority of transcripts contain multiple ORFs. The choice of the ORF(s) within a transcript depends on the biological question, and various choices are valid (Xu et al. 2010).
Choice of Coding ORFs
When starting from a transcriptome with transcripts containing several ORFs, the selection of which ORFs to keep for further steps is decided by the investigator. Until recently, ORFs were typically considered potentially coding only if their size exceeded 300 nucleotides, a criterion implemented in algorithms such as those used by the Functional ANnoTation Of the Mammalian Genome (FANTOM) (Dinger et al. 2008; Leong et al. 2022). However, micropeptides and short de novo genes are known to have coding potential (Patraquim et al. 2022; Vakirlis et al. 2022; Sandmann et al. 2023), and de novo genes have been shown to be shorter than established genes (Guo et al. 2007; Toll-Riera et al. 2009; Palmieri et al. 2014). Various software tools have been developed to determine which ORF should be considered as the coding one in canonical genes, using approaches primarily based on protein homology (Vitting-Seerup et al. 2014; Kang et al. 2017; Varabyou et al. 2023). Nevertheless, even for canonical genes, the definition and number of coding ORFs are under revision, as the coding potential of genes has been shown to be significantly underestimated (Wright et al. 2022; Ardern 2023).
In transcripts, all ORFs within a size limit can be considered. The majority of studies opt for the longest ORF (Xu et al. 2010; Dowling et al. 2020), which is also the default option for annotating protein-coding regions in most software (Rombel et al. 2002; Wang et al. 2013). Some studies only consider the first upstream ORF (uORF) (Whiffin et al. 2020). Other studies consider the ORFs with the highest Kozak score (Kozak 1989; Xu et al. 2010), indicating the highest likelihood of translation, or ORFs including surrounding untranslated regions (UTRs), since UTRs play crucial roles in translation initiation and transcript stability (Chatterjee and Pal 2009; Matoulkova et al. 2012).
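As a small illustration of how these choices change the candidate set, the sketch below applies three of the strategies mentioned above (all ORFs within a size limit, the longest ORF, or the most upstream ORF) to the ORFs of a single transcript. The function name, the strategy labels and the size parameters are hypothetical, and Kozak scoring is deliberately left out because it depends on a scoring scheme that is not specified here.

```python
def select_orfs(orfs, strategy="longest", min_len=0, max_len=None):
    """Select ORFs from one transcript.

    orfs: list of (start, end) transcript coordinates, 0-based, end exclusive.
    strategy: "all", "longest", or "first" (most upstream start); these labels
    are illustrative assumptions, not an established vocabulary.
    """
    def length(orf):
        return orf[1] - orf[0]

    # Apply the size limits before choosing among the remaining ORFs.
    kept = [o for o in orfs
            if length(o) >= min_len and (max_len is None or length(o) <= max_len)]
    if not kept:
        return []
    if strategy == "all":
        return sorted(kept)
    if strategy == "longest":
        # Ties are broken in favour of the most upstream ORF.
        return [max(kept, key=lambda o: (length(o), -o[0]))]
    if strategy == "first":
        return [min(kept, key=lambda o: o[0])]
    raise ValueError(f"unknown strategy: {strategy}")
```

Running the same transcript set through each strategy is a quick way to see how sensitive the downstream candidate list is to this decision.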
Importantly, the detection of ORFs with coding potential does not guarantee a translation event. Several studies have reported only a weak correlation between transcript expression levels and protein abundance (Gry et al. 2009; Koussounadis et al. 2015; Liu et al. 2016). This emphasizes that a transcribed ORF is strongly dependent on posttranscriptional and translational regulatory mechanisms for translation, which is difficult to predict without experimental evidence.
Selection of an expression threshold
Most studies include only the ORFs from transcripts that reach a minimum level of expression, which is typically defined by a transcripts per million (TPM) threshold. A threshold of 0.5 TPM, specified by EMBL (Stoesser et al. 2002) as the minimal expression threshold, has been adopted by numerous studies (Petryszak et al. 2016; Poretti et al. 2023; Vara et al. 2024). When assembling transcriptomes, lowly expressed transcripts are often removed from the process as they are suspected to represent background noise (Janssen et al. 2023). However, the emergence of lowly expressed transcripts could be a step towards de novo gene emergence, and such transcripts might be important to study. The hypothesis that transcripts are produced throughout the entire genome of a species is referred to as pervasive transcription (Clark et al. 2011; Hangauer et al. 2013; Kellis et al. 2014). In cases involving splicing, it is crucial to be cautious when employing a TPM threshold: a gene may express multiple transcripts, of which one meets the specified threshold while the others do not.
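The following Python sketch shows how TPM values are derived from read counts and how a threshold interacts with multi-transcript genes. It is a simplified illustration: the function names are hypothetical, plain transcript lengths are used where quantification tools typically use effective lengths, and the rule of retaining a gene when at least one of its transcripts passes the threshold is one possible choice, not a recommendation from the text.

```python
def tpm(counts: dict, lengths: dict) -> dict:
    """Compute transcripts per million from raw counts and transcript lengths.

    TPM_i = (count_i / length_i) / sum_j(count_j / length_j) * 1e6
    """
    rates = {tx: counts[tx] / lengths[tx] for tx in counts}
    total = sum(rates.values())
    if total == 0:
        return {tx: 0.0 for tx in counts}
    return {tx: rate / total * 1e6 for tx, rate in rates.items()}


def genes_passing(tpm_by_tx: dict, tx_to_gene: dict, threshold: float = 0.5) -> set:
    """Return genes with at least one transcript at or above the threshold.

    Assumption: a single passing isoform is enough to retain the gene.
    """
    return {tx_to_gene[tx] for tx, value in tpm_by_tx.items() if value >= threshold}
```

Filtering at the transcript level instead would discard the low-expressed isoforms of a retained gene, which is the situation the caution about splicing refers to.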
Detection of genomic positions of unspliced transcripts and ORFs
In order to account for splicing events and the subsequent methodological steps, the genomic position of the selected ORFs must be detected. The software BLAT (Kent 2002) is splicing-aware and can be used to map ORFs from a transcriptome to the corresponding genome. However, BLAT has difficulties dealing with short sequences, as de novo ORFs often are. Instead of aligning intact ORFs, BLAT overpredicts splicing events by splitting up ORFs to align them to multiple locations in the genome. Other splice-aware software, such as Exonerate (Slater and Birney 2005), can also be used, although they share similar limitations. The most precise method for retrieving the genomic location of an ORF is to extract the coordinates from the transcript it originates from. This accurate approach is only feasible if the transcriptome is assembled using reference-based algorithms. While the genomic coordinates of transcripts are provided by the assembly software, the precise genomic positions of the ORFs within them must be calculated based on the transcript’s genomic location, its exon-intron structure and genomic strand orientation (forward or reverse). To our knowledge, such a step cannot be fulfilled by existing software and requires custom scripts.
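A custom script for this coordinate conversion could look like the sketch below. It assumes that the exon coordinates of the transcript are available from the reference-based assembly (for example from its GTF output), that all coordinates are 0-based and half-open, and that the ORF position is given along the spliced transcript in its 5' to 3' direction. The function name and argument layout are hypothetical.

```python
def orf_to_genomic(exons, strand, orf_start, orf_end):
    """Map an ORF from spliced-transcript coordinates to genomic blocks.

    exons: list of (start, end) genomic coordinates, 0-based, end exclusive.
    strand: "+" or "-" (genomic strand of the transcript).
    orf_start, orf_end: ORF position on the spliced transcript, 5' to 3'.
    Returns genomic (start, end) blocks sorted by ascending coordinate.
    """
    total = sum(end - start for start, end in exons)
    if not 0 <= orf_start < orf_end <= total:
        raise ValueError("ORF lies outside the spliced transcript")

    # Walk the exons in transcription order: ascending on "+", descending on "-".
    ordered = sorted(exons, reverse=(strand == "-"))
    blocks = []
    offset = 0  # transcript coordinate of the first base of the current exon
    for ex_start, ex_end in ordered:
        ex_len = ex_end - ex_start
        lo = max(orf_start, offset)
        hi = min(orf_end, offset + ex_len)
        if lo < hi:  # the ORF covers part of this exon
            if strand == "+":
                blocks.append((ex_start + lo - offset, ex_start + hi - offset))
            else:
                blocks.append((ex_end - (hi - offset), ex_end - (lo - offset)))
        offset += ex_len
    return sorted(blocks)
```

An ORF that spans an exon junction yields more than one block, so the intron between the blocks is preserved for the later synteny and homology steps.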
After all these steps, all filtered ORFs and/or transcripts can be considered as candidate de novo genes and will be used for the next filtering steps.
Validation of Translation
To assess whether the selected candidate genes are coding genes, one option is to use experimental validation (Fig. 2e, Table 1). Experimental validation of a gene’s coding status can be performed at the very end of the methodology, when only a subset of genes has been validated as de novo genes. However, when starting from a transcriptome, validating translation can be the very first step of the method. In such cases, all translated ORFs detected experimentally are mapped to the corresponding transcriptome (Wacholder et al. 2023) and subsequently sorted through several steps similar to those used in transcriptome analysis (Turcan et al. 2024).
To confirm the coding status of putative de novo genes, several new laboratory techniques have proven to be highly effective, particularly for small proteins. Ribosome profiling-based approaches (Ribo-Seq) (Ingolia et al. 2009; Kondo et al. 2010; Ingolia et al. 2011; Bazzini et al. 2014; Chen et al. 2020; Duffy et al. 2022) and mass spectrometry-based approaches (Slavoff et al. 2013; Pauli et al. 2014; Ji et al. 2015) assess the binding of ribosomes to transcribed ORFs or the presence of translated proteins. These two approaches can also be combined for better accuracy (Schlesinger and Elsässer 2022; Wacholder and Carvunis 2023; Andjus et al. 2024).
The search for population genetic signatures can also provide evidence of coding potential, particularly if the ORF is under selection. Tests such as the McDonald-Kreitman test (McDonald and Kreitman 1991), which compares the ratio of nonsynonymous to synonymous polymorphisms within a species (pN/pS) with the corresponding ratio of fixed differences between species, can help determine whether an ORF is subject to selective pressure. However, these tests are often challenging to apply in the context of de novo gene emergence, as they require that the sequences across the population, and ideally in an outgroup species, are coding. This condition is rarely met during the early stages of de novo gene evolution.
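The quantities that enter a McDonald-Kreitman comparison can be summarised with two standard statistics, the neutrality index and the estimated proportion of adaptive substitutions (alpha). The sketch below computes both from the four counts of the 2x2 table. The function name and the example counts are hypothetical, and the significance test usually applied to the table (for example Fisher's exact test) is not included.

```python
def mk_summary(dn: int, ds: int, pn: int, ps: int) -> dict:
    """Summarise a McDonald-Kreitman table.

    dn, ds: nonsynonymous and synonymous fixed differences between species.
    pn, ps: nonsynonymous and synonymous polymorphisms within the species.
    Neutrality index NI = (pn / ps) / (dn / ds); alpha = 1 - NI.
    """
    if min(dn, ds, ps) <= 0:
        # The ratios are undefined when one of the denominators is zero.
        raise ValueError("dn, ds and ps must be positive")
    neutrality_index = (pn / ps) / (dn / ds)
    return {"NI": neutrality_index, "alpha": 1.0 - neutrality_index}
```

For instance, with the made-up counts `dn=8, ds=16, pn=2, ps=40`, the neutrality index is 0.1 and alpha is 0.9, a pattern compatible with an excess of nonsynonymous fixed differences. For young ORFs, the counts are often too small for such estimates to be informative, in line with the limitation noted above.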
Genomes or Transcriptomes?
The choice between candidate de novo genes from annotated genomes or transcriptomes depends on the biological question being investigated. Candidate genes from an annotated genome provide a high level of confidence about the genic status of the identified de novo genes at the end of the pipeline. Evolutionary fixation in a species is more likely for these genes, as their genic structures are apparently stable enough to be recognized by annotation methods. Nevertheless, de novo genes that lack gene homology or genic structures, such as introns or specific transcription motifs, may not be detected by annotation tools.
Selecting candidate genes from a transcriptome generally results in the identification of a considerably higher number of de novo genes compared to candidate genes from an annotated genome. For example, in Roginski et al. (2024), the authors detected 89 de novo genes in humans when starting from a genome, while Dowling et al. (2020) identified 2,749 human-specific de novo expressed ORFs when starting from a transcriptome. Similarly, Roginski et al. (2024) detected 92 de novo genes in Drosophila melanogaster by analyzing an annotated genome, while Zheng and Zhao (2022) identified 993 de novo genes in the same species using Ribo-seq data mapped to a transcriptome. However, depending on the specific transcriptome and the applied criteria, it is possible that the majority of the detected translated ORFs may not be fixed in the species (Roginski et al. 2024). An automated pipeline to detect very early stages of potential de novo genes based on transcriptomic data verifies this higher number of candidates (Grandchamp et al. 2025) when compared with annotated genome-based approaches.
The genic status of de novo candidates can be confirmed through the validation of translation as described above and subsequently only considering the translated ORFs. When starting from a transcriptome, one important issue can come from the fact that transcript expression is complicated to characterise, as expression can depend on conditions, tissues, sex, life stage, individuals or populations, among others (Oliva et al. 2020; Nieuwenhuis et al. 2021; Xu et al. 2023; Schneider et al. 2024). Consequently, particular de novo genes can be specific to certain conditions or tissues (Fig. 2a, Table 1). The detection of such genes can be more challenging, particularly when their expression levels are low.
Taxonomic Group of Emergence
A de novo gene or expressed ORF may be specific to an individual, a population, a species, or a broader taxonomic group. When starting from a transcriptome, it may also be expressed only under specific conditions, such as in a specific tissue, age or sex. The taxonomic level of emergence can but does not have to be specified in advance, ensuring that only de novo genes meeting a particular condition are retained. If a gene is not specific to a single species, population, or condition, but instead is shared among several closely related species, it is referred to as a taxonomically restricted gene. The distinction between de novo genes and other genes becomes more challenging when they are shared by several rather than one single species, particularly if they have an evolutionary origin predating a loss of synteny within the taxa to which they belong, and if they exhibit a high mutation rate, although this is likely not frequent (Domazet-Lošo et al. 2017). The more distantly related the species in the taxonomic group are, the more information is lost about de novo gene emergence or their mechanism of emergence in general. De novo gene birth is easier to identify in taxonomic groups including species that diverged recently, provided that the considered evolutionary time is sufficient to characterize the genicity of the sequences. A large number of studies focus on species-specific de novo genes (Zhao et al. 2014; Schmitz et al. 2018; Zhang et al. 2019; Broeils et al. 2023; Grandchamp et al. 2023a, 2023b; Lebherz et al. 2024b; Vara et al. 2024). Alternatively, there is the possibility of detecting the earliest stage of a gene emergence by studying the emergence of a de novo transcribed ORF in individuals or populations. In such a case, the search for homology is conducted against outgroup species, but also against outgroup populations/individuals from the same species, if such data is available (Grandchamp et al. 2023b).
Homology Filter
One major criterion for identifying a recent de novo gene is the lack of homology to any other coding genes outside and inside of the expected phylogenetic group/species/population of emergence. However, we emphasize that the simple absence of homology is not enough to conclusively validate a de novo origin. The homology search has to be performed for the full dataset of candidate genes from the previous steps. All candidates that show significant homology can then be discarded from the list of potential de novo genes.
Each de novo gene is required to show no similarity to any gene outside or within the species or taxonomic group of interest, as such similarity would suggest that the candidate gene emerged via a recycling mechanism, such as duplication. Including a greater number of outgroup species in the analysis leads to more robust results.
Protein sequences as the default option
The most widely employed method for identifying homologs is to use protein sequence similarity for the purpose of database searches. Such searches may encompass proteins from a broad range of species. Distant outgroup species should also be included to rule out horizontal gene transfer and distant homologies. Large databases containing sequence data from all domains of life, such as the NCBI Reference Sequence Database (Pruitt et al. 2005), can be searched to include as many species and taxonomic groups as possible. Newly assembled genomes and corresponding proteomes that have not been incorporated into public databases can also be beneficial to search when studying a specific taxon (Fig. 2b).
With transcriptome-based analysis, it is often assumed that de novo candidates are not annotated in the reference genome. Consequently, annotation software might fail to identify homologous genes in outgroup genomes, leading to incomplete outgroup proteomes. In such cases, validation may rely on the subsequent identification of syntenic homologs that lack coding properties (e.g. ORFs) or show major frameshifts, to confirm the absence of a possible homologous encoded protein. Alternatively, Vakirlis and McLysaght (2018) propose performing similarity searches of six-frame translations of entire outgroup genomes. This method discards any putative coding homologs in outgroup genomes, including bona fide noncoding homologs that lack stop codons, frameshifts and transcription. While this approach is likely to be the most effective, it is more suitable for small genomes, as it can be computationally intensive for larger genomes. The homology search is typically conducted using the protein sequence of the genes to be tested. However, there has been an increasing trend in the use of protein structure, in addition to the sequence, depending on the specific biological question being investigated (Alvarez-Carreño et al. 2021; Middendorf et al. 2024; Van Kempen et al. 2024).
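To illustrate what a six-frame translation involves, the sketch below translates a DNA sequence in the three forward and three reverse-complement frames using the standard genetic code. It is a toy example for short sequences: the function names and frame labels are hypothetical, codons containing ambiguous bases are translated as "X", and whole-genome searches would normally rely on a translated-search mode of an alignment tool instead of a script like this.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate complete codons; stop codons become '*', unknown codons 'X'."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frame(seq: str) -> dict:
    """Return the six conceptual translations keyed by frame label."""
    rc = reverse_complement(seq)
    frames = {}
    for k in range(3):
        frames[f"+{k + 1}"] = translate(seq[k:])
        frames[f"-{k + 1}"] = translate(rc[k:])
    return frames
```

Because the output grows to six protein-like strings per genomic sequence, the computational cost mentioned above scales directly with genome size.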
Using the DNA sequence to include noncoding RNAs
A homology search can also be performed based on the DNA sequence of candidate de novo genes. This can be useful when looking for homology in noncoding RNAs (ncRNAs). In such instances, the direction of the alignment should be considered, as well as the coverage, given that two overlapping transcripts could have originated from distinct promoters (Grandchamp et al. 2023a). Furthermore, depending on the biological question, it may be required that a de novo gene is not derived from a transposable element (TE) or from an annotated and conserved ncRNA. To address this, the ORF or transcript can be searched for homology against a database comprising TEs and ncRNAs from query and outgroup species. An important caveat is that, if proteogenomic evidence of translation exists for a given genomic sequence (Slavoff et al. 2013; Chen et al. 2020; Duffy et al. 2022; Mudge et al. 2022), then such direct evidence overrules the similarity with a long noncoding RNA (lncRNA), and may indicate that the lncRNA is in fact coding (Prensner et al. 2021). Importantly, the use of DNA sequences can be problematic for de novo genes that emerged through specific mechanisms such as overprinting or antisense emergence. More precisely, such candidates might exhibit significant DNA similarity with genes they overlap with, leading to their erroneous exclusion from a list of potential de novo genes.
Available tools for sequence similarity searches
Several tools are available to search for homologous sequences. BLAST (Altschul et al. 1990) is commonly used for homology searches and is recommended because of its speed and accuracy. When working with a large database such as the NCBI nr or RefSeq, a faster tool for local alignments than BLAST, such as DIAMOND (Buchfink et al. 2021), can be used. As de novo genes that show homology to existing proteins should be removed from the dataset of potential de novo genes, the choice of homology criteria is important.
For highly divergent proteins, algorithms such as BLAST have been shown to be prone to false negatives (Moyers and Zhang 2016, 2017; James et al. 2021). Profile-based algorithms, including those built on Hidden Markov Models (HMMs), have demonstrated greater sensitivity in detecting distant homologies. PSI-BLAST (Altschul and Koonin 1998) iteratively builds a position-specific scoring matrix, while profile-HMM tools such as HMMER3 (Finn et al. 2011) and JackHMMER (Johnson et al. 2010) additionally offer improved handling of insertions and deletions (Eddy 2011). For example, GenEra (Barrera-Redondo et al. 2023) integrates HMM-based methods alongside BLAST to enhance the detection of distant homologs.
Different E-value thresholds can be used to assess homology (Vakirlis et al. 2020), although an E-value of 10e-2 should be the highest tolerated. For example, one might want to be extremely restrictive when studying a single de novo gene involved in a specific function, to ensure that it shows no overlap with any other gene. A more relaxed threshold can be applied if the phylogenetic group includes many species and the homology search is performed against very distant species. An additional measure is the alignment coverage (Long and Langley 1993; McLysaght and Hurst 2016) (Fig. 2b, Table 1).
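The two criteria above, an E-value cutoff and an alignment coverage cutoff, can be combined in a small filter. The following is a minimal sketch, assuming tabular BLAST or DIAMOND output with the standard 12 columns (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore) and a dictionary of query lengths in the same units as the alignment coordinates. The function name and the default cutoffs (1e-3 and 50% query coverage) are illustrative placeholders, not values recommended in this review.

```python
def queries_with_homologs(blast_lines, query_lengths,
                          max_evalue=1e-3, min_coverage=0.5):
    """Return the query IDs that have at least one hit passing both cutoffs.

    blast_lines: iterable of tab-separated lines (standard 12-column format).
    query_lengths: dict mapping query ID to query length.
    The cutoffs are hypothetical defaults chosen for illustration only.
    """
    flagged = set()
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        qseqid = fields[0]
        qstart, qend = int(fields[6]), int(fields[7])
        evalue = float(fields[10])
        # Fraction of the query covered by this single alignment.
        coverage = (abs(qend - qstart) + 1) / query_lengths[qseqid]
        if evalue <= max_evalue and coverage >= min_coverage:
            flagged.add(qseqid)
    return flagged


# Toy example: three candidates, one of which has a convincing homolog.
hits = [
    "cand1\tprotA\t45.0\t80\t44\t0\t1\t80\t10\t89\t1e-20\t95.0",
    "cand2\tprotB\t30.0\t20\t14\t0\t5\t24\t1\t20\t0.5\t25.0",
    "cand3\tprotC\t60.0\t15\t6\t0\t1\t15\t1\t15\t1e-5\t40.0",
]
lengths = {"cand1": 100, "cand2": 90, "cand3": 120}

with_homologs = queries_with_homologs(hits, lengths)
remaining = sorted(set(lengths) - with_homologs)
print(remaining)
```

In this toy input, only cand1 is removed: cand2 fails the E-value cutoff and cand3 fails the coverage cutoff, so both stay in the list of potential de novo genes.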
Predicting protein structures for homology searches
Recent advancements in protein structure prediction, most importantly by AlphaFold2 (Jumper et al. 2021), have led to new opportunities for phylogenetic analyses based on protein structures (Moi et al. 2025). Protein structures exhibit greater conservation than their sequences (Illergård et al. 2009), which raises the possibility that putative de novo genes actually represent highly divergent orthologs (Casola 2018). To further confirm a de novo origin, structural similarity searches can be conducted using tools such as Foldseek (Van Kempen et al. 2024). Foldseek enables rapid comparison of structural similarities across a broad range of databases, encompassing both experimental and computationally derived structures. It is important to note that the identification of de novo genes should be based primarily on phylogenomic evidence of their recent emergence from noncoding sequences, rather than on structural uniqueness. Even if a protein product of a confirmed de novo gene shows structural similarity to existing proteins, this does not negate its putative de novo origin. Such structural convergence may reflect similar selective pressures leading to analogous three-dimensional solutions, or structures that are easy to reach within sequence-structure space, despite independent evolutionary origins. However, the commonly used AlphaFold2 (Jumper et al. 2021) primarily relies on co-evolutionary data derived from multiple sequence alignments (MSAs), which are inherently sparse for de novo proteins, impacting the reliability of predictions (Fig. 2b, Table 1) (Jumper et al. 2021; Aubel et al. 2023; Liu et al. 2023). Given this limitation, there has been growing interest in structure predictors that utilize protein language models. These models are thought to be more suitable for predicting the structures of de novo proteins and other orphan proteins, where sequence homologies are limited or nonexistent (Chowdhury et al. 2022; Michaud et al. 2022; Aubel et al. 2023; Lin et al. 2023; Liu et al. 2023; Middendorf and Eicholt 2024). However, it is important to note that both AlphaFold2 and protein language model-based tools, such as ESMfold, have been shown to inaccurately predict structures of de novo proteins, and with discordant confidence scores (Aubel et al. 2023; Middendorf and Eicholt 2024). The most recent implementation of AlphaFold, AlphaFold3 (Abramson et al. 2024), has yet to be tested for its performance on orphan proteins and de novo emerged proteins. Recent studies have successfully utilized molecular dynamics (MD) simulations as a refinement step to explore the structural dynamics of de novo proteins (Lange et al. 2021; Middendorf et al. 2024; Peng and Zhao 2024).
After the homology filtering step, the list of candidate genes is reduced to a list of potential de novo genes, containing only genes that do not have detected homologs outside the studied taxonomic group.
Noncoding Homologs
The detection of syntenic noncoding sequences, homologous to all potential de novo genes under investigation, in target species or populations that are outgroup to the ones expressing the potential de novo genes, is for now the last step to provide evidence for a de novo emergence. In this review, we define a “noncoding homolog” as a homologous sequence that supports the validation of a de novo gene emergence. However, determining whether a genomic sequence is truly noncoding can be challenging. As a result, several studies define noncoding homologs as sequences lacking an open reading frame (ORF) that could encode a protein homologous to the one produced by the de novo gene (Vakirlis and McLysaght 2018; Sandmann et al. 2023; Wacholder et al. 2023). In such cases, an insertion in the homologous sequence would not necessarily prevent translation, but would result in a different reading frame and thereby in a loss of protein homology.
However, identification of syntenic regions and a coding status can be challenging, and the absence of a “syntenic noncoding homolog” does not necessarily invalidate a de novo origin.
The de novo origin of a potential de novo gene can be suspected under the following conditions:
- Homologous sequences to the de novo gene can be detected in the genomes of several target species or populations. Such target species or populations must be outgroups to the phylogenetic group, species or population where the de novo genes under investigation are present.
- The identified homologous sequences are noncoding, or would encode a protein sufficiently different from the one encoded by the candidate, for example due to a frameshift early in the sequence.
- The identified homologous sequences are in a genomic location that is syntenic to the de novo gene.
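The three conditions above can be read as a simple decision rule. The following is a minimal sketch of that rule, not a published method: the record fields, the function name and the requirement of at least two supporting target genomes (our reading of “several”) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TargetEvidence:
    """Evidence collected for one candidate in one target genome (hypothetical schema)."""
    species: str
    is_outgroup: bool
    homolog_found: bool
    homolog_is_noncoding: bool = False
    homolog_is_syntenic: bool = False


def supports_de_novo(evidence, min_supporting_targets=2):
    """Apply the three conditions to a list of TargetEvidence records."""
    supporting = [
        e for e in evidence
        if e.is_outgroup and e.homolog_found
        and e.homolog_is_noncoding and e.homolog_is_syntenic
    ]
    # A coding homolog in any outgroup argues against emergence in the query lineage.
    contradicting = [
        e for e in evidence
        if e.is_outgroup and e.homolog_found and not e.homolog_is_noncoding
    ]
    return len(supporting) >= min_supporting_targets and not contradicting


evidence = [
    TargetEvidence("outgroup_1", True, True, True, True),
    TargetEvidence("outgroup_2", True, True, True, True),
    TargetEvidence("outgroup_3", True, False),  # no homolog detected: uninformative
]
print(supports_de_novo(evidence))
```

Target genomes without any detectable homolog are treated as uninformative here rather than as evidence against a de novo origin, in line with the caveat that a missing syntenic noncoding homolog does not invalidate the candidate.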
The following steps are required to detect syntenic noncoding homologs:
Selection of Target Genomes for Synteny Search
In order to identify syntenic noncoding homologs, a set of target genomes must be selected. This set of target genomes will be used to validate or invalidate a de novo emergence for all remaining genes from the previously filtered set. For instance, when studying the first steps of de novo gene emergence within a species, the target genomes should be those from individuals or populations of the same species that do not contain the de novo gene(s) of interest. Conversely, when searching for de novo genes specific to a taxonomic group that includes several species, the target genomes should be closely related to that taxonomic group, but have diverged earlier than the root of this group. The optimal number of target genomes required for the identification of noncoding homologs remains undetermined; however, it is generally accepted that the greater the number of genomes analyzed, the more robust the conclusions drawn (Fig. 2c, Table 1).
The choice of outgroup genomes is not only crucial but also requires careful interpretation of the results. When the selected outgroups are too distantly related to the query genome, the absence of homologous sequences should not be taken as definitive evidence for ruling out a de novo origin. More critically, when only a few distantly related and poorly annotated genomes are available, the syntenic regions may span large genomic intervals due to a low density of annotated genes. In such cases, the likelihood of identifying false nongenic homologs increases, particularly given the small size of many de novo genes. As demonstrated by Roginski et al. (2024), when synteny searches were conducted using large genomic windows (e.g. ten neighboring genes), nearly 100% of candidate de novo genes appeared to have homologs in outgroup genomes. In contrast, using more narrowly defined syntenic regions reduced this number to below 50%, indicating a high rate of false positives in the broader search context.
Homology Search between the Query de novo Gene and the Target Genomes
Once the target species have been identified, genomic sequences homologous to the potential de novo gene can be searched for. During this step, the homology search is performed against the genome of all target species. One option is to use tBLASTn, by using the de novo translated ORF as a query (Vakirlis and McLysaght 2018). However, the most precise option to detect homologous sequences independently of their frame of translation is to use BLASTn. If the ORF is small, and if the unspliced gene contains one or several introns, an option is to use the unspliced ORF as a query for a nucleotide BLAST against the target genome, and then splice the resulting alignment (Grandchamp et al. 2023b). If the target genome belongs to a species that is phylogenetically distant from the query species, alignment programs that allow more divergence such as exonerate (Slater and Birney 2005) can also be used to search for homology.
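The two search modes described above differ mainly in the program and in the type of query sequence. The sketch below only assembles the command lines for the NCBI BLAST+ programs `blastn` and `tblastn`; it does not run them. The file and database names are hypothetical, the E-value default is a placeholder, and the target genome is assumed to have been formatted beforehand with `makeblastdb`.

```python
def build_blast_command(program, query_fasta, target_db, evalue=1e-3):
    """Return a BLAST+ command as an argument list (suitable for subprocess.run).

    program: "blastn" (nucleotide ORF or unspliced gene as query) or
             "tblastn" (translated ORF as query against a nucleotide database).
    """
    if program not in {"blastn", "tblastn"}:
        raise ValueError("program must be 'blastn' or 'tblastn'")
    columns = "6 qseqid sseqid pident length qstart qend sstart send evalue bitscore"
    # Report the subject strand for nucleotide searches and the subject
    # reading frame for translated searches.
    columns += " sstrand" if program == "blastn" else " sframe"
    return [
        program,
        "-query", query_fasta,
        "-db", target_db,
        "-evalue", str(evalue),
        "-outfmt", columns,
    ]


# Hypothetical file names, for illustration only.
nucleotide_search = build_blast_command("blastn", "candidate_orfs.fna", "target_genome")
protein_search = build_blast_command("tblastn", "candidate_orfs.faa", "target_genome")
print(" ".join(nucleotide_search[:5]))
```

Keeping the subject coordinates and strand in the output makes it possible to check afterwards whether a hit lies in the syntenic region and in the expected orientation.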
Search for Syntenic Regions
Genomic synteny refers to the conservation of genomic fragments between two genomes or chromosomes. If one or several homologous hits have been detected for a single query de novo gene, some of these hits can be further validated in each target species by confirming their location in a genomic region that is syntenic to the de novo gene. The order of this step and the previous one can also be reversed, meaning that the search for homologous sequences can be restricted to syntenic regions.
Methods for synteny detection
There are numerous methods available for synteny detection. Synteny can be compared between two complete genomes by fragmenting each chromosome into blocks based on sequence fragments, motifs, domains, etc., and determining similarity and location between blocks (Wang et al. 2012; Liu et al. 2018). Synteny can also be examined at a genic level by studying the conservation of the order of syntenic genes between genomes. In such cases, genes are selected as anchors to determine synteny, and the detection of synteny is based on gene orthology. For instance, SynChro (Drillon et al. 2014) and Synima (Farrer 2017) are software tools that detect synteny using reciprocal BLAST hits between genes from different genomes. Using genes as anchors for synteny is a rapid and effective approach when searching for syntenic hits of de novo genes that are intergenic (Vakirlis et al. 2020; Roginski et al. 2024). The genes neighboring the de novo gene are chosen as anchors and investigated for orthology in the target genome. If the noncoding homolog is flanked by genes orthologous to those surrounding the query de novo gene, the synteny is confirmed. The number of anchor genes can be adjusted based on the context. When working within populations or individuals of a single species or closely related species, a stringent requirement for complete synteny may be imposed. In such cases, noncoding sequences homologous to the candidate de novo gene are collected only if they are positioned between two genes homologous to those surrounding the query candidate. Other approaches also exist for synteny detection. Käther et al. (2023) introduced an approach called “Annotation-Free Identification of Potential Synteny Anchors” that does not rely on genes as anchors. Zhao and Schranz (2017) suggested using network approaches to infer synteny.
One of the best ways to validate synteny is to use whole-genome alignments. In such cases, the genomic region of target genomes that aligns to the de novo candidate from the query genome corresponds to the syntenic homolog. For instance, Wacholder et al. (2023) aligned syntenic conserved blocks to precisely locate the coordinates of noncoding homologs compared to candidate de novo genes in yeasts. Similarly, Sandmann et al. (2023) used a whole-genome alignment of 120 mammalian species and another alignment of 27 primate species to search for noncoding sequences homologous to human-translated micropeptides. Whole-genome alignments have also been used to identify de novo genes in Drosophila (Peng and Zhao 2024), though some appear to have been overlooked (Guay et al. 2025). Overall, whole-genome alignments are highly reliable but require several high-quality genomes, which are often not available.
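The anchor-gene strategy described above, in its most stringent form (one flanking gene on each side), can be sketched as follows. This is an illustration under simplifying assumptions: genes are given as (identifier, start, end) tuples on a single chromosome, orthology is provided as a ready-made dictionary from query gene to target gene, and all names are hypothetical. It is not the algorithm of any of the cited tools.

```python
def nearest_flanks(genes, start, end):
    """Return the IDs of the closest genes upstream and downstream of an interval.

    genes: list of (gene_id, gene_start, gene_end) on the same chromosome.
    """
    upstream = [g for g in genes if g[2] < start]
    downstream = [g for g in genes if g[1] > end]
    left = max(upstream, key=lambda g: g[2])[0] if upstream else None
    right = min(downstream, key=lambda g: g[1])[0] if downstream else None
    return left, right


def is_syntenic(query_genes, query_interval, target_genes, hit_interval, orthologs):
    """True if the hit is flanked by orthologs of the genes flanking the query."""
    q_left, q_right = nearest_flanks(query_genes, *query_interval)
    t_left, t_right = nearest_flanks(target_genes, *hit_interval)
    if None in (q_left, q_right, t_left, t_right):
        return False
    expected = {orthologs.get(q_left), orthologs.get(q_right)}
    # Sets are compared so that an inverted block is still accepted.
    return None not in expected and expected == {t_left, t_right}


query_genes = [("qA", 100, 500), ("qB", 2000, 2600)]
target_genes = [("tA", 50, 480), ("tB", 1800, 2500), ("tC", 5000, 5600)]
orthologs = {"qA": "tA", "qB": "tB"}

hit_between_anchors = is_syntenic(query_genes, (900, 1200), target_genes, (700, 1000), orthologs)
hit_elsewhere = is_syntenic(query_genes, (900, 1200), target_genes, (3000, 3300), orthologs)
print(hit_between_anchors, hit_elsewhere)
```

Relaxing the rule, for example by accepting any of several neighboring genes on each side as anchors, widens the syntenic window and increases the risk of spurious hits, which is the trade-off discussed for large genomic windows.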
Caveats when using synteny
While validating synteny between de novo candidates and homologous sequences is necessary, this step is also affected by methodological limitations. The definition and conservation of synteny depend on several criteria, such as the quality of genome annotation, alignments, and the selection of syntenic anchors, windows, and algorithms. Liu et al. (2018) demonstrated that synteny between species can be underestimated by up to 40% depending on the methodology chosen. Moreover, once a syntenic block is detected between a query and a target genome, the identification of a noncoding homolog also depends on the methodology. Therefore, the methodology used to detect and define synteny can vary from one project to another, leading to variable conclusions. Independently of the method used, the phylogenetic distance between the query genomes and selected target species influences synteny conservation: the greater the distance between genomes, the less conserved the synteny (Lemoine et al. 2007). For instance, macrosynteny tends to be preserved for approximately 10–100 million years, whereas microsynteny can remain conserved over several hundred million years. For example, many genes are syntenic within Chordates and Arthropods, each of which emerged around 560 million years ago (mya), but not between the two phyla (Vonica et al. 2020), which diverged approximately 708 mya (Kumar et al. 2022). Furthermore, synteny conservation can vary among taxa (e.g. plants, animals) even for similar phylogenetic distances (Roginski et al. 2024). Moreover, the detection of syntenic noncoding sequences homologous to de novo genes often fails due to factors such as extensive genomic rearrangements. When validation of de novo emergence through the detection of a noncoding homolog cannot be achieved, drawing conclusions about de novo emergence becomes challenging.
Some genes that emerge after a duplication event have been observed to evolve rapidly, diverging from their original sequence to an extent that no homology tool can reliably predict their origin (Gu et al. 2005; Pegueroles et al. 2013; Naseeb et al. 2017; Casola 2018; O’Toole et al. 2018). Consequently, such genes may exhibit no homology to any other annotated gene and could be mistakenly identified as de novo genes in the absence of a noncoding homolog (Weisman et al. 2020).
Assess the Coding Status of the Detected Homologous Sequences
Once a syntenic homolog of a potential de novo gene has been detected, the final step is to determine its coding status. To do so, the query sequence and its homolog are often re-aligned before deeper investigation (Sandmann et al. 2023; Wacholder et al. 2023; Peng et al. 2024). If one homolog shares the same coding properties as the potential de novo gene, then this gene did not emerge de novo, or at least not after the divergence of the two studied species (query and target). On the other hand, if all homologous sequences are noncoding, then a de novo origin in the query species is considered the most likely explanation for the candidate under investigation.
Assessing the coding/noncoding status of detected homologs remains the most challenging step of the entire pipeline. Several properties can be assessed to compare the coding status of the sequence homologous to the potential de novo gene, such as the presence of start and stop codons, premature stop codons, frameshift mutations, and splice sites in the case of introns (Grandchamp et al. 2023b). However, the question remains: are these features, or their absence, sufficient to validate or invalidate a coding gene status? For example, the absence of an ATG start codon in a noncoding homolog to a de novo candidate does not necessarily prevent translation, as several weaker start codons have been shown to be adequate for translation (Cao and Slavoff 2020), with some being conserved across evolution (Bazykin and Kochetov 2011). More precisely, small peptides have been shown to be frequently encoded by sORFs with non-AUG start codons (Peng et al. 2024). Wacholder et al. (2023) emphasize frameshift mutations as crucial features to consider, since the position of a frameshift in a putative noncoding homolog can significantly affect the divergence from the de novo candidate if both are translated. Sandmann et al. (2023) translated the homologous ORF, if any, and calculated a protein homology score.
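Some of the features listed above can be read directly from a pairwise nucleotide alignment between the candidate ORF and its homolog. The following is a minimal sketch, assuming a gapped alignment that starts at the first base of the candidate ORF, the standard stop codons, and no introns; the function names and the returned keys are hypothetical. As noted in the text, none of these features is decisive on its own, in particular the absence of an ATG start codon.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_in_frame_stop(seq):
    """0-based nucleotide index of the first in-frame stop codon, or None."""
    seq = seq.upper()
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None


def first_frameshift(aligned_query, aligned_target):
    """Alignment column of the first gap run whose length is not a multiple of 3."""
    if len(aligned_query) != len(aligned_target):
        raise ValueError("aligned sequences must have the same length")
    col, n = 0, len(aligned_query)
    while col < n:
        if aligned_query[col] == "-" or aligned_target[col] == "-":
            gapped = aligned_query if aligned_query[col] == "-" else aligned_target
            run_start = col
            while col < n and gapped[col] == "-":
                col += 1
            if (col - run_start) % 3 != 0:
                return run_start
        else:
            col += 1
    return None


def coding_features(aligned_query, aligned_target):
    """Summarize simple coding features of the homolog relative to the candidate ORF."""
    query = aligned_query.replace("-", "").upper()
    target = aligned_target.replace("-", "").upper()
    stop = first_in_frame_stop(target)
    return {
        "atg_start": target.startswith("ATG"),
        # Stop codon in the homolog located before the stop codon of the candidate.
        "premature_stop": stop is not None and stop + 3 < len(query),
        "frameshift_column": first_frameshift(aligned_query, aligned_target),
    }


# Toy candidate ORF: ATG GCT GCT AAA TGA
shifted = coding_features("ATGGCTGCTAAATGA", "ACGGC-GCTAAATGA")
truncated = coding_features("ATGGCTGCTAAATGA", "ATGGCTTAAAAATGA")
print(shifted)
print(truncated)
```

The first homolog lacks an ATG and carries a one-base deletion that shifts the frame; the second keeps the frame but has a stop codon at the third codon.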
Evaluating transcription of the noncoding homolog also improves the determination of a genic status. Transcription information is also useful for inferring the emergence of splice sites. Several studies have reported the presence of introns in de novo genes (Wu et al. 2011; Zhang et al. 2019; Grandchamp et al. 2022). Studying the emergence of these introns and the evolution/conservation of their splice sites would be essential, as the loss or gain of splicing could significantly alter the translated protein. To the best of our knowledge, such a study has not yet been conducted.
This last step must be conducted with caution, as it can lead to significant misinterpretations. Robust conclusions can only be reached if several strategically chosen target genomes are selected: the more, the better. The transition from a noncoding sequence to a protein-coding gene follows various steps (McLysaght and Guerzoni 2015; Ruiz-Orera et al. 2017). All mutations and transitions can occur in different orders (Carvunis et al. 2012; Iyengar et al. 2024; Lebherz et al. 2024b). More importantly, the process of acquiring a coding status can go back and forth during evolution, as the initial stages of de novo emergence are a priori not subject to selection pressures (Carvunis et al. 2012; Iyengar and Bornberg-Bauer 2023). Therefore, the detection of noncoding sequences homologous to a candidate de novo gene can only be valuable if such a noncoding status is confirmed in several target genomes, as a coding homolog could hypothetically also be detected in more divergent species that were not studied (Fig. 2c, Table 1).
After all these steps, among the set of potential de novo genes under investigation, the ones that have noncoding syntenic homologs in all target genomes can be validated as de novo genes.
Evolutionary Information
What selective pressures act on a de novo gene? According to the model proposed in 2012 (Carvunis et al. 2012), the emergence of a new gene from a noncoding sequence involves two main steps: the first is the emergence of a proto-gene, which is a transcribed and translated ORF whose genomic sequence is not yet under selection, producing a small peptide that is likely gained and lost through evolution. The second stage is when a proto-gene becomes fixed in a species due to selection, achieving the status of a de novo gene (Van Oss and Carvunis 2019). It is challenging to determine whether a de novo gene is fixed in a species, thereby gaining de novo gene status, or whether it is not yet fixed, in which case it is classified as a proto-gene. Measurements of selection pressures can be used (Feldmeyer et al. 2024) to distinguish between these two. Moreover, the method used to detect de novo genes influences which of these two types most candidate genes belong to.
De novo genes extracted from an annotated genome are likely to become fixed or are fixed already, as their coding features are robust enough to be detected by standard annotation methods. Several studies have demonstrated that de novo genes extracted from annotated genomes are under purifying selection both within and between species (Li et al. 2010; Palmieri et al. 2014). Moreover, specific codons have been shown to be enriched in such de novo genes (Hershberg and Petrov 2008; Wallace et al. 2013; Schlötterer 2015).
Assessing de novo genes extracted from transcriptomes and/or proteomes is more challenging. Labeling such sequences as de novo genes should be supported by evidence of purifying selection, conservation within populations of a species and translational evidence. If no selection tests are performed, the term proto-gene is most commonly used. The term ORFans (Vakirlis and McLysaght 2018) or newly expressed ORFs (Grandchamp et al. 2023b) is used for ORFs that were extracted from transcriptomes without evidence of translation. Newly translated ORFs is the commonly used term for ORFs with evidence of translation whose level of transcription is unknown. However, the validation of a de novo status does not have to be supported by all these conditions. For instance, in the case of genes annotated by ab initio methods, evidence of transcription is generally not provided, unless additional laboratory experiments are conducted. Moreover, ab initio and homology-based methods do not provide evidence of selection for the identified genes (Burge and Karlin 1997; Kryazhimskiy and Plotkin 2008). Conversely, if an unannotated ORF exhibits direct evidence of both transcription and translation, there is no conceptually valid reason to apply more restrictive criteria than for canonical genes.
Unfortunately, assessing evidence of selection in de novo genes remains extremely challenging (Fig. 2d, Table 1). Selection pressure is often assessed using metrics such as the dN/dS ratio (Yang and Bielawski 2000; Hurst 2002; Kosakovsky Pond and Frost 2005) or the pN/pS ratio (McDonald and Kreitman 1991). However, both of these metrics are designed for coding sequences. Therefore, the presence of noncoding homologs or noncoding variants of a de novo emerged ORF poses problems for their calculation. While these difficulties do not prevent the study of selection among all coding samples of a de novo emerged ORF, a future challenge would be to incorporate noncoding sequences into a calculation of selective pressure, to gain a clearer understanding of selection dynamics in the earliest stages of emergence.
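For reference, the dN/dS ratio mentioned above is conventionally defined as follows. The notation is generic and not taken from this review, and the interpretation given afterwards is the usual rule of thumb, which assumes that synonymous changes are effectively neutral.

```latex
\omega = \frac{d_N}{d_S}, \qquad
d_N = \frac{\text{number of nonsynonymous substitutions}}{\text{number of nonsynonymous sites}}, \qquad
d_S = \frac{\text{number of synonymous substitutions}}{\text{number of synonymous sites}}
```

Values of ω below 1 are usually interpreted as purifying selection, values close to 1 as neutral evolution, and values above 1 as positive selection. The pN/pS ratio is built in the same way from polymorphisms segregating within a population instead of substitutions between species. In both cases, classifying a change as synonymous or nonsynonymous requires codons in a defined reading frame, which is why a noncoding homolog cannot enter the calculation directly.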
Lastly, most de novo ORFs are shorter than canonical ORFs and are present in a limited number of species or populations, which limits the statistical power to confidently detect selection (Wacholder et al. 2023). Several studies have addressed the challenge of assessing selection on de novo emerged ORFs. For example, Ward and Kellis (2012) attempted to understand whether the large portion of the human genome that is biochemically active shows evidence of purifying selection. By using genome alignments and studying sequence conservation, they found that 4% of the human genome is subject to lineage-specific constraint, in addition to the 5% already known. Kellis et al. (2003) developed a reading frame conservation (RFC) test to classify all ORFs of S. cerevisiae as either biologically meaningful or meaningless. This RFC test was later adapted by Wacholder et al. (2023) to distinguish ORFs evolving under selection from other ORFs in the yeast genome, particularly those showing weak signals in more classical selection tests. While they found no evidence of purifying selection acting on most of these de novo emerged ORFs, a few showed signs of selection.
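The idea behind a reading frame conservation score can be illustrated on a single pairwise alignment. The sketch below is a simplified reading of the concept and not the published RFC test: it assumes one gapped nucleotide alignment between an ORF and its orthologous region, ignores any windowing or decision thresholds, and reports the fraction of gapless columns in which both sequences are in the same codon position, for the best of the three possible frame offsets.

```python
def reading_frame_conservation(aligned_orf, aligned_ortholog):
    """Fraction of gapless columns with matching codon position (best of 3 offsets)."""
    if len(aligned_orf) != len(aligned_ortholog):
        raise ValueError("aligned sequences must have the same length")
    orf_pos = 0        # ungapped position in the ORF
    ortholog_pos = 0   # ungapped position in the ortholog
    counts = [0, 0, 0]  # gapless columns per relative frame offset
    gapless = 0
    for a, b in zip(aligned_orf, aligned_ortholog):
        if a != "-" and b != "-":
            gapless += 1
            counts[(orf_pos - ortholog_pos) % 3] += 1
        if a != "-":
            orf_pos += 1
        if b != "-":
            ortholog_pos += 1
    return max(counts) / gapless if gapless else 0.0


# A one-base deletion changes the frame for the rest of the alignment,
# whereas a three-base deletion leaves the frame intact.
rfc_frameshift = reading_frame_conservation("ATGGCTGCTAAATGA", "ATGGC-GCTAAATGA")
rfc_in_frame = reading_frame_conservation("ATGGCTGCTAAATGA", "ATG---GCTAAATGA")
print(round(rfc_frameshift, 3), rfc_in_frame)
```

Because the score depends only on insertions and deletions, it can be computed even when the homologous region contains stop codons or lacks a start codon, which is one reason such a measure is attractive for very young ORFs.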
Available Software
The identification of de novo genes is contingent on numerous methodological decisions, with custom scripts or programs frequently required for multiple steps in the process. Fortunately, recent advancements have led to the publication of various tools and software that automate de novo gene detection, either completely or partially. Singh and Wurtele (2021) developed orfipy, which facilitates the detection of ORFs in new transcriptomes that can subsequently be used to search for de novo genes in transcriptomic data. The R package phylostratr (Arendsee et al. 2019b) infers a phylostratum for each input query gene, thereby enabling the identification of homology to a candidate gene. GenEra (Barrera-Redondo et al. 2023) detects taxonomically restricted genes. The software fagin (Arendsee et al. 2019a) partially automates, and DENSE (Roginski et al. 2024) fully automates, the detection of de novo genes in an annotated genome. An automated tool for the detection of de novo genes based on transcriptomic data is unfortunately not yet available.
Challenges & Conclusions
In conclusion, despite significant advances in understanding de novo gene emergence, two major challenges remain. Firstly, current methods for detecting de novo genes are largely limited to evolutionarily young genes, making it difficult to discern the origins of ancient genes within large and complex gene families. This limitation stems from the fact that existing approaches can only trace the recent origin of a gene, which becomes increasingly challenging as the gene ages and undergoes multiple rounds of duplication and divergence of sequence and function. As a result, our current understanding of de novo gene emergence is biased towards recently evolved genes, leaving a significant gap in our knowledge of how older de novo genes originated. Novel approaches for remote homology detection and improved structure predictions could help us address this bias in the future.
Secondly, the lack of standardization in methodology and terminology hinders comparability between studies, with different approaches and thresholds yielding disparate results even when analyzing the same species. We address this problem directly in our accompanying paper by providing a standardized annotation format based on the identified classifications described in this review. Such a standardized annotation format represents a crucial step towards achieving a common framework, enabling researchers to compare and build upon each other’s work more effectively.
By establishing a common framework for describing, analyzing and comparing de novo gene studies, we can enhance reproducibility and comparability and ultimately drive progress in this rapidly evolving field. Despite the remaining challenges in this young field, our work paves the way for future studies to refine methods and integrate de novo gene searches into standard gene annotation pipelines, unlocking new biological insights into the origins of genes.
Acknowledgments
We thank Erich Bornberg-Bauer (E.B.-B.), Nikolaos Vakirlis, Li Zhao, Anne Lopes, Bharat Ravi Iyengar, and Andreas Lange for their useful feedback during the planning phase of the project. E.D. was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)—503348080. This grant with the additional grant number BO2544-22-1 was awarded to E.B.-B. M.A. received funding from the Volkswagen foundation with grant code 98183 awarded to E.B.-B. The work by A.G. was supported by the Deutsche Forschungsgemeinschaft priority program “Genomic Basis of Evolutionary Innovations” (SPP 2349) BO 2544/20-1 to E.B.-B, and by the Human Frontier Science Program Research Grant RGP004/2023 (doi.org/10.52044/HFSP.RGP0042023.pc.gr.168590) awarded to Erich Bornberg-Bauer, Anne-Ruxandra Carvunis and Christine Brun. L.A.E. has been supported by EMBO Scientific Exchange Grant 10944. V.L. was supported by NIH grant R01NS095654 (to Nenad Sestan). We acknowledge support from the Open Access Publication Fund of the University of Münster.
Contributor Information
Anna Grandchamp, Aix Marseille University, INSERM, TAGC institute, UMR_S1090, 13288 Marseille, France.
Margaux Aubel, Institute for Evolution and Biodiversity, University of Münster, Münster 48149, Germany.
Lars A Eicholt, Institute for Evolution and Biodiversity, University of Münster, Münster 48149, Germany.
Paul Roginski, Institute for Integrative Biology of the Cell (I2BC), Université Paris-Saclay, CEA, CNRS, Gif-sur-Yvette 91198, France.
Victor Luria, Department of Neuroscience, Yale School of Medicine, New Haven, CT 06510, USA; Department of Systems Biology, Harvard Medical School, Boston, MA 02115, USA; Division of Genetics and Genomics, Boston Children’s Hospital, Harvard Medical School, Boston, MA 02115, USA.
Amir Karger, IT-Research Computing, Harvard Medical School, Boston, MA 02115, USA.
Elias Dohmen, Institute for Evolution and Biodiversity, University of Münster, Münster 48149, Germany.
Author Contributions
A.G. and E.D. were responsible for the conceptualization of the review and handled project administration. A.G. wrote the first draft of the review. M.A., L.E., and E.D. restructured the text and implemented sections. M.A. edited the figures. P.R., V.L., and A.K. provided feedback and modifications on the text. The final manuscript was edited and reviewed by all authors. Erich Bornberg-Bauer provided general administrative support and acquisition of the financial support for the project leading to this publication.
Data Availability
All data and source code for the developed annotation format based on this work can be found at https://github.com/EDohmen/denofo.
Literature Cited
- Abramson J et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature. 2024:630:493–500. ISSN 1476-4687. 10.1038/s41586-024-07487-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990:215:403–410. 10.1016/S0022-2836(05)80360-2. [DOI] [PubMed] [Google Scholar]
- Altschul SF, Koonin EV. Iterated profile searches with psi-blast—a tool for discovery in protein databases. Trends Biochem Sci. 1998:23:444–447. 10.1016/S0968-0004(98)01298-5. [DOI] [PubMed] [Google Scholar]
- Alvarez-Carreño C, Penev PI, Petrov AS, Williams LD. Fold evolution before LUCA: common ancestry of SH3 domains and OB domains. Mol Biol Evol. 2021:38:5134–5143. ISSN 1537-1719. 10.1093/molbev/msab240. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Andjus S et al. Pervasive translation of Xrn1-sensitive unstable long noncoding RNAs in yeast. RNA. 2024:30:662–679. 10.1261/rna.079903.123. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Ardern Z. Alternative reading frames are an underappreciated source of protein sequence novelty. J Mol Evol. 2023:91:570–580. 10.1007/s00239-023-10122-3. [DOI] [PubMed] [Google Scholar]
- Ardern Z, Neuhaus K, Scherer S. Are antisense proteins in prokaryotes functional? Front Mol Biosci. 2020:7:187. 10.3389/fmolb.2020.00187. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Arendsee Z, Li J, Singh U, Bhandary P, Seetharam A, Wurtele ES. Fagin: synteny-based phylostratigraphy and finer classification of young genes. BMC Bioinformatics. 2019a:20:1–14. 10.1186/s12859-019-3023-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Arendsee Z, Li J, Singh U, Seetharam A, Dorman K, Wurtele ES. phylostratr: a framework for phylostratigraphy. Bioinformatics. 2019b:35:3617–3627. 10.1093/bioinformatics/btz171. [DOI] [PubMed] [Google Scholar]
- Aubel M, Eicholt L, Bornberg-Bauer E. Assessing structure and disorder prediction tools for de novo emerged proteins in the age of machine learning. F1000Res. 2023:12:347. 10.12688/f1000research. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Baker L, David C, Jacobs DJ. Ab initio gene prediction for protein-coding regions. Bioinform Adv. 2023:3:vbad105. 10.1093/bioadv/vbad105. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Baltimore D. Viral rna-dependent dna polymerase: RNA-dependent DNA polymerase in virions of RNA tumour viruses. Nature. 1970:226:1209–1211. 10.1038/2261209a0. [DOI] [PubMed] [Google Scholar]
- Barman P, Reddy D, Bhaumik SR. Mechanisms of antisense transcription initiation with implications in gene expression, genomic integrity and disease pathogenesis. Noncoding RNA. 2019:5:11. 10.3390/ncrna5010011. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Barrera-Redondo J, Lotharukpong JS, Drost H-G, Coelho SM. Uncovering gene-family founder events during major evolutionary transitions in animals, plants and fungi using genera. Genome Biol. 2023:24:54. 10.1186/s13059-023-02895-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bazykin GA, Kochetov AV. Alternative translation start sites are conserved in eukaryotic genomes. Nucleic Acids Res. 2011:39:567–577. 10.1093/nar/gkq806. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Bazzini AA et al. Identification of small ORF s in vertebrates using ribosome footprinting and evolutionary conservation. EMBO J. 2014:33:981–993. 10.1002/embj.201488411. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Begun DJ, Lindfors HA, Kern AD, Jones CD. Evidence for de novo evolution of testis-expressed genes in the Drosophila yakuba/Drosophila erecta clade. Genetics. 2007:176:1131–1137. 10.1534/genetics.106.069245.
- Begun DJ, Lindfors HA, Thompson ME, Holloway AK. Recently evolved genes identified from Drosophila yakuba and D. erecta accessory gland expressed sequence tags. Genetics. 2006:172:1675–1681. 10.1534/genetics.105.050336.
- Blevins WR et al. Uncovering de novo gene birth in yeast using deep transcriptomics. Nat Commun. 2021:12:604. 10.1038/s41467-021-20911-3.
- Broeils LA, Ruiz-Orera J, Snel B, Hubner N, van Heesch S. Evolution and implications of de novo genes in humans. Nat Ecol Evol. 2023:7:804–815. 10.1038/s41559-023-02014-y.
- Buchfink B, Reuter K, Drost H-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat Methods. 2021:18:366–368. 10.1038/s41592-021-01101-x.
- Burge C, Karlin S. Prediction of complete gene structures in human genomic DNA. J Mol Biol. 1997:268:78–94. 10.1006/jmbi.1997.0951.
- Cai J, Zhao R, Jiang H, Wang W. De novo origination of a new protein-coding gene in Saccharomyces cerevisiae. Genetics. 2008:179:487–496. 10.1534/genetics.107.084491.
- Cao X, Slavoff SA. Non-AUG start codons: expanding and regulating the small and alternative ORFeome. Exp Cell Res. 2020:391:111973. 10.1016/j.yexcr.2020.111973.
- Carter JJ et al. Identification of an overprinting gene in Merkel cell polyomavirus provides evolutionary insight into the birth of viral genes. Proc Natl Acad Sci U S A. 2013:110:12744–12749. 10.1073/pnas.1303526110.
- Carvunis A-R et al. Proto-genes and de novo gene birth. Nature. 2012:487:370–374. 10.1038/nature11184.
- Casola C. From de novo to “de nono”: the majority of novel protein-coding genes identified with phylostratigraphy are old genes or recent duplicates. Genome Biol Evol. 2018:10:2906–2918. 10.1093/gbe/evy231.
- Chatterjee S, Pal JK. Role of 5’- and 3’-untranslated regions of mRNAs in human diseases. Biol Cell. 2009:101:251–262. 10.1042/BC20080104.
- Chen J et al. Pervasive functional translation of noncanonical human open reading frames. Science. 2020:367:1140–1146. 10.1126/science.aay0262.
- Chen J-Y et al. Emergence, retention and selection: a trilogy of origination for functional de novo proteins from ancestral LncRNAs in primates. PLoS Genet. 2015:11:e1005391. 10.1371/journal.pgen.1005391.
- Chowdhury R et al. Single-sequence protein structure prediction using a language model and deep learning. Nat Biotechnol. 2022:40:1617–1623. 10.1038/s41587-022-01432-w.
- Clark MB et al. The reality of pervasive transcription. PLoS Biol. 2011:9:e1000625. 10.1371/journal.pbio.1000625.
- Coffin JM, Fan H. The discovery of reverse transcriptase. Annu Rev Virol. 2016:3:29–51. 10.1146/virology.2016.3.issue-1.
- Cordaux R, Udit S, Batzer MA, Feschotte C. Birth of a chimeric primate gene by capture of the transposase gene from a mobile element. Proc Natl Acad Sci U S A. 2006:103:8101–8106. 10.1073/pnas.0601161103.
- Delaye L, DeLuna A, Lazcano A, Becerra A. The origin of a novel gene through overprinting in Escherichia coli. BMC Evol Biol. 2008:8:31. 10.1186/1471-2148-8-31.
- Delihas N. An ancestral genomic sequence that serves as a nucleation site for de novo gene birth. PLoS One. 2022:17:e0267864. 10.1371/journal.pone.0267864.
- Dinger ME, Pang KC, Mercer TR, Mattick JS. Differentiating protein-coding and noncoding RNA: challenges and ambiguities. PLoS Comput Biol. 2008:4:e1000176. 10.1371/journal.pcbi.1000176.
- Dohmen E et al. DeNoFo: a file format and toolkit for standardised, comparable de novo gene annotation. Bioinformatics. 2025:41:btaf539. 10.1093/bioinformatics/btaf539.
- Domazet-Lošo T et al. No evidence for phylostratigraphic bias impacting inferences on patterns of gene emergence and evolution. Mol Biol Evol. 2017:34:843–856. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5400388/. 10.1093/molbev/msw284.
- Domazet-Lošo T, Tautz D. An evolutionary analysis of orphan genes in Drosophila. Genome Res. 2003:13:2213–2219. 10.1101/gr.1311003.
- Dowling D, Schmitz JF, Bornberg-Bauer E. Stochastic gain and loss of novel transcribed open reading frames in the human lineage. Genome Biol Evol. 2020:12:2183–2195. 10.1093/gbe/evaa194.
- Drillon G, Carbone A, Fischer G. SynChro: a fast and easy tool to reconstruct and visualize synteny blocks along eukaryotic chromosomes. PLoS One. 2014:9:e92621. 10.1371/journal.pone.0092621.
- Duffy EE et al. Developmental dynamics of RNA translation in the human brain. Nat Neurosci. 2022:25:1353–1365. 10.1038/s41593-022-01164-9.
- Eddy SR. A new generation of homology search tools based on probabilistic inference. In: Genome Informatics 2009: Genome Informatics Series. Vol. 23. World Scientific; 2009. p. 205–211.
- Eddy SR. Accelerated profile HMM searches. PLoS Comput Biol. 2011:7:e1002195. 10.1371/journal.pcbi.1002195.
- Farrer RA. Synima: a synteny imaging tool for annotated genome assemblies. BMC Bioinformatics. 2017:18:1–4. 10.1186/s12859-017-1939-7.
- Feldmeyer B et al. Comparative evolutionary genomics in insects. Methods Mol Biol. 2024:2802:473–514. 10.1007/978-1-0716-3838-5.
- Finn RD, Clements J, Eddy SR. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 2011:39:W29–W37. 10.1093/nar/gkr367.
- Freeman VJ. Studies on the virulence of bacteriophage-infected strains of Corynebacterium diphtheriae. J Bacteriol. 1951:61:675–688. 10.1128/jb.61.6.675-688.1951.
- Gilbert W. Why genes in pieces? Nature. 1978:271:501–501. 10.1038/271501a0.
- Grandchamp A, Berk K, Dohmen E, Bornberg-Bauer E. New genomic signals underlying the emergence of human proto-genes. Genes (Basel). 2022:13:284. 10.3390/genes13020284.
- Grandchamp A, Czuppon P, Bornberg-Bauer E. Quantification and modeling of turnover dynamics of de novo transcripts in Drosophila melanogaster. Nucleic Acids Res. 2023a:52:274–287. 10.1093/nar/gkad1079.
- Grandchamp A, Kühl L, Lebherz M, Brüggemann K, Parsch J, Bornberg-Bauer E. Population genomics reveals mechanisms and dynamics of de novo expressed open reading frame emergence in Drosophila melanogaster. Genome Res. 2023b:33:872–890. 10.1101/gr.277482.122.
- Grandchamp A, Lebherz MK, Dohmen E. Detect de novo expressed ORFs in transcriptomes with DESwoMAN [preprint]. bioRxiv. 2025. 10.1101/2025.06.10.658796.
- Gry M et al. Correlations between RNA and protein expression profiles in 23 human cell lines. BMC Genomics. 2009:10:1–14. 10.1186/1471-2164-10-365.
- Gu X, Zhang Z, Huang W. Rapid evolution of expression and regulatory divergences after yeast gene duplication. Proc Natl Acad Sci U S A. 2005:102:707–712. 10.1073/pnas.0409186102.
- Guay SY et al. An orphan gene is essential for efficient sperm entry into eggs in Drosophila melanogaster. Genetics. 2025:229:iyaf008. 10.1093/genetics/iyaf008.
- Gubala AM et al. The goddard and saturn genes are essential for Drosophila male fertility and may have arisen de novo. Mol Biol Evol. 2017:34:1066–1082. 10.1093/molbev/msx057.
- Guo W-J, Li P, Ling J, Ye S-P. Significant comparative characteristics between orphan and nonorphan genes in the rice (Oryza sativa L.) genome. Int J Genomics. 2007:2007:021676. 10.1155/2007/21676.
- Hangauer MJ, Vaughn IW, McManus MT. Pervasive transcription of the human genome produces thousands of previously unidentified long intergenic noncoding RNAs. PLoS Genet. 2013:9:e1003569. 10.1371/journal.pgen.1003569.
- Heames B, Schmitz J, Bornberg-Bauer E. A continuum of evolving de novo genes drives protein-coding novelty in Drosophila. J Mol Evol. 2020:88:382–398. 10.1007/s00239-020-09939-z.
- Heinen TJ, Staubach F, Häming D, Tautz D. Emergence of a new gene from an intergenic region. Curr Biol. 2009:19:1527–1531. 10.1016/j.cub.2009.07.049.
- Hershberg R, Petrov DA. Selection on codon bias. Annu Rev Genet. 2008:42:287–299. 10.1146/genet.2008.42.issue-1.
- Hurst LD. The Ka/Ks ratio: diagnosing the form of sequence evolution. Trends Genet. 2002:18:486–487. 10.1016/S0168-9525(02)02722-1.
- Illergård K, Ardell DH, Elofsson A. Structure is three to ten times more conserved than sequence–a study of structural response in protein cores. Proteins. 2009:77:499–508. 10.1002/prot.v77:3.
- Ingolia NT, Ghaemmaghami S, Newman JR, Weissman JS. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science. 2009:324:218–223. 10.1126/science.1168978.
- Ingolia NT, Lareau LF, Weissman JS. Ribosome profiling of mouse embryonic stem cells reveals the complexity and dynamics of mammalian proteomes. Cell. 2011:147:789–802. 10.1016/j.cell.2011.10.002.
- Innan H, Kondrashov F. The evolution of gene duplications: classifying and distinguishing between models. Nat Rev Genet. 2010:11:97–108. 10.1038/nrg2689.
- Iyengar BR, Bornberg-Bauer E. Neutral models of de novo gene emergence suggest that gene evolution has a preferred trajectory. Mol Biol Evol. 2023:40:msad079. 10.1093/molbev/msad079.
- Iyengar BR, Grandchamp A, Bornberg-Bauer E. How antisense transcripts can evolve to encode novel proteins. Nat Commun. 2024:15:6187. 10.1038/s41467-024-50550-3.
- James JE, Willis SM, Nelson PG, Weibel C, Kosinski LJ, Masel J. Universal and taxon-specific trends in protein sequences as a function of age. Elife. 2021:10:e57347. 10.7554/eLife.57347.
- Janssen P et al. The effect of background noise and its removal on the analysis of single-cell expression data. Genome Biol. 2023:24:140. 10.1186/s13059-023-02978-x.
- Ji Z, Song R, Regev A, Struhl K. Many lncRNAs, 5’UTRs, and pseudogenes are translated and some are likely to express functional proteins. Elife. 2015:4:e08890. 10.7554/eLife.08890.
- Jin G-H et al. Genetic innovations: transposable element recruitment and de novo formation lead to the birth of orphan genes in the rice genome. J Syst Evol. 2021:59:341–351. 10.1111/jse.v59.2.
- Johnson LS, Eddy SR, Portugaly E. Hidden Markov model speed heuristic and iterative HMM search procedure. BMC Bioinformatics. 2010:11:1–8. 10.1186/1471-2105-11-431.
- Jumper J et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021:596:583–589. 10.1038/s41586-021-03819-2.
- Kang Y-J et al. CPC2: a fast and accurate coding potential calculator based on sequence intrinsic features. Nucleic Acids Res. 2017:45:W12–W16. 10.1093/nar/gkx428.
- Käther K, Lemke S, Stadler PF. Annotation-free identification of potential synteny anchors. In: International Work-Conference on Bioinformatics and Biomedical Engineering. Springer; 2023. p. 217–230.
- Keeling DM, Garza P, Nartey CM, Carvunis A-R. The meanings of ‘function’ in biology and the problematic case of de novo gene emergence. Elife. 2019:8:e47014. 10.7554/eLife.47014.
- Keese PK, Gibbs A. Origins of genes: “big bang” or continuous creation? Proc Natl Acad Sci U S A. 1992:89:9489–9493. 10.1073/pnas.89.20.9489.
- Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature. 2003:423:241–254. 10.1038/nature01644.
- Kellis M et al. Defining functional DNA elements in the human genome. Proc Natl Acad Sci U S A. 2014:111:6131–6138. 10.1073/pnas.1318948111.
- Kent WJ. BLAT: the BLAST-like alignment tool. Genome Res. 2002:12:656–664. 10.1101/gr.229202.
- Kondo T et al. Small peptides switch the transcriptional activity of Shavenbaby during Drosophila embryogenesis. Science. 2010:329:336–339. 10.1126/science.1188158.
- Koralewski TE, Krutovsky KV. Evolution of exon-intron structure and alternative splicing. PLoS One. 2011:6:e18055. 10.1371/journal.pone.0018055.
- Kosakovsky Pond SL, Frost SD. Not so different after all: a comparison of methods for detecting amino acid sites under selection. Mol Biol Evol. 2005:22:1208–1222. 10.1093/molbev/msi105.
- Koussounadis A, Langdon SP, Um IH, Harrison DJ, Smith VA. Relationship between differentially expressed mRNA and mRNA-protein correlations in a xenograft model system. Sci Rep. 2015:5:10775. 10.1038/srep10775.
- Kovaka S, Zimin AV, Pertea GM, Razaghi R, Salzberg SL, Pertea M. Transcriptome assembly from long-read RNA-seq alignments with StringTie2. Genome Biol. 2019:20:1–13. 10.1186/s13059-019-1910-1.
- Kozak M. The scanning model for translation: an update. J Cell Biol. 1989:108:229–241. 10.1083/jcb.108.2.229.
- Kryazhimskiy S, Plotkin JB. The population genetics of dN/dS. PLoS Genet. 2008:4:e1000304. 10.1371/journal.pgen.1000304.
- Kumar S et al. TimeTree 5: an expanded resource for species divergence times. Mol Biol Evol. 2022:39:msac174. 10.1093/molbev/msac174.
- Lange A et al. Structural and functional characterization of a putative de novo gene in Drosophila. Nat Commun. 2021:12:1667. 10.1038/s41467-021-21667-6.
- Lebherz MK, Fouks B, Schmidt J, Bornberg-Bauer E, Grandchamp A. DNA transposons favor de novo transcript emergence through enrichment of transcription factor binding motifs. Genome Biol Evol. 2024a:16:evae134. 10.1093/gbe/evae134.
- Lebherz MK, Ravi Iyengar B, Bornberg-Bauer E. Modeling length changes in de novo ORFs during neutral evolution. Genome Biol Evol. 2024b:16:evae129. 10.1093/gbe/evae129.
- Lee BY, Kim J, Lee J. Intraspecific de novo gene birth revealed by presence–absence variant genes in Caenorhabditis elegans. NAR Genom Bioinform. 2022:4:lqac031. 10.1093/nargab/lqac031.
- Lemoine F, Lespinet O, Labedan B. Assessing the evolutionary rate of positional orthologous genes in prokaryotes using synteny data. BMC Evol Biol. 2007:7:1–18. 10.1186/1471-2148-7-237.
- Leong AZ-X, Lee PY, Mohtar MA, Syafruddin SE, Pung Y-F, Low TY. Short open reading frames (sORFs) and microproteins: an update on their identification and validation measures. J Biomed Sci. 2022:29:19. 10.1186/s12929-022-00802-5.
- Li C-Y et al. A human-specific de novo protein-coding gene associated with human brain functions. PLoS Comput Biol. 2010:6:e1000734. 10.1371/journal.pcbi.1000734.
- Li J, Liu C. Coding or noncoding, the converging concepts of RNAs. Front Genet. 2019:10:496. 10.3389/fgene.2019.00496.
- Li J et al. Foster thy young: enhanced prediction of orphan genes in assembled genomes. Nucleic Acids Res. 2021:50:e37. 10.1093/nar/gkab1238.
- Liang X-H, Shen W, Sun H, Migawa MT, Vickers TA, Crooke ST. Translation efficiency of mRNAs is increased by antisense oligonucleotides targeting upstream open reading frames. Nat Biotechnol. 2016:34:875–880. 10.1038/nbt.3589.
- Lin Z et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023:379:1123–1130. 10.1126/science.ade2574.
- Liu D, Hunt M, Tsai IJ. Inferring synteny between genome assemblies: a systematic evaluation. BMC Bioinformatics. 2018:19:1–13. 10.1186/s12859-017-2006-0.
- Liu J, Yuan R, Shao W, Wang J, Silman I, Sussman JL. Do “newly born” orphan proteins resemble “never born” proteins? A study using three deep learning algorithms. Proteins. 2023:91:1097–1115. 10.1002/prot.v91.8.
- Liu Y, Beyer A, Aebersold R. On the dependency of cellular protein levels on mRNA abundance. Cell. 2016:165:535–550. 10.1016/j.cell.2016.03.014.
- Lombardo KD, Sheehy HK, Cridland JM, Begun DJ. Identifying candidate de novo genes expressed in the somatic female reproductive tract of Drosophila melanogaster. G3 (Bethesda). 2023:13:jkad122. 10.1093/g3journal/jkad122.
- Long M, Langley CH. Natural selection and the origin of jingwei, a chimeric processed functional gene in Drosophila. Science. 1993:260:91–95. 10.1126/science.7682012.
- Makałowski W, Mitchell GA, Labuda D. Alu sequences in the coding regions of mRNA: a source of protein variability. Trends Genet. 1994:10:188–193. 10.1016/0168-9525(94)90254-2.
- Matoulkova E, Michalova E, Vojtesek B, Hrstka R. The role of the 3’ untranslated region in post-transcriptional regulation of protein expression in mammalian cells. RNA Biol. 2012:9:563–576. 10.4161/rna.20231.
- McDonald JH, Kreitman M. Adaptive protein evolution at the Adh locus in Drosophila. Nature. 1991:351:652–654. 10.1038/351652a0.
- McLysaght A, Guerzoni D. New genes from non-coding sequence: the role of de novo protein-coding genes in eukaryotic evolutionary innovation. Philos Trans R Soc Lond B Biol Sci. 2015:370:20140332. 10.1098/rstb.2014.0332.
- McLysaght A, Hurst LD. Open questions in the study of de novo genes: what, how and why. Nat Rev Genet. 2016:17:567–578. 10.1038/nrg.2016.78.
- Merino E, Balbás P, Puente JL, Bolívar F. Antisense overlapping open reading frames in genes from bacteria to humans. Nucleic Acids Res. 1994:22:1903–1908. 10.1093/nar/22.10.1903.
- Michaud JM, Madani A, Fraser JS. A language model beats AlphaFold2 on orphans. Nat Biotechnol. 2022:40:1576–1577. 10.1038/s41587-022-01466-0.
- Middendorf L, Eicholt LA. Random, de novo, and conserved proteins: how structure and disorder predictors perform differently. Proteins. 2024:92:757–767. 10.1002/prot.v92.6.
- Middendorf L, Ravi Iyengar B, Eicholt LA. Sequence, structure, and functional space of Drosophila de novo proteins. Genome Biol Evol. 2024:16:evae176. 10.1093/gbe/evae176.
- Mitelman F, Johansson B, Mertens F. The impact of translocations and gene fusions on cancer causation. Nat Rev Cancer. 2007:7:233–245. 10.1038/nrc2091.
- Moi D, Bernard C, Steinegger M, Nevers Y, Langleib M, Dessimoz C. Structural phylogenetics unravels the evolutionary diversification of communication systems in gram-positive bacteria and their viruses. Nat Struct Mol Biol. 2025:1–11. 10.1038/s41594-025-01649-8.
- Moldón A, Query C. Crossing the exon. Mol Cell. 2010:38:159–161. 10.1016/j.molcel.2010.04.010.
- Moyers BA, Zhang J. Phylostratigraphic bias creates spurious patterns of genome evolution. Mol Biol Evol. 2015:32:258–267. 10.1093/molbev/msu286.
- Moyers BA, Zhang J. Evaluating phylostratigraphic evidence for widespread de novo gene birth in genome evolution. Mol Biol Evol. 2016:33:1245–1256. 10.1093/molbev/msw008.
- Moyers BA, Zhang J. Further simulations and analyses demonstrate open problems of phylostratigraphy. Genome Biol Evol. 2017:9:1519–1527. 10.1093/gbe/evx109.
- Moyers BA, Zhang J. Toward reducing phylostratigraphic errors and biases. Genome Biol Evol. 2018:10:2037–2048. 10.1093/gbe/evy161.
- Mudge JM et al. Standardized annotation of translated open reading frames. Nat Biotechnol. 2022:40:994–999. 10.1038/s41587-022-01369-0.
- Naseeb S, Ames RM, Delneri D, Lovell SC. Rapid functional and evolutionary changes follow gene duplication in yeast. Proc R Soc Lond B Biol Sci. 2017:284:20171393. 10.1098/rspb.2017.1393.
- Neme R, Tautz D. Phylogenetic patterns of emergence of new genes support a model of frequent de novo evolution. BMC Genomics. 2013:14:117. 10.1186/1471-2164-14-117.
- Nieuwenhuis TO, Rosenberg AZ, McCall MN, Halushka MK. Tissue, age, sex, and disease patterns of matrisome expression in GTEx transcriptome data. Sci Rep. 2021:11:21549. 10.1038/s41598-021-00943-x.
- Nowell PC, Hungerford DA. Chromosome studies on normal and leukemic human leukocytes. J Natl Cancer Inst. 1960:25:85–109.
- O’Toole ÁN, Hurst LD, McLysaght A. Faster evolving primate genes are more likely to duplicate. Mol Biol Evol. 2018:35:107–118. 10.1093/molbev/msx270.
- Ohno S. Evolution by gene duplication. Springer Science & Business Media; 2013.
- Oliva M et al. The impact of sex on gene expression across human tissues. Science. 2020:369:eaba3066. 10.1126/science.aba3066.
- Orgogozo V, Peluffo AE, Morizot B. The “mendelian gene” and the “molecular gene”: two relevant concepts of genetic units. Curr Top Dev Biol. 2016:119:1–26. 10.1016/bs.ctdb.2016.03.002.
- Palmieri N, Kosiol C, Schlötterer C. The life cycle of Drosophila orphan genes. Elife. 2014:3:e01311. 10.7554/eLife.01311.
- Papadopoulos C et al. The ribosome profiling landscape of yeast reveals a high diversity in pervasive translation. Genome Biol. 2024:25:268. 10.1186/s13059-024-03403-7.
- Papadopoulos C et al. Intergenic ORFs as elementary structural modules of de novo gene birth and protein evolution. Genome Res. 2021:31:2303–2315. 10.1101/gr.275638.121.
- Patraquim P, Magny EG, Pueyo JI, Platero AI, Couso JP. Translation and natural selection of micropeptides from long non-canonical RNAs. Nat Commun. 2022:13:6515. 10.1038/s41467-022-34094-y.
- Patraquim P, Mumtaz MAS, Pueyo JI, Aspden JL, Couso J-P. Developmental regulation of canonical and small ORF translation from mRNAs. Genome Biol. 2020:21:128. 10.1186/s13059-020-02011-5.
- Pauli A et al. Toddler: an embryonic signal that promotes cell movement via Apelin receptors. Science. 2014:343:1248636. 10.1126/science.1248636.
- Pavesi A. Origin and evolution of overlapping genes in the family Microviridae. J Gen Virol. 2006:87:1013–1017. 10.1099/vir.0.81375-0.
- Pegueroles C, Laurie S, Albà MM. Accelerated evolution after gene duplication: a time-dependent process affecting just one copy. Mol Biol Evol. 2013:30:1830–1842. 10.1093/molbev/mst083.
- Peng J, Zhao L. The origin and structural evolution of de novo genes in Drosophila. Nat Commun. 2024:15:810. 10.1038/s41467-024-45028-1.
- Peng M, Wang T, Li Y, Zhang Z, Wan C. Mapping start codons of small open reading frames by N-terminomics approach. Mol Cell Proteomics. 2024:23:100860. 10.1016/j.mcpro.2024.100860.
- Petryszak R et al. Expression Atlas update: an integrated database of gene and protein expression in humans, animals and plants. Nucleic Acids Res. 2016:44:D746–D752. 10.1093/nar/gkv1045.
- Poretti M, Praz CR, Sotiropoulos AG, Wicker T. A survey of lineage-specific genes in Triticeae reveals de novo gene evolution from genomic raw material. Plant Direct. 2023:7:e484. 10.1002/pld3.v7.3.
- Prabh N, Rödelsperger C. De novo, divergence, and mixed origin contribute to the emergence of orphan genes in Pristionchus nematodes. G3 (Bethesda). 2019:9:2277–2286. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6643871/. 10.1534/g3.119.400326.
- Prensner JR et al. Noncanonical open reading frames encode functional proteins essential for cancer cell survival. Nat Biotechnol. 2021:39:697–704. 10.1038/s41587-020-00806-2.
- Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2005:33:D501–D504. 10.1093/nar/gki025.
- Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010:26:841–842. 10.1093/bioinformatics/btq033.
- Raghavan V, Kraft L, Mesny F, Rigerte L. A simple guide to de novo transcriptome assembly and annotation. Brief Bioinform. 2022:23:bbab563. 10.1093/bib/bbab563.
- Rancurel C, Khosravi M, Dunker AK, Romero PR, Karlin D. Overlapping genes produce proteins with unusual sequence properties and offer insight into de novo protein creation. J Virol. 2009:83:10719–10736. 10.1128/JVI.00595-09.
- Ranz JM, Casals F, Ruiz A. How malleable is the eukaryotic genome? Extreme rate of chromosomal rearrangement in the genus Drosophila. Genome Res. 2001:11:230–239. 10.1101/gr.162901.
- Renz PF, Valdivia-Francia F, Sendoel A. Some like it translated: small ORFs in the 5’UTR. Exp Cell Res. 2020:396:112229. 10.1016/j.yexcr.2020.112229.
- Rice P, Longden I, Bleasby A. EMBOSS: the European molecular biology open software suite. Trends Genet. 2000:16:276–277. 10.1016/S0168-9525(00)02024-2.
- Rivard EL et al. A putative de novo evolved gene required for spermatid chromatin condensation in Drosophila melanogaster. PLoS Genet. 2021:17:e1009787. 10.1371/journal.pgen.1009787.
- Roginski P, Grandchamp A, Quignot C, Lopes A. De novo emerged gene search in eukaryotes with DENSE. Genome Biol Evol. 2024:16:evae159. 10.1093/gbe/evae159.
- Rogozin IB et al. Purifying and directional selection in overlapping prokaryotic genes. Trends Genet. 2002:18:228–232. 10.1016/S0168-9525(02)02649-5.
- Rombel IT, Sykes KF, Rayner S, Johnston SA. ORF-FINDER: a vector for high-throughput gene identification. Gene. 2002:282:33–41. 10.1016/S0378-1119(01)00819-8.
- Roy SW. The origin of recent introns: transposons? Genome Biol. 2004:5:251. 10.1186/gb-2004-5-12-251.
- Ruiz-Orera J, Albà MM. Translation of small open reading frames: roles in regulation and evolutionary innovation. Trends Genet. 2019:35:186–198. 10.1016/j.tig.2018.12.003.
- Ruiz-Orera J, Messeguer X, Subirana JA, Albà MM. Long non-coding RNAs as a source of new peptides. Elife. 2014:3:1–24. 10.7554/eLife.03523.
- Ruiz-Orera J, Villanueva-Cañas JL, Blevins W, Albà M. De novo gene evolution: how do we transition from non-coding to coding? PeerJ Preprints. 2017. 10.7287/peerj.preprints.3031v2.
- Söding J. Protein homology detection by HMM–HMM comparison. Bioinformatics. 2005:21:951–960. 10.1093/bioinformatics/bti125.
- Sandmann C-L et al. Evolutionary origins and interactomes of human, young microproteins and small peptides translated from short open reading frames. Mol Cell. 2023:83:994–1011. 10.1016/j.molcel.2023.01.023.
- Scalzitti N, Jeannin-Girardon A, Collet P, Poch O, Thompson JD. A benchmark study of ab initio gene prediction methods in diverse eukaryotic organisms. BMC Genomics. 2020:21:1–20. 10.1186/s12864-020-6707-9.
- Schlesinger D, Elsässer SJ. Revisiting sORFs: overcoming challenges to identify and characterize functional microproteins. FEBS J. 2022:289:53–74. 10.1111/febs.v289.1.
- Schlötterer C. Genes from scratch–the evolutionary fate of de novo genes. Trends Genet. 2015:31:215–219. 10.1016/j.tig.2015.02.007.
- Schmitz J, Brosius J. Exonization of transposed elements: a challenge and opportunity for evolution. Biochimie. 2011:93:1928–1934. 10.1016/j.biochi.2011.07.014.
- Schmitz JF, Ullrich KK, Bornberg-Bauer E. Incipient de novo genes can evolve from frozen accidents that escaped rapid transcript turnover. Nat Ecol Evol. 2018:2:1626–1632. 10.1038/s41559-018-0639-7.
- Schneider AL, Martins-Silva R, Kaizeler A, Saraiva-Agostinho N, Barbosa-Morais NL. voyAGEr, a free web interface for the analysis of age-related gene expression alterations in human tissues. Elife. 2024:12:88623. 10.7554/eLife.88623.
- Schoch CL et al. NCBI Taxonomy: a comprehensive update on curation, resources and tools. Database. 2020:2020:baaa062. 10.1093/database/baaa062.
- Singh U, Wurtele ES. orfipy: a fast and flexible tool for extracting ORFs. Bioinformatics. 2021:37:3019–3020. 10.1093/bioinformatics/btab090. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Slater GSC, Birney E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics. 2005:6:1–11. 10.1186/1471-2105-6-31. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Slavoff SA et al. Peptidomic discovery of short open reading frame–encoded peptides in human cells. Nat Chem Biol. 2013:9:59–64. 10.1038/nchembio.1120. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Sorek R. The birth of new exons: mechanisms and evolutionary consequences. RNA. 2007:13:1603–1608. 10.1261/rna.682507. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Stoesser G et al. The EMBL nucleotide sequence database. Nucleic Acids Res. 2002:30:21–26. 10.1093/nar/30.1.21. [DOI] [PMC free article] [PubMed] [Google Scholar]
- Tautz D, Domazet-Lošo T. The evolutionary origin of orphan genes. Nat Rev Genet. 2011:12:692–702. 10.1038/nrg3053. [DOI] [PubMed] [Google Scholar]
- Temin HM, Mizutani S. RNA-dependent DNA polymerase in virions of Rous sarcoma virus. Nature. 1970:226:1211–1213. 10.1038/2261211a0.
- Thomas KE, Gagniuc PA, Gagniuc E. Moonlighting genes harbor antisense ORFs that encode potential membrane proteins. Sci Rep. 2023:13:12591. 10.1038/s41598-023-39869-x.
- Toll-Riera M et al. Origin of primate orphan genes: a comparative genomics approach. Mol Biol Evol. 2009:26:603–612. 10.1093/molbev/msn281.
- Turcan A, Lee J, Wacholder A, Carvunis A-R. Integrative detection of genome-wide translation using iRibo. STAR Protoc. 2024:5:102826. 10.1016/j.xpro.2023.102826.
- Uz-Zaman MH, D’Alton S, Barrick JE, Ochman H. Promoter recruitment drives the emergence of proto-genes in a long-term evolution experiment with Escherichia coli. PLoS Biol. 2024:22:e3002418. 10.1371/journal.pbio.3002418.
- Vakirlis N, Acar O, Cherupally V, Carvunis A-R. Ancestral sequence reconstruction as a tool to detect and study de novo gene emergence. Genome Biol Evol. 2024:16:evae151. 10.1093/gbe/evae151.
- Vakirlis N, Carvunis A-R, McLysaght A. Synteny-based analyses indicate that sequence divergence is not the main source of orphan genes. Elife. 2020:9:e53500. 10.7554/eLife.53500.
- Vakirlis N et al. A molecular portrait of de novo genes in yeasts. Mol Biol Evol. 2018:35:631–645. 10.1093/molbev/msx315.
- Vakirlis N, Kupczok A. Large-scale investigation of species-specific orphan genes in the human gut microbiome elucidates their evolutionary origins. Genome Res. 2024:34:888–903. 10.1101/gr.278977.124.
- Vakirlis N, McLysaght A. Computational prediction of de novo emerged protein-coding genes. In: Computational methods in protein evolution. Springer; New York; 2018. p. 63–81.
- Vakirlis N, Vance Z, Duggan KM, McLysaght A. De novo birth of functional microproteins in the human lineage. Cell Rep. 2022:41:111808. 10.1016/j.celrep.2022.111808.
- Van Kempen M et al. Fast and accurate protein structure search with Foldseek. Nat Biotechnol. 2024:42:243–246. 10.1038/s41587-023-01773-0.
- Van Oss SB, Carvunis A-R. De novo gene birth. PLoS Genet. 2019:15:e1008160. 10.1371/journal.pgen.1008160.
- Vara C, Montañés JC, Albà MM. High polymorphism levels of de novo ORFs in a Yoruba human population. Genome Biol Evol. 2024:16:evae126. 10.1093/gbe/evae126.
- Varabyou A, Erdogdu B, Salzberg SL, Pertea M. Investigating open reading frames in known and novel transcripts using ORFanage. Nat Comput Sci. 2023:3:700–708. 10.1038/s43588-023-00496-1.
- Vitting-Seerup K, Porse BT, Sandelin A, Waage J. spliceR: an R package for classification of alternative splicing and prediction of coding potential from RNA-seq data. BMC Bioinformatics. 2014:15:1–7. 10.1186/1471-2105-15-81.
- Vonica A et al. Apcdd1 is a dual BMP/Wnt inhibitor in the developing nervous system and skin. Dev Biol. 2020:464:71–87. 10.1016/j.ydbio.2020.03.015.
- Wacholder A, Carvunis A-R. Biological factors and statistical limitations prevent detection of most noncanonical proteins by mass spectrometry. PLoS Biol. 2023:21:e3002409. 10.1371/journal.pbio.3002409.
- Wacholder A et al. A vast evolutionarily transient translatome contributes to phenotype and fitness. Cell Syst. 2023:14:363–381. 10.1016/j.cels.2023.04.002.
- Wallace EW, Airoldi EM, Drummond DA. Estimating selection on synonymous codon usage from noisy experimental data. Mol Biol Evol. 2013:30:1438–1453. 10.1093/molbev/mst051.
- Wang L, Park HJ, Dasari S, Wang S, Kocher J-P, Li W. CPAT: coding-potential assessment tool using an alignment-free logistic regression model. Nucleic Acids Res. 2013:41:e74. 10.1093/nar/gkt006.
- Wang Y et al. MCScanX: a toolkit for detection and evolutionary analysis of gene synteny and collinearity. Nucleic Acids Res. 2012:40:e49. 10.1093/nar/gkr1293.
- Wang Z, Chen Y, Li Y. A brief review of computational gene prediction methods. Genomics Proteomics Bioinformatics. 2004:2:216–221. 10.1016/S1672-0229(04)02028-5.
- Ward LD, Kellis M. Evidence of abundant purifying selection in humans for recently acquired regulatory functions. Science. 2012:337:1675–1678. 10.1126/science.1225057.
- Weisman CM. The origins and functions of de novo genes: against all odds? J Mol Evol. 2022:90:244–257. 10.1007/s00239-022-10055-3.
- Weisman CM, Murray AW, Eddy SR. Many, but not all, lineage-specific genes can be explained by homology detection failure. PLoS Biol. 2020:18:e3000862. 10.1371/journal.pbio.3000862.
- Weisman CM, Murray AW, Eddy SR. Mixing genome annotation methods in a comparative analysis inflates the apparent number of lineage-specific genes. Curr Biol. 2022:32:2632–2639.e2. 10.1016/j.cub.2022.04.085.
- Whiffin N et al. Characterising the loss-of-function impact of 5’ untranslated region variants in 15,708 individuals. Nat Commun. 2020:11:2523. 10.1038/s41467-019-10717-9.
- Wilson BA, Masel J. Putatively noncoding transcripts show extensive association with ribosomes. Genome Biol Evol. 2011:3:1245–1252. 10.1093/gbe/evr099.
- Wright BW, Yi Z, Weissman JS, Chen J. The dark proteome: translation from noncanonical open reading frames. Trends Cell Biol. 2022:32:243–258. 10.1016/j.tcb.2021.10.010.
- Wu D-D, Irwin DM, Zhang Y-P. De novo origin of human protein-coding genes. PLoS Genet. 2011:7:e1002379. 10.1371/journal.pgen.1002379.
- Wu Q, Wright M, Gogol MM, Bradford WD, Zhang N, Bazzini AA. Translation of small downstream ORFs enhances translation of canonical main open reading frames. EMBO J. 2020:39:e104763. 10.15252/embj.2020104763.
- Xia S, Chen J, Arsala D, Emerson J, Long M. Functional innovation through new genes as a general evolutionary process. Nat Genet. 2025:57:295–309. 10.1038/s41588-024-02059-0.
- Xu A et al. Transcriptomes of aging brain, heart, muscle, and spleen from female and male African turquoise killifish. Sci Data. 2023:10:695. 10.1038/s41597-023-02609-x.
- Xu H et al. Length of the ORF, position of the first AUG and the Kozak motif are important factors in potential dual-coding transcripts. Cell Res. 2010:20:445–457. 10.1038/cr.2010.25.
- Yang Z, Bielawski JP. Statistical methods for detecting molecular adaptation. Trends Ecol Evol. 2000:15:496–503. 10.1016/S0169-5347(00)01994-7.
- Zdobnov EM et al. Comparative genome and proteome analysis of Anopheles gambiae and Drosophila melanogaster. Science. 2002:298:149–159. 10.1126/science.1077061.
- Zhang L et al. Rapid evolution of protein diversity by de novo origination in Oryza. Nat Ecol Evol. 2019:3:679–690. 10.1038/s41559-019-0822-5.
- Zhao L, Saelao P, Jones CD, Begun DJ. Origin and spread of de novo genes in Drosophila melanogaster populations. Science. 2014:343:769–772. 10.1126/science.1248286.
- Zhao T, Schranz ME. Network approaches for plant phylogenomic synteny analysis. Curr Opin Plant Biol. 2017:36:129–134. 10.1016/j.pbi.2017.03.001.
- Zheng EB, Zhao L. Protein evidence of unannotated ORFs in Drosophila reveals diversity in the evolution and properties of young proteins. Elife. 2022:11:e78772. 10.7554/eLife.78772.
Data Availability Statement
All data and source code for the developed annotation format based on this work can be found at https://github.com/EDohmen/denofo.


