eLife. 2023 May 24;12:e82290. doi: 10.7554/eLife.82290

The impact of local genomic properties on the evolutionary fate of genes

Yuichiro Hara 1, Shigehiro Kuraku 2,3,4
Editors: Wenfeng Qian5, George H Perry6
PMCID: PMC10208646  PMID: 37223962

Abstract

Functionally indispensable genes are likely to be retained and otherwise to be lost during evolution. This evolutionary fate of a gene can also be affected by factors independent of gene dispensability, including the mutability of genomic positions, but such features have not been examined well. To uncover the genomic features associated with gene loss, we investigated the characteristics of genomic regions where genes have been independently lost in multiple lineages. With a comprehensive scan of gene phylogenies of vertebrates with a careful inspection of evolutionary gene losses, we identified 813 human genes whose orthologs were lost in multiple mammalian lineages: designated ‘elusive genes.’ These elusive genes were located in genomic regions with rapid nucleotide substitution, high GC content, and high gene density. A comparison of the orthologous regions of such elusive genes across vertebrates revealed that these features had been established before the radiation of the extant vertebrates approximately 500 million years ago. The association of human elusive genes with transcriptomic and epigenomic characteristics illuminated that the genomic regions containing such genes were subject to repressive transcriptional regulation. Thus, the heterogeneous genomic features driving gene fates toward loss have been in place and may sometimes have relaxed the functional indispensability of such genes. This study sheds light on the complex interplay between gene function and local genomic properties in shaping gene evolution that has persisted since the vertebrate ancestor.

Research organism: Human, Chimpanzee, Mouse, Chicken, Central bearded dragon, Tropical clawed frog, Coelacanth, Spotted gar, Bamboo shark

Introduction

In the course of evolution, genomes continue to retain most genes with occasional duplications, while losing some genes (Blomme et al., 2006; Fernández and Gabaldón, 2020; Shen et al., 2018). This retention and loss can be interpreted as gene fate; genes are stably retained in the genome, but some factors may cause them to transition to a state where deletion occurs. Accordingly, identification of the factors allowing gene loss may facilitate our understanding of gene fate. Gene retention or loss has generally been considered to depend largely on the functional importance of the particular gene from the perspective of molecular evolutionary biology (Albalat and Cañestro, 2016; Bartha et al., 2018; Blanc et al., 2012; Liu et al., 2015; Olson, 1999; Sharma et al., 2018; Shen et al., 2018). Genes with indispensable functions have usually been retained with highly conserved sequences in genomes through rapid elimination of alleles that impair gene functions (Hirsh and Fraser, 2001; Krylov et al., 2003; Miyata et al., 1980; Pál et al., 2006). On the contrary, genes with less important functions are likely to accept more mutations and structural variations, which can degrade the original functions, leading to gene loss through pseudogenization or genomic deletion (Jordan et al., 2002; Yang et al., 2003). To date, gene loss has been imputed to the relaxation of functional constraints of individual genes. Gene loss has further been revealed to drive phenotypic adaptation in various organisms (Albalat and Cañestro, 2016; Olson, 1999), as well as in a gene knockout collection of yeasts in culture (Giaever and Nislow, 2014; Maclean et al., 2017).

To uncover the association between fates and functional importance of the genes, molecular evolutionary analyses have been conducted at various scales, from gene-by-gene to genome-wide. A number of studies have revealed that the genes with reduced non-synonymous substitution rates (or KA values) and ratios of non-synonymous to synonymous substitution rates (KA/KS ratios) are less likely to be lost (Jordan et al., 2002; Yang et al., 2003). A genome-wide comparison of duplicated genes in yeast revealed larger KA values for those lost in multiple lineages than those retained by all the species investigated (Byrne and Wolfe, 2007). Other comprehensive studies of gene loss across metazoans and teleosts revealed that the genes expressed in the central nervous system are less prone to loss (Fernández and Gabaldón, 2020; Roux et al., 2017). These observations again suggest that gene fate depends on the functional constraints of a particular gene.

Besides functional constraints, several studies have identified the genes lost independently in multiple lineages, revealing that the genomic regions containing these genes ‘prefer’ particular characteristics associated with structural instability (Cortez et al., 2014; Hughes et al., 2012; Lewin et al., 2021; Maeso et al., 2016). In mammals, tandemly arrayed homeobox genes derived from the Crx gene family were lost in multiple species (Lewin et al., 2021; Maeso et al., 2016). The findings suggest that genomic features containing tandem duplications facilitate unequal crossing over, leading to frequent gene loss. Mammalian chromosome Y, which contains abundant repetitive elements and continues to reduce in size, has lost a considerable number of genes (Cortez et al., 2014; Hughes et al., 2012). In the stickleback genome, a Pitx1 enhancer was independently lost in multiple lineages inhabiting freshwater due to its genomic location in a structurally fragile site, leading to recurrent loss of pelvic fins (Xie et al., 2019). Genes and genomic elements in such particular regions may be prone to loss in a more neutral manner than the relaxation of functional importance or via functional adaptations. Accordingly, these studies focusing on the particular genomic regions led us to search for the common features in genomes that potentially facilitate gene loss. Genome-wide scans have revealed heterogeneous distributions of a variety of sequence and structural features so far, for example, base composition (Bernardi and Bernardi, 1986; Cohen et al., 2005; Katzman et al., 2011), the frequency of repetitive elements (Korenberg and Rykowski, 1988; Medstrand et al., 2002), and DNA-damage sensitivity induced by replication inhibitors (Debatisse et al., 2012; Helmrich et al., 2006). However, the extent to which these characteristics are associated with gene fates has not been understood well at a genome-wide level.

The accumulation of near-complete genome assemblies for various organisms facilitates comprehensive taxon-wide analysis of gene loss (Fernández and Gabaldón, 2020; Guijarro-Clarke et al., 2020; Rice and McLysaght, 2017). Along with this motivation, we recently performed a comprehensive analysis on the fate of paralogs generated via the two-round whole-genome duplications in early vertebrates (Hara et al., 2018a). The results revealed that the genes retained by reptiles but lost in mammals and Aves rapidly accumulated not only non-synonymous but also synonymous substitutions in comparison with the counterparts retained by almost all the vertebrates examined, indicating that those genes prone to loss show increasing mutation rates. Furthermore, these loss-prone genes were located in genomic regions with high GC contents, high gene densities, and high repetitive element frequencies. These findings suggest that the fates of those genes are influenced not only by functional constraints but also by intrinsic genomic characteristics. Because the findings were restricted to a set of particular genes, they prompted us to examine whether this trend is associated with gene fates on a genome-wide scale.

In this study, we inferred molecular phylogenies of vertebrate orthologs to systematically search for the genes harboring different fates in the human genome. We previously referred to the nature of genes prone to loss as ‘elusive’ (Hara et al., 2018a; Hara et al., 2018b). In this study, we define the elusive genes as those that are retained by modern humans but have been lost independently in multiple mammalian lineages. For comparison with the elusive genes, we retrieved the genes that were retained by almost all of the mammalian species examined and defined them as ‘non-elusive,’ representing those persistent in the genomes. We conducted a careful search for gene loss to reduce the false discovery rate (FDR), as false discoveries are usually caused by incomplete sequence information (Botero-Castro et al., 2017; Deutekom et al., 2019). By comparing the genomic regions containing these genes, we uncovered genomic characteristics relevant to gene loss. We associated the elusive genes with a variety of findings from deep sequencing analyses of the human genome, including transcriptomics, epigenomics, and genetic variations. These data helped us understand how intrinsic genomic features may affect gene fate, leading to gene loss by decreasing the expression level and eventually relaxing the functional importance of ‘elusive’ genes.

Results

Identification of human ‘elusive’ genes

We defined an ‘elusive’ gene as a human protein-coding gene that existed in the common mammalian ancestors but was lost independently in multiple mammalian lineages (Figure 1; see ‘Materials and methods’ for details). We searched for such genes by reconstructing phylogenetic trees of vertebrate orthologs and detecting gene loss events within the individual trees. To search for elusive genes, we paid close attention to distinguishing true evolutionary gene loss from falsely inferred gene loss caused by insufficient genome assembly, gene prediction, and orthologous clustering (Botero-Castro et al., 2017; Deutekom et al., 2019), as described below.

Figure 1. Detection of ‘elusive’ genes.

(a) Pipeline of ortholog group clustering and gene loss detection. (b) Definition of an elusive gene schematized with ortholog presence/absence pattern referring to a taxonomic hierarchy. Red and orange crosses denote the gene loss in the common ancestor of a taxon and the loss specific to a single species, respectively. (c) A representative phylogeny of the elusive gene encoding Chitinase 3-like 2 (CHI3L2). Taxa shown in the tree were used to investigate the presence or absence of orthologs. The Sciuromorpha, Hystricognathi, Eulipotyphla, Carnivora, and Chiroptera are absent from the tree, indicating that the CHI3L2 orthologs were lost somewhere along the branches framed in gray in the tree. In addition, the orthologs of many members of the Myomorpha were not found, suggesting that gene loss occurred in this lineage.

We first produced highly complete orthologous groups comprised of nearly complete gene sets. We merged multiple gene annotations of a single species followed by assessments of the completeness of the gene sets (Figure 1a). Using these gene sets, we then created two sets of ortholog groups with different methods and merged them into a single set (Figure 1a). In searching for gene loss events, we restricted our study to those that occurred in the common ancestors of particular taxonomic groups. This procedure relieved false identifications of gene loss in a species or an ancestor of a lower taxonomic hierarchy caused by incomplete genomic information (Figure 1b).

We integrated gene annotations from Ensembl, RefSeq, and the sequence repositories of individual genome sequencing projects to produce gene annotations for 114 mammalian and 132 non-mammalian vertebrates. From these, we selected the annotations of 101 and 90 species, respectively, that exhibited high completeness in the BUSCO assessment (Simão et al., 2015; Supplementary Table S1 in Supplementary file 1a). Using these gene sets, clustering of ortholog groups was conducted by OrthoFinder, and these groups were integrated into the ortholog groups provided by the Ensembl Gene Tree. This integration resulted in 50,768 vertebrate ortholog groups. Phylogenetic tree inference of the integrated ortholog groups and pruning of the individual trees based on gene duplications resulted in 17,495 mammalian ortholog groups that contained human genes. We classified the mammalian species into 15 taxonomic groups ranging from order to family (listed in Table S1; Supplementary file 1a). For the individual mammalian orthologs, we searched for the taxa in which the gene was absent in all the species examined (Figure 1b). We interpreted this gene absence as an evolutionary loss that occurred in the common ancestor of the taxon. Validating the gene loss through an ortholog search in genome assemblies and synteny-based ortholog annotations, we extracted the ortholog groups that were retained by humans but were lost independently in the common ancestors of at least two taxa (Figure 1c). Hereafter we call the human genes belonging to these ortholog groups ‘elusive genes.’ To compare these, we also selected the ortholog groups that contained all of the mammals examined including single-copy human genes. We called these ‘non-elusive genes.’ This comprehensive scan of gene phylogenies resulted in 813 elusive and 8050 non-elusive genes (Supplementary Table S2; Supplementary file 2).
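The classification rule described above can be sketched in a few lines of Python. This is only an illustrative simplification, not the authors' pipeline: the taxon and species names are toy placeholders, the function names `lost_taxa` and `classify` are hypothetical, and the sketch ignores the tree-based inference of independent losses, the single-copy requirement, and the validation against genome assemblies and synteny that the text describes.

```python
from typing import Dict, List, Set

# Toy taxonomy (hypothetical species lists; the study used 15 taxonomic groups).
TAXA: Dict[str, List[str]] = {
    "Primates": ["human", "chimpanzee"],
    "Myomorpha": ["mouse", "rat"],
    "Carnivora": ["dog", "cat"],
    "Chiroptera": ["bat_a", "bat_b"],
}


def lost_taxa(present: Set[str], taxa: Dict[str, List[str]]) -> List[str]:
    """Return the taxa in which no member species retains an ortholog."""
    return [
        taxon
        for taxon, members in taxa.items()
        if not any(species in present for species in members)
    ]


def classify(present: Set[str], taxa: Dict[str, List[str]], min_lost: int = 2) -> str:
    """Label one ortholog group from its species presence pattern.

    'elusive'     : retained by human, absent from every species of >= min_lost taxa
    'non-elusive' : present in every species examined
    'other'       : anything else (including groups without a human gene)
    """
    if "human" not in present:
        return "other"
    if len(lost_taxa(present, taxa)) >= min_lost:
        return "elusive"
    all_species = {s for members in taxa.values() for s in members}
    if all_species <= present:
        return "non-elusive"
    return "other"


if __name__ == "__main__":
    print(classify({"human", "chimpanzee", "mouse", "rat"}, TAXA))
```

Requiring absence from every species of a taxon, rather than from a single species, mirrors the text's rationale: a gap in one assembly is then less likely to be mistaken for an evolutionary loss.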

Genomic signatures of the human elusive genes

The loss-prone nature of the elusive genes suggests a relaxation of their functional constraints. To uncover the molecular evolutionary characteristics associated with each elusive gene, we computed synonymous and non-synonymous substitution rates in coding regions, namely KS and KA, respectively, between human genes and their chimpanzee and mouse orthologs for the elusive and non-elusive genes. In addition, we computed nucleotide substitution rates for introns (KI) between human and chimpanzee (Pan troglodytes) orthologs and compared them between the elusive and non-elusive genes. The results showed larger KA values in the ortholog pairs of the elusive genes than in those of the non-elusive genes (Figure 2a, Figure 2—figure supplement 1). This indicates a rapid accumulation of amino acid substitutions in the elusive genes, potentially accompanied by the relaxation of functional constraints. Our analysis further revealed larger KS and KI values for the elusive genes than for the non-elusive genes (Figure 2b and c, Figure 2—figure supplement 1). Importantly, the higher rate of synonymous and intronic nucleotide substitutions, which do not change amino acid residues, indicates that the elusive genes are also susceptible to genomic characteristics independent of selective constraints on gene functions.

Figure 2. Genomic and evolutionary characteristics of elusive genes.

Distributions of non-synonymous, synonymous, and intronic nucleotide substitution rates, namely KA (a), KS (b), and KI (c) values, respectively, between the human–chimpanzee orthologs of the elusive and non-elusive genes. Distribution of gene length (d) and GC content (e) of the human elusive and non-elusive genes. (f) Distribution of gene density in the genomic regions where the human elusive and non-elusive genes are located. The plots consist of 249 elusive and 5145 non-elusive genes that retained chimpanzee orthologs (a, b), 473 and 4626 of those which harbored introns aligned with the chimpanzee genome (c; see ‘Materials and methods’), and all of the 813 elusive and 8050 non-elusive genes (d–f). Diamonds and bars within violin plots indicate the median and range from the 25th to 75th percentile, respectively.

Figure 2—figure supplement 1. Comparison of KA and KS values between orthologs of the elusive and non-elusive genes.

Distributions of KA and KS values between the orthologs of human elusive and non-elusive genes of closely related vertebrates. Correction for multiple testing was performed for comparison in each species pair. Diamonds and bars within violin plots indicate the median and range from 25th to 75th percentile, respectively.

To further scrutinize the characteristics reflecting the genomic environment rather than gene function, we analyzed genomic characteristics that may distinguish the elusive from the non-elusive genes. A comparison between these two categories revealed shorter gene bodies and higher GC content in the elusive genes than in the non-elusive genes (Figure 2d and e). Furthermore, a scan of intragenomic gene distribution revealed that the elusive genes were located in genomic regions with higher gene density than the non-elusive genes (Figure 2f). Our findings indicate that such elusive genes have distinct characteristics in the human genome. These genomic characteristics, as well as high nucleotide substitution rates, were consistent with the findings in our genome analyses using the amniote and elasmobranch genomes (Hara et al., 2018a; Hara et al., 2018b).
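As a rough illustration of two of the metrics compared here, the sketch below computes GC content for a sequence and gene density around a genomic position. The window size, the use of gene start coordinates, and the function names are assumptions made for the example; this excerpt does not state how the authors defined the regions used for gene density.

```python
from bisect import bisect_left, bisect_right
from typing import List


def gc_content(seq: str) -> float:
    """Fraction of G or C among unambiguous bases (N and other symbols ignored)."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted


def gene_density(sorted_gene_starts: List[int], position: int,
                 window: int = 1_000_000) -> float:
    """Genes per Mb whose start lies within a window centred on `position`.

    `sorted_gene_starts` holds start coordinates of genes on one chromosome,
    in ascending order. The 1-Mb default window is an assumption.
    """
    half = window // 2
    lo = bisect_left(sorted_gene_starts, position - half)
    hi = bisect_right(sorted_gene_starts, position + half)
    return (hi - lo) / (window / 1_000_000)


if __name__ == "__main__":
    print(gc_content("GGCCAATT"))
    print(gene_density([100, 200, 900_000, 2_000_000], 200))
```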

Tracing elusiveness back along the vertebrate evolutionary tree

The origins of the human elusive genes can be traced back along the evolutionary tree, at least to the mammalian common ancestor. To investigate how ancient the genomic properties associated with elusive genes are, we investigated their orthologs in non-mammalian vertebrates by scrutinizing the ortholog groups used for elusive gene identification. We found that 152 out of 813 elusive genes originated in mammalian lineages, and this proportion was larger than that of the non-elusive genes (65 out of 8050, p = 2.50 × 10⁻¹¹⁰), indicating that recently born genes are more abundant among the elusive genes than among the non-elusive genes. We then selected 517 elusive and 7900 non-elusive genes that originated in the common ancestors of jawed vertebrates or earlier. These subsets allowed us to examine the degree of retention of non-mammalian vertebrate orthologs in the elusive and non-elusive genes. On average, approximately 40% of these elusive genes were found to be retained by non-mammalian vertebrates, while this proportion increased up to 90% for the non-elusive genes (Figure 3—figure supplement 1a). In the coelacanth, gar, and shark, the orthologs of the elusive genes were less frequently retained by all the species than those of the non-elusive ones (Figure 3—figure supplement 1b). The results suggest that the origins of the loss-prone propensity of the elusive genes potentially date back to the period long before the emergence of the Mammalia.
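The comparison of proportions above can be framed as a 2×2 contingency table (mammal-born versus older genes, elusive versus non-elusive). The sketch below implements a two-sided Fisher's exact test with exact integer arithmetic. It assumes this is the kind of test behind the reported p-value; the text names Fisher's exact test only for a different table, so the printed value is not guaranteed to match the published one.

```python
from fractions import Fraction
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> Fraction:
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same margins
    that are no more probable than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def weight(x: int) -> int:
        # Unnormalised hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x)

    observed = weight(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    tail = sum(w for w in (weight(x) for x in range(lo, hi + 1)) if w <= observed)
    return Fraction(tail, total)


if __name__ == "__main__":
    # Counts taken from the text: 152 of 813 elusive and 65 of 8050 non-elusive
    # genes originated in mammalian lineages.
    p = fisher_exact_two_sided(152, 813 - 152, 65, 8050 - 65)
    print(f"p = {float(p):.3g}")
```

Exact integers avoid the floating-point underflow that a naive product of probabilities would hit for p-values this small.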

We further examined the genomic characteristics associated with the human elusive genes in the vertebrate orthologs. In all the species examined, orthologs of the elusive genes exhibited high GC content and compact gene bodies. Additionally, in most of these species, the orthologs of elusive genes were located in genomic regions with high gene density compared with orthologs of the non-elusive genes (Figure 3, Figure 3—figure supplement 2). We also computed KS and KA values between the orthologs of the vertebrate species and their close relatives for elusive and non-elusive genes. In all of the species pairs except the avian pair, the orthologs of the elusive genes were found to harbor higher KA and KS values than the orthologs of the non-elusive genes (Figure 3, Figure 2—figure supplement 1). These observations indicate that these genomic characteristics probably originated before the emergence of gnathostomes, a monophyletic group of chondrichthyan and bony vertebrates, and have been retained for approximately 500 million years.

Figure 3. Long-standing characteristics of elusive genes.

Retention of the genomic and evolutionary characteristics of the human elusive genes across vertebrates. The individual round squares with arrows indicate significant increases or decreases of the distribution of particular characteristics in the orthologs of the human elusive genes and their flanking regions compared with those of the non-elusive genes in these selected vertebrate genomes. For the chimpanzee and mouse genomes, KA and KS values were computed between the human elusive genes and the orthologs of these mammals. For non-mammalian species, these values were computed with ortholog pairs for the elusive/non-elusive genes between the corresponding species and their closely related species: turkey for chicken, green anole for central bearded dragon, and whale shark for bamboo shark. Distributions of these metrics for non-human species are shown in Figure 2—figure supplement 1 and Figure 3—figure supplement 2. Species name: mouse, Mus musculus; chicken, Gallus gallus; central bearded dragon, Pogona vitticeps; Western clawed frog, Xenopus tropicalis; coelacanth, Latimeria chalumnae; spotted gar, Lepisosteus oculatus; bamboo shark, Chiloscyllium plagiosum.

Figure 3—figure supplement 1. Asymmetric ortholog retention across the vertebrates.

(a) Number of retained orthologs of the human elusive and non-elusive genes that originated in the common ancestors of the gnathostomes or earlier. (b) Intersections of the retained orthologs across three vertebrates distantly related to modern humans. The p-value of the 2×2 contingency table given by Fisher’s exact test is 9.70 × 10⁻⁴⁸.
Figure 3—figure supplement 1—source data 1. A 2×2 contingency table in Figure 3—figure supplement 1.
Figure 3—figure supplement 2. Genomic characteristics of the orthologs of elusive and non-elusive genes.

Distribution of (a) gene length and (b) GC content of the orthologs of the human elusive and non-elusive genes and (c) distribution of the gene density of the genomic regions where the orthologs of the human elusive and non-elusive genes are located. For the individual genomic characteristics, correction for multiple testing was performed for comparison in each species. Numbers of the orthologs of the elusive and non-elusive genes are indicated in Figure 3—figure supplement 1. Diamonds and bars within violin plots indicate the median and range from the 25th to 75th percentile, respectively.

Abundant polymorphism in elusive genes

The observation of large KS and KA values in the elusive genes prompted us to examine the extent to which these genes have accommodated genetic variations in modern humans. Large-scale human genome resequencing projects have identified a huge number of genetic variations, from rare to common, and from single-nucleotide variants (SNVs) to chromosome-scale structural variants, which makes it possible to address this question. We retrieved copy number variants (CNVs) and rare SNVs in the human genome from the Database of Genomic Variants, release 2016-08-31 (MacDonald et al., 2014) and dbSNP release 147 (Sherry et al., 2001), respectively, and computed their densities in the individual genic regions. We found that the genic regions of the human elusive genes contained abundant rare SNVs, as well as deletion and duplication CNVs, compared with those of the non-elusive genes (Figure 4a–c). This result suggests that genomic regions containing the elusive genes are prone not only to loss but also to duplication.

Figure 4. Genetic variations of the elusive and non-elusive genes within human populations.

Comparison of the density of rare single-nucleotide variants (SNVs) (a), deletion copy number variants (CNVs) (b), duplication CNVs (c), and Z-scores of synonymous (d), missense (e), and loss-of-function variants (f). We used opposite numbers of the Z-scores in d–f so that the elusive genes have higher values than non-elusive genes as in Figure 2a, b, c, e, f and Figure 3a–c. (a–c) 813 elusive genes and 8050 non-elusive genes were used. (d–f) 544 elusive genes and 7303 non-elusive genes for which genetic variants were available in gnomAD were used. Diamonds and bars within violin plots indicate the median and range from the 25th to 75th percentile, respectively.

To evaluate the functional consequences of abundant genetic variants in the elusive genes, we investigated genetic variations stored in the gnomAD v. 2.1 database, a repository containing >120,000 exome and >15,000 whole-genome sequences of human individuals (Karczewski et al., 2021). This database classifies SNVs in coding regions into three categories (synonymous, missense, and loss-of-function), and the loss-of-function category contains nonsense mutations, frameshift mutations, and mutations in splicing junctions. The gnomAD site computes a Z-score, an index representing the abundance of SNVs for individual genes; positive and negative values denote fewer and more mutations in a coding region than expected, respectively (Figure 4d–f). Accordingly, the Z-scores for missense mutations and loss-of-function mutations of the individual genes indicate the degree of natural selection: larger values indicate genes subjected to purifying selection, while smaller ones suggest functional relaxation. We found lower Z-scores of missense and loss-of-function mutations (higher opposite numbers of Z-scores in Figure 4e and f) in the human elusive genes than in the non-elusive genes, suggesting that the elusive genes are more functionally dispensable and potentially tolerant of harmful mutations. Additionally, opposite numbers of Z-scores of synonymous mutations of the human elusive genes were higher than those of the non-elusive genes (Figure 4d). This is consistent with the high mutability of genomic regions containing elusive genes, as observed in the KS values.

Transcriptomic natures of elusive genes

To further investigate how the human elusive genes have decreased functional essentiality, we examined their expression profiles. For this purpose, we compared gene expression profiles of the 54 adult tissues from the GTEx database v. 8 (The GTEx Consortium et al., 2020) between the elusive and non-elusive genes. For individual genes, we computed the maximum transcripts per million (TPM) value among these tissues as the expression quantity level. To quantify expression diversity, we employed Shannon’s diversity index H′, which is often utilized as an index of species diversity in the ecological literature, based on the proportion of TPM values across the 54 tissues.
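The two expression indicators can be sketched as follows. For a gene with TPM values t_1, ..., t_n across n tissues, the proportions are p_i = t_i / Σ t_j and H′ = −Σ p_i ln p_i, with zero-expression tissues contributing nothing. The natural logarithm and the handling of genes with no expression at all are assumptions of this sketch; the excerpt does not specify the log base or any pseudocount.

```python
import math
from typing import Sequence


def max_tpm(tpms: Sequence[float]) -> float:
    """Expression quantity: the largest TPM value across tissues."""
    return max(tpms)


def shannon_h(tpms: Sequence[float]) -> float:
    """Shannon's diversity index H' over per-tissue TPM proportions.

    Returns 0.0 for a gene with no detectable expression (an assumption;
    such genes could equally be treated as undefined).
    """
    total = sum(tpms)
    if total <= 0:
        return 0.0
    h = 0.0
    for value in tpms:
        if value > 0:
            p = value / total
            h -= p * math.log(p)
    return h


if __name__ == "__main__":
    ubiquitous = [5.0] * 54            # same level in all 54 tissues
    restricted = [0.0] * 53 + [40.0]   # expressed in a single tissue
    for name, profile in (("ubiquitous", ubiquitous), ("restricted", restricted)):
        print(name, max_tpm(profile), round(shannon_h(profile), 3))
```

Under these assumptions H′ ranges from 0 for a gene expressed in a single tissue to ln 54 for perfectly uniform expression, which puts the H′ < 1 cut-off used in the Figure 5 caption near the tissue-restricted end of the scale.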

As shown in the density scatter plots of the individual genes displaying these two indicators in Figure 5, most of the non-elusive genes possessed large maximum TPM and H′ values. Thus, most non-elusive genes are ubiquitously expressed at certain levels. By contrast, the density plot of the elusive genes displayed an additional high-density spot with small TPM and H′ values, indicating that the genes in this spot were not expressed, at least in adult tissues. The plot also showed another broad dense area of small H′ values, which contained the genes expressed in a single or a few tissues. A similar analysis was performed with the fetal single cell RNA-seq data (Cao et al., 2020), revealing that the averaged expression profiles of the elusive and non-elusive genes for the 172 cell types were concordant with those of the adult tissues (Figure 5). Our findings demonstrate that some elusive genes harbor low-level and spatially restricted expression profiles, that is, less pleiotropic states, which are rarely observed in the non-elusive genes.

Figure 5. Expression profiles of elusive and non-elusive genes.

The figure shows density scatter plots of the expression quantity and divergence of elusive and non-elusive genes. The numbers of the elusive/non-elusive genes and those for which the expression quantities were available are indicated in each panel. p-values were computed via 2 × 2 contingency tables presenting numbers of elusive and non-elusive genes with H′ < 1 and H′ ≥ 1. The median transcripts per million (TPM) value of each adult tissue across individuals was retrieved from the GTEx database (The GTEx Consortium et al., 2020), and normalized TPM values of the fetal cell types were retrieved from the Descartes database (Cao et al., 2020). For the individual genes, maximum TPM and Shannon’s H′ values were computed using these processed TPM values.

Figure 5—figure supplement 1. Expression profiles of the orthologs of the elusive and non-elusive genes for non-mammalian vertebrates.

Density scatter plots of the expression quantity and divergence of elusive and non-elusive genes. The total numbers of the elusive/non-elusive genes and the numbers of those for which expression data were available are indicated in each panel. p-values were computed via 2 × 2 contingency tables presenting numbers of elusive and non-elusive genes with H′ < 1 and H′ ≥ 1. Correction for multiple testing was performed for comparison in each species. The transcripts per million (TPM) values of the fetal cell types were retrieved from the Bgee database (Bastian et al., 2021). See the details in Figure 5.

Epigenetic nature of elusive genes

Our finding of the low-level and spatially restricted expression patterns of elusive genes prompted us to explore epigenetic properties involved in this transcriptional regulation. Therefore, we retrieved epigenetic data on a variety of human cell lines from a few regulatory genome databases including ENCODE, a repository that stores the comprehensive annotations of functional elements in the human genome (The ENCODE Project Consortium, 2012). Using this information, we characterized the epigenetic features of the genomic regions containing elusive genes (Figure 6).

Figure 6. Epigenetic features of the elusive genes.

Comparison of the distribution of ATAC-seq peak density (a), length of the topologically associating domains (TADs) including the elusive or non-elusive genes (b), the replication timing indicator based on Repli-seq (c), and overlap with the lamina-associated domains (LADs) computed from Lamin B1 ChIP-seq data (d). All of the analyses were performed by using publicly available processed sequencing data (Table S3; Supplementary file 1b). ATAC-seq and Hi-C were performed with A549 cells, Repli-seq was performed with HepG2 cells, and Lamin B1 ChIP-seq was performed with HAP-1 cells. In the elusive gene panels (orange bar), the purple bar indicates the elusive genes with restricted expression (H′ < 1; Figure 5). p-values for individual panels indicate the comparison between the elusive (813) and non-elusive (8050) genes and the one between the elusive genes with H′ < 1 (150) and those with H′ ≥ 1 (589). The results for other cells are shown in Figure 6—figure supplements 1–4. For the individual epigenetic characteristics, correction for multiple testing was performed for comparison in each cell culture.

Figure 6—figure supplement 1. ATAC-seq peak density of the elusive and non-elusive gene regions.

Comparison of the distribution of ATAC-seq peak density between the elusive and non-elusive genes across multiple cell types. In the elusive gene panels (orange bar), purple bar indicates the elusive genes with restricted expressions (H′ < 1; Figure 5). p-values for individual panels indicate the comparison between the elusive (813) and non-elusive (8050) genes and the one between the elusive genes with H′ < 1 (150) and those with H′ ≥ 1 (589). Correction for multiple testing was performed for comparison in each cell culture.
Figure 6—figure supplement 2. Sequence lengths of the topologically associating domains (TADs) containing elusive or non-elusive genes.

Comparison of the distribution of length of TADs including the elusive or non-elusive genes across multiple cell types. In the elusive gene panels (orange bar), purple bar indicates the elusive genes with restricted expressions (H′ < 1; Figure 5). p-values for individual panels indicate the comparison between the elusive (813) and non-elusive (8050) genes and the one between the elusive genes with H′ < 1 (150) and those with H′ ≥ 1 (589). Correction for multiple testing was performed for comparison in each cell culture.
Figure 6—figure supplement 3. Comparison of the replication timing indicator based on Repli-seq between the elusive and non-elusive genes.

Comparison of the distribution of replication timing indicator based on Repli-seq between the elusive and non-elusive genes across multiple cell types. In the elusive gene panels (orange bar), purple bar indicates the elusive genes with restricted expressions (H′ < 1; Figure 5). p-values for individual panels indicate the comparison between the elusive (813) and non-elusive (8050) genes and the one between the elusive genes with H′ < 1 (150) and those with H′ ≥ 1 (589). Correction for multiple testing was performed for comparison in each cell culture.
Figure 6—figure supplement 4. The fraction of elusive and non-elusive genes that overlap with lamina-associated domains (LADs).

Comparison of the frequency of overlap with LADs computed from Lamin B1 ChIP-seq data between the elusive and non-elusive genes across multiple datasets. In the elusive gene panels (orange bar), purple bar indicates the elusive genes with restricted expressions (H′ < 1; Figure 5). p-values for individual panels indicate the comparison between the elusive (813) and non-elusive (8050) genes and the one between the elusive genes with H′ < 1 (150) and those with H′ ≥ 1 (589).
Figure 6—figure supplement 5. ATAC-seq peak density of the chicken orthologs of the elusive and non-elusive gene regions.

Comparison of the distribution of ATAC-seq peak density between the orthologs of elusive (210) and non-elusive (7218) genes in the chicken genome. Correction for multiple testing was performed for comparison in each cell culture. Diamonds and bars within violin plots indicate the median and range from the 25th to 75th percentile, respectively.

We compared peak densities based on the Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq), an indicator of accessible chromatin regions in the genome, in gene bodies and flanking regions between the elusive and non-elusive genes. In all eight cell lines examined (11 samples in total), the genomic regions including the elusive genes had fewer ATAC-seq peaks than those including non-elusive genes, indicating that the elusive genes are likely to reside in inaccessible genomic regions (Figure 6a, Figure 6—figure supplement 1). We also searched for topologically associating domains (TADs), genomic elements with frequent physical self-interaction potentially acting as promoter-enhancer contacts (Rao et al., 2014), that included either the elusive or non-elusive genes. In all eleven cell lines investigated, a higher fraction of the elusive genes than of the non-elusive genes resided outside TADs (Figure 6b, Figure 6—figure supplement 2). Furthermore, the elusive genes were located in shorter TADs. These observations suggest that the elusive genes are less likely than the non-elusive genes to be regulated by distant regulatory elements (Figure 6b, Figure 6—figure supplement 2).

Our investigations extended to the association of the elusive genes with more global regulation of genomic structure. We compared the percentage-normalized signal of Repli-seq (Hansen et al., 2010), a high-throughput sequencing method for quantifying DNA replication timing as a function of genomic position, between the elusive and non-elusive genes. The results showed that elusive genes were prone to late replication in all of the 15 cell lines examined (Figure 6c, Figure 6—figure supplement 3). Late-replicating regions are frequently located at the nuclear periphery and often interact with the nuclear lamina. Therefore, we examined the nuclear position of the genomic regions including the elusive genes by referring to the lamina-associated domains (LADs) that were identified from the ChIP-seq reads for Lamin B1 (van Schaik et al., 2020; Zheng et al., 2018). Compared with the non-elusive genes, the elusive genes were found to be enriched in LADs in all four cell lines examined (Figure 6d, Figure 6—figure supplement 4), consistent with their late replication timing (van Steensel and Belmont, 2017).

We further investigated the association of the restricted expression of the elusive genes with epigenetic features. We classified the 739 elusive genes whose expression was quantified in the GTEx database into two groups based on pleiotropy in terms of gene expression: 589 elusive genes with Shannon’s diversity index H′ ≥ 1 were ubiquitously expressed (more pleiotropic), and 150 with H′ < 1 were expressed in only a few or none of the tissues examined (less pleiotropic) (Figure 5). Importantly, all four epigenetic features were more pronounced in the elusive genes with H′ < 1 than in those with H′ ≥ 1: sparse ATAC-seq peaks, short TADs, late replication timing, and significant overlap with LADs (Figure 6, Figure 6—figure supplements 1–4). This observation suggests that the low-level and spatially restricted expression of the elusive genes is associated with the epigenetic features of these genomic regions.

High GC content in genomic regions potentially hinders the identification of epigenetic features by short-read sequencing because sequence reads are underrepresented in amplification-based sequencing libraries. This bias might lead to sparse distributions of the ATAC-seq peaks and Hi-C contacts in the genomic regions that contain the elusive genes. However, only 3.00% and 9.00% of the elusive genes with H′ < 1 and H′ ≥ 1, respectively, were located in regions of extremely high GC content (>60%), showing that it is the elusive genes with H′ ≥ 1, rather than those with H′ < 1, that tend to include more genes with high GC content (p=0.0176). Thus, the depleted epigenomic features in the genomic regions containing elusive genes are unlikely to be false discoveries caused by a technical issue, namely the underrepresentation of the sequencing reads.

Elusive gene orthologs in the chicken microchromosomes

The heterogeneous locations of the elusive genes can also be examined from a chromosome-scale viewpoint (Figure 7, Figure 7—figure supplement 1). The visualization via chromosome ideograms indicated an overlap of the elusive genes with the genomic regions enriched for the genes whose chicken orthologs are on the microchromosomes (chromosomes 11–38 and W), and this trend was statistically supported (p=0.0175; Figure 7a). Indeed, microchromosomes of the chicken and other vertebrates exhibit genomic features including high GC content, high gene density, and rapid nucleotide substitutions in comparison with their macrochromosomes (Groenen et al., 2009; International Chicken Genome Sequencing Consortium, 2004; Schield et al., 2019; Waters et al., 2021), which also characterize genomic regions containing elusive genes. In contrast, previous studies revealed that the chicken microchromosomes are preferentially located in the A compartments of the nucleus (Perry et al., 2020) and are early replicating (McQueen et al., 1998). These characteristics of the microchromosomes are opposite to those of the human genomic regions preferentially containing the elusive genes.

Figure 7. Chromosomal distribution of human elusive genes.

Red and dark blue horizontal bars beside the chromosome ideogram represent the location of elusive genes with restricted expression (Shannon’s H′ < 1) and the other elusive genes, respectively. (a) The chromosome diagrams are colored according to the density of the genes that harbor chicken orthologs in microchromosomes (number of genes/Mb). 93 and 68 elusive genes were orthologous to the chicken genes in macro- and microchromosomes, respectively, and 4211 and 2078 non-elusive genes were orthologous to the chicken genes in macro- and microchromosomes, respectively. This indicates that the chicken orthologs of the elusive genes are abundant in microchromosomes compared with those of the non-elusive genes (p=0.0175). (b) Gray regions in the diagram indicate orthologous regions of microchromosomes in the ancestors of gnathostomes (Nakatani et al., 2021). 395 and 296 elusive genes were located in the genomic regions corresponding to ancient macro- and microchromosomes, respectively, and 5950 and 1929 non-elusive genes were located in the genomic regions corresponding to these ancient chromosomes. The result recapitulates the biased localization of the elusive genes on microchromosomes (p=9.50 × 10⁻²⁴). The chromosome diagrams were drawn using RIdeogram (Hao et al., 2020).

Figure 7—figure supplement 1. Distribution of elusive genes across human chromosomes.

Red and dark blue horizontal bars on the side of the chromosome diagram represent the location of elusive genes with restricted expression (Shannon’s H′ < 1) and the other elusive genes, respectively. (a) Karyotypes are shown by G-banding. Red regions indicate centromeres, acrocentric regions, and variable-length regions. (b) The chromosome diagrams are colored according to gene density (number of genes/Mb). The chromosome diagrams were drawn using RIdeogram (Hao et al., 2020).

We further analyzed the ATAC-seq peaks in the chicken genome. In four of the eight samples, we found more peaks in the genomic regions including the elusive gene orthologs than in those containing non-elusive gene orthologs, and in the remaining four samples the peak density did not differ significantly (Figure 6—figure supplement 5). These observations indicate that the chicken orthologs of the elusive genes are not epigenetically regulated in a way that reduces their expression level. This idea was further supported by a comparison of the expression profiles between the chicken orthologs of the elusive and non-elusive genes, which showed no significant differences between them (Figure 5—figure supplement 1). Our analyses indicate that the genomic features of the elusive genes, such as high GC content and high nucleotide substitution rates, do not always correlate with a reduction in pleiotropy of gene expression that potentially leads to an increase in functional dispensability in the course of vertebrate evolution. In addition, avian orthologs of the elusive genes did not show higher KA and KS values than those of the non-elusive genes (Figure 3, Figure 2—figure supplement 1), likely consistent with the absence of a significant difference in gene expression levels between them in these species (Figure 5—figure supplement 1; Cherry, 2010; Zhang and Yang, 2015). We further compared expression profiles between the orthologs of the human elusive and non-elusive genes in several non-mammalian vertebrates and found that the orthologs of the elusive genes tend to exhibit low pleiotropy in green anole, coelacanth, and gar but not in Western clawed frog. The result suggests that the low pleiotropy of the elusive genes has persisted at least since the bony vertebrate ancestors. With respect to the chicken genome, the ‘elusive’ features of the genes orthologous to human elusive genes might have been relaxed (that is, the functional importance of the orthologs has increased) during the evolution leading to chicken.

Discussion

Here we identified elusive genes that were lost in multiple lineages during mammalian evolution using a comprehensive scan of gene phylogenies. To identify gene loss events, absence of evidence (i.e. missing genes caused by incomplete genome assemblies and gene annotations) should be reviewed meticulously (Deutekom et al., 2019). Additionally, gene loss might be detected erroneously because of failure in similarity searches for orthologs of rapidly evolving genes (Moyers and Zhang, 2015). In this study, we aimed to reduce these false discoveries through a multifaceted approach (Figure 1). We selected species with highly complete gene annotations obtained by integrating multiple gene annotations. Using these improved gene annotations, we created orthologous groups by employing a highly sensitive homology search with MMSeqs2 (Steinegger and Söding, 2017) and merged them into those identified in the Ensembl database. Furthermore, we restricted the loss events to those observed as gene absence in all species examined within all hierarchical levels of the selected taxonomic groups (Figure 1b). Such an absence is more likely to reflect a gene loss in the common ancestor of the particular taxon than independent false discoveries of gene loss in the individual species. Genuinely continuous (e.g. telomere-to-telomere) genome assemblies are now available using modern sequencing technologies (Nurk et al., 2022). These genome assemblies may help relieve the labor of checking for missing information, thereby facilitating the identification of genuine gene loss in any given species.

In the human genome, the elusive genes and their flanking regions harbor particular characteristics, including high GC content and high gene density, that may have originated long before the emergence of mammals (Figure 3). Frequent synonymous variations across modern humans in the elusive genes, consistent with higher synonymous substitution rates between the vertebrate orthologs, suggest that the genomic regions including elusive genes have been subject to rapid evolution for approximately 500 million years (Figures 2 and 4). Our findings indicate that heterogeneous genomic characteristics potentially affect the fate of genes at the latest period of vertebrate evolution. Analyses with large numbers of germline mutations in the human genome have illustrated the heterogeneity of mutation rates (Campbell and Eichler, 2013; Seplyarskiy and Sunyaev, 2021; Terekhanova et al., 2017). High GC content in the elusive genes may have facilitated an elevation of the mutation rate, as observed in the enrichment of rare variants in high-GC regions in the human genome (Schaibley et al., 2013). In addition, some of the elusive genes appear to have retained particular epigenetic marks including sparse ATAC-seq peaks, late replication timing, and location within LADs (Figure 6—figure supplements 1–4); these epigenetic marks are relevant to an increase in the mutation rate. Genomic regions with late replication timing exhibit increased mutation rates because of their unstable structure during the S-phase of the cell cycle (Koren et al., 2012; Stamatoyannopoulos et al., 2009). LADs retain more G-to-T mutations because of their susceptibility to oxidative damage in the nuclear periphery resulting in high levels of 8-oxoguanine (Yoshihara et al., 2014). Close coordination of the studies on gene evolution with germline mutation repertoires and spectra, which can be approximated from the collection of de novo mutations obtained by trio sequencing, may further facilitate our understanding of gene fates driven by heterogeneous genomic features; this would be viewed as ‘mutation-driven’ evolution (Nei, 2013).

The epigenetic marks of elusive genes are relevant to the suppression of gene expression (van Steensel and Belmont, 2017), and indeed, these genes harbor weakened and spatially restricted expression profiles (Figures 5 and 6 and Figure 6—figure supplements 1–4). However, genomic regions carrying these epigenetic marks usually exhibit lower GC content and reduced gene density (Gilbert et al., 2004; Rao et al., 2014; van Steensel and Belmont, 2017). This discrepancy may be caused in part by a gain of local heterochromatin accompanied by suppression of the expression of transposable elements, as observed in various eukaryotic genomes (Choi and Lee, 2020; Fiston-Lavier et al., 2007; Grewal and Jia, 2007; Rangasamy, 2013; Slotkin and Martienssen, 2007; Underwood et al., 2017). Previous analyses showed frequent heterochromatinization of the human genomic regions where KRAB zinc finger genes colocalize with L1 retrotransposons (Imbeault et al., 2017; O’Geen et al., 2007; Vogel et al., 2006). One of these genomic regions, located in human chromosome region 19p12, also contains many elusive genes (Vogel et al., 2006; Figure 7). Closer attention to the local gene and repeat contents including repetitive elements and tandem gene clusters might facilitate our understanding of heterochromatinization in restricted genomic regions, although we excluded such gene clusters in our search for elusive genes (Figure 1a).

A chromosomal-scale view of the distribution of elusive genes illuminated their significant correlation with the genes whose chicken orthologs are located on microchromosomes (Figure 7a). More importantly, genomic regions rich in elusive genes were traced back to the microchromosomes of the ancestral gnathostomes by reconstructing chromosomes of the ancestral genomes (Figure 7b). This inference of ancestral karyotypes augments our observations that some elusive natures of genomic sequences have been retained for hundreds of millions of years (Figure 3). In other words, the result suggests that the disparity of genomic regions that allows the ‘elusiveness’ for the genes has been retained during vertebrate evolution. On the other hand, comparisons of the expression profiles between the orthologs of the elusive and non-elusive genes for non-mammalian vertebrates suggest that the orthologs of the elusive genes have been associated with a reduction in pleiotropy of gene expression since vertebrate ancestors but acquired the diverse expressions in chicken and frog (Figure 5—figure supplement 1). Additionally, in the chicken genome, the diverse expressions of the chicken orthologs of the human elusive genes may be correlated with the abundance of ATAC-seq peaks (Figure 6—figure supplement 5). These findings again suggest that the chicken orthologs of the human elusive genes have increased pleiotropy of gene expression, which may lead to a lineage-specific acquisition of functional indispensability. It should be noted that the choices of tissues used in these analyses were largely different between the human and non-mammalian vertebrates (Tables S3 and S4; Supplementary file 1b and c). The chicken ATAC-seq data could be obtained only from developing embryos, while the human ATAC-seq in ENCODE were performed with cell lines. Therefore, the aforementioned interpretation should be treated carefully.

Finally, we note the potential evolutionary courses that facilitate the transition of gene fate from retention to loss. One possible course is a decrease in essential functions because of rapid sequence evolution in local genomic regions. The elusive genes located in those genomic regions with rapidly evolving characteristics are likely to accumulate neutral or even moderately harmful mutations in coding regions frequently, resulting in impaired essential functions. Another factor is the spatiotemporal suppression of gene expression via epigenetic constraints. Previous studies showed that lowly expressed genes are associated with low functional essentiality (Cherry, 2010; Gout et al., 2010), as shown for elusive genes in our study. Elusive genes with reduced pleiotropy may have limited opportunities to function, potentially leading to loss of their important roles. The extent of these evolutionary forces may have varied with time and lineages, resulting in a patchy loss of elusive genes phylogenetically. Interestingly, a recent large-scale scan of de novo mutations in Arabidopsis indicates the association of mutation rates with epigenetic features and functional essentiality of genes (Monroe et al., 2022). Further investigation of the association of genes with the surrounding genomic regions in various taxa may provide a common understanding of genomic and epigenomic features that potentially alter the fate of genes. Although epigenetic features are plastic, our findings indicate that the disparities of genomic regions are reflected in the heterogeneity of evolutionary forces and have been retained for hundreds of millions of years. This idea prompts us to explore evolutionary constraints on more global genomic regions that are potentially associated with structural characteristics including chromosomal composition and locations within the nucleus.

Materials and methods

Sequence retrieval

We retrieved genome assemblies and gene annotations of 114 mammals and 132 non-mammal vertebrates from RefSeq (accessed on April 9, 2018), Ensembl release 92, and the repositories of the individual genome projects (Supplementary Table S1 in Supplementary file 1a). Gene annotations for a single species from multiple repositories were integrated into one as follows. When gene annotations of multiple repositories referred to the same version of the genome assembly, the annotation GTF files were merged with the ‘cuffcompare’ tool (Trapnell et al., 2012). Otherwise, translated amino acid sequences were clustered by CD-HIT v. 4.6.4 (Fu et al., 2012) with 100% sequence similarity, and the representative sequence for each cluster was retrieved by assuming that each cluster represented a single locus. Subsequently, we selected the canonical amino acid sequence for each locus: canonical peptides of the Ensembl genes were retrieved from the Ensembl database; for other resources, the longest amino acid sequence from the isoforms of a locus was chosen. The completeness of the gene annotations was assessed on the gVolante web server using BUSCO v.2 (Simão et al., 2015) by referring to the vertebrate ortholog sets provided by BUSCO and CVG (Hara et al., 2015). The gene annotations of mammals, birds, and ray-finned fishes that had fewer than 1% missing genes, as well as those of the other vertebrates with fewer than 3% missing genes, were selected. Exceptionally, the gene annotations of Gavialis gangeticus (Reptilia; CVG missing ratio 3.86%), Paroedura picta (Reptilia; BUSCO vertebrate ortholog missing rate 3.25%), and Scyliorhinus torazame (Chondrichthyes; BUSCO vertebrate ortholog missing rate 4.45%) were added. Finally, the amino acid sequence set of 90 mammals and 101 non-mammalian vertebrates was subjected to ortholog clustering. We also retrieved coding nucleotide sequences of the canonical amino acid sequences.
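The selection rule above can be written down as a small filter. The following Python sketch is only an illustration under stated assumptions: the function name, the group labels, and the use of a single missing-gene percentage per annotation are hypothetical (the text does not say how the BUSCO and CVG rates were combined), while the thresholds and the three exceptional species are taken from the paragraph above.

```python
from typing import Optional

# Groups held to the stricter (<1% missing) threshold, as stated in the text.
STRICT_GROUPS = {"mammal", "bird", "ray-finned fish"}

# Species the text says were added despite exceeding the 3% threshold.
EXCEPTIONS = {"Gavialis gangeticus", "Paroedura picta", "Scyliorhinus torazame"}


def passes_completeness(group: str, missing_pct: float,
                        species: Optional[str] = None) -> bool:
    """Return True if a gene annotation would be kept (hypothetical helper).

    group       -- coarse taxonomic label (labels here are assumptions)
    missing_pct -- percentage of reference orthologs reported as missing
    species     -- optional species name, used only for the listed exceptions
    """
    if species in EXCEPTIONS:
        return True
    threshold = 1.0 if group in STRICT_GROUPS else 3.0
    return missing_pct < threshold
```

For instance, `passes_completeness("reptile", 3.86)` is rejected by the threshold alone, whereas passing `species="Gavialis gangeticus"` keeps it because the text lists that species as an exception.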

Ortholog clustering and tree inference

We retrieved gene trees of human protein-coding genes and their homologs from Ensembl Gene Tree release 92. From these gene trees, we constructed an amino acid sequence set of the homologs consisting of the species selected in the above section. This sequence set, restricted to Ensembl sequences only, was used as the ‘backbone’ of the ortholog set of all the selected species. In addition, we generated ortholog groups for all the species used by employing OrthoFinder2 v. 2.3.3 (Emms and Kelly, 2019) based on the similarity of amino acid sequences: a sequence similarity search was performed using MMSeqs2 v. 2339462c06eab0bee64e4fc0ebebf7707f6e53fd (Steinegger and Söding, 2017). The Ensembl and OrthoFinder ortholog sets were then merged to create the united set of ortholog groups, yielding 50,768 vertebrate ortholog groups.

The integrated ortholog groups were then subjected to molecular phylogenetic analysis. Amino acid sequences of the individual groups were aligned with MAFFT v. 7.402 (Katoh and Standley, 2013), and ambiguous alignment sites were removed with trimAl v1.4 (Capella-Gutiérrez et al., 2009). Phylogenetic trees were inferred with IQ-Tree v. 1.6.6 (Nguyen et al., 2015) by selecting the optimal amino acid substitution model with ModelFinder (Kalyaanamoorthy et al., 2017) implemented in the IQ-Tree tool for each sequence alignment. In the inferred phylogenetic trees, ambiguously bifurcated nodes (those with branch lengths less than 0.0025) were collapsed into multifurcating nodes by the ‘di2multi’ function implemented in ape v. 5.5 (Paradis and Schliep, 2019). The trees were then rooted with the automatic rooting function ‘get_age_balanced_outgroup’ implemented in ete3 v. 3.1.1 (Huerta-Cepas et al., 2016) to minimize any discrepancy of tree topologies with the taxonomic hierarchy of the species included. Using the ortholog groups, the age of individual genes was estimated by inferring the oldest evolutionary lineage in the gene trees. We also adopted the gene age indicated by the Ensembl Gene Tree wherever it showed an older age.
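The collapsing of short internal branches can be illustrated with a minimal tree structure. The sketch below is plain Python, not the ape code used in the study: the `Node` class and `collapse_short_branches` function are hypothetical names, and the choice to simply drop the short branch without redistributing its length is an assumption of this sketch. Only the 0.0025 threshold comes from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: Optional[str] = None      # leaf label; None for internal nodes
    length: float = 0.0             # length of the branch leading to this node
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def collapse_short_branches(node: Node, tol: float = 0.0025) -> Node:
    """Collapse internal branches shorter than tol into multifurcations."""
    new_children: List[Node] = []
    for child in node.children:
        # Post-order traversal: resolve the subtree before judging its branch.
        collapse_short_branches(child, tol)
        if not child.is_leaf() and child.length < tol:
            # Drop the internal node and attach its children to the parent.
            new_children.extend(child.children)
        else:
            new_children.append(child)
    node.children = new_children
    return node
```

With a toy tree equivalent to `((A:0.1,B:0.2):0.001,C:0.3)`, the internal branch of length 0.001 is removed and the root ends up with the three children A, B, and C. Terminal branches are never collapsed, however short.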

Identification of elusive genes in the human genome

For the individual trees, orthologs of the human genes were detected by the ‘get_my_evol_events’ function in ete3 (Huerta-Cepas et al., 2007). This function inferred gene duplication nodes in the rooted trees, resulting in separation of the trees into 17,495 subtrees of mammalian ortholog groups containing human genes. The ortholog information was used to identify the species lacking orthologs of the human genes. This absence was further assessed by the ortholog annotation of human genes in the Ensembl Gene Tree database.

We selected taxonomic groups for the individual mammalian ortholog groups in which the orthologs were missing in all the species examined (Table S1; Supplementary file 1a). We restricted our study to gene losses that were likely to have occurred in the common ancestor of particular taxonomic groups, rather than those arising from the incompleteness of gene annotations. When a gene was missing in all the taxonomic groups in the same hierarchy, we considered that the gene was lost in the common ancestor of these groups. Finally, we found 1233 human genes belonging to the ortholog groups that were absent in two or more taxonomic groups and defined them as elusive genes. The gene loss events inferred by molecular phylogeny were further assessed by synteny-based ortholog annotations implemented in RefSeq, as well as a homolog search in the genome assemblies (Table S1; Supplementary file 1a) with TBLASTN v2.11.0+ (Altschul et al., 1997) and MMSeqs2 (Steinegger and Söding, 2017) referring to the latest RefSeq gene annotations (last accessed on December 2, 2022). This procedure resulted in the identification of 813 elusive genes that harbored three or fewer duplicates. Similarly, we extracted 8050 human genes whose orthologs were found in all the mammalian species examined and defined them as non-elusive genes. Because these elusive and non-elusive genes were identified in the GRCh38 human genome assembly, we performed the following analyses using this assembly.
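The core presence/absence rule described above can be sketched as follows. This is an illustrative Python sketch, not the deposited scripts: the function names and the toy taxa are hypothetical, the taxonomic groups are assumed to be non-overlapping (the hierarchical merging of nested groups is not modeled), and the later synteny, TBLASTN/MMSeqs2, and duplicate-count filters are omitted.

```python
from typing import Dict, List, Set, Tuple


def lost_taxa(present: Set[str], taxa: Dict[str, Set[str]]) -> List[str]:
    """Return taxa in which none of the examined species has an ortholog."""
    return [name for name, species in taxa.items()
            if species and not (species & present)]


def classify_gene(present: Set[str], taxa: Dict[str, Set[str]],
                  min_lost_taxa: int = 2) -> Tuple[str, List[str]]:
    """Classify one human gene from the set of species carrying its ortholog."""
    lost = lost_taxa(present, taxa)
    if len(lost) >= min_lost_taxa:
        return "elusive", lost          # absent in two or more taxonomic groups
    all_species = set().union(*taxa.values())
    if all_species <= present:
        return "non-elusive", lost      # ortholog found in every species examined
    return "unclassified", lost


# Toy input with placeholder taxon and species names.
TOY_TAXA = {
    "taxonA": {"sp1", "sp2"},
    "taxonB": {"sp3", "sp4"},
    "taxonC": {"sp5"},
}
```

In this toy setting, a gene whose orthologs are found only in `sp1` and `sp2` is absent from all species of two taxa and is labeled elusive, whereas a gene missing from `sp4` alone is left unclassified because `taxonB` still retains it in `sp3`.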

Extraction of genomic and molecular evolutionary characteristics

We calculated the GC content of a gene by using its genomic region including introns and untranslated regions (UTRs). To calculate individual gene densities, we extracted genomic regions containing the genes and their flanking three genes at both ends and divided them by seven. The orthologs of the elusive and non-elusive genes were retrieved from the aforementioned gene trees. We computed KA and KS values of the ortholog pairs of human–chimpanzee, human–mouse, chicken–turkey (Meleagris gallopavo), central bearded dragon–green anole (Anolis carolinensis), and bamboo shark–whale shark (Rhincodon typus). To achieve this, we extracted ortholog groups that contained at least three of these ortholog pairs. Amino acid sequences of the human and the orthologs were aligned using MAFFT. Nucleotide sequence alignments of the coding regions were generated by ‘back-translation’ of the amino acid sequence alignments by trimAl, simultaneously removing ambiguous alignment sites. By employing coding nucleotide sequence alignments, numbers of synonymous and non-synonymous substitutions per site were computed using PAML v. 4.9a (Yang, 2007). To compute nucleotide sequence differences of the individual introns, we extracted 473 elusive and 4626 non-elusive genes that harbored introns aligned with the chimpanzee genome assembly. The nucleotide differences were calculated via the whole genome alignments of hg38 and panTro6 retrieved from the UCSC genome browser.
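The idea behind ‘back-translation’ is to thread each unaligned coding sequence onto its gapped amino acid alignment row so that alignment columns correspond to whole codons. The following Python sketch shows that idea only; it is not trimAl's implementation, the function name is hypothetical, and it does not perform the removal of ambiguous sites mentioned above.

```python
def back_translate(aligned_protein: str, cds: str) -> str:
    """Thread an unaligned CDS onto one gapped protein alignment row.

    Each residue consumes the next three nucleotides of the CDS and each
    alignment gap ('-') becomes a codon-sized gap ('---').
    """
    n_residues = sum(1 for aa in aligned_protein if aa != "-")
    if len(cds) < 3 * n_residues:
        raise ValueError("CDS is shorter than the protein it should encode")
    codons = []
    pos = 0
    for aa in aligned_protein:
        if aa == "-":
            codons.append("---")
        else:
            codons.append(cds[pos:pos + 3])
            pos += 3
    # Trailing nucleotides (e.g. a stop codon) are left out of the alignment.
    return "".join(codons)
```

For example, the aligned row `M-K` with the coding sequence `ATGAAA` yields `ATG---AAA`. The sketch assumes that the coding sequence is in frame and that residue order matches codon order; it does not check that codons actually translate to the aligned residues.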

Multiomics analysis

Common and rare SNVs of the human populations were retrieved from dbSNP release 147 (Sherry et al., 2001), and human CNVs were obtained from the Database of Genomic Variants (DGV) release 2016-08-31 (MacDonald et al., 2014). The CNVs were classified into duplication and deletion variants, according to the annotation in DGV. The density of these variants in a gene was computed by dividing the number of variants identified in a gene region by its sequence length. Z-scores, indices of the tolerance against mutations, of synonymous, missense, and loss-of-function mutations of the individual genes were retrieved from gnomAD v. 2.1.1 (Karczewski et al., 2021).

Gene expression quantifications of adult and fetal tissues were retrieved from public databases. Expression profiles of adult tissues were obtained from the GTEx database v. 8 (The GTEx Consortium et al., 2020), computed by averaging TPM values across individuals. Expression profiles of fetal tissues were obtained from the Developmental Single Cell Atlas of gene Regulation and Expression (Descartes) portal (Cao et al., 2020) by calculating averaged TPM values of single cells. The maximum TPM values of the individual genes among the tissues were taken as the representative gene expression levels. As a proxy of the spatial diversity of gene expression, Shannon’s species diversity index (H′ values) was computed for each gene using the following equation:

$$H'_i = -\sum_{k=1}^{R} p_{i,k} \ln p_{i,k}$$

where $H'_i$ represents the Shannon’s index of the $i$th gene in the list of the human genes, $p_{i,k}$ represents the proportion of the TPM values of the $i$th gene in the $k$th tissue/cell type, and $R$ denotes the total number of tissues/cell types examined.
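As a minimal sketch of this calculation, the Python function below computes H′ for one gene from its per-tissue TPM values, together with the maximum TPM used as the representative expression level. The function names are hypothetical; treating tissues with zero TPM as contributing nothing to the sum, and returning 0 for a gene with no expression in any tissue, are assumptions of this sketch rather than details given in the text.

```python
import math
from typing import Sequence


def shannon_index(tpm_values: Sequence[float]) -> float:
    """Shannon's H' over the tissue-wise proportions of one gene's TPM values."""
    total = sum(tpm_values)
    if total <= 0:
        return 0.0                      # assumption: unexpressed gene gets H' = 0
    h = 0.0
    for value in tpm_values:
        if value > 0:                   # p * ln(p) tends to 0 as p tends to 0
            p = value / total
            h -= p * math.log(p)        # natural logarithm, as in the equation
    return h


def representative_expression(tpm_values: Sequence[float]) -> float:
    """Maximum TPM across tissues, used as the gene's expression level."""
    return max(tpm_values)
```

A gene expressed equally in R tissues reaches the maximum H′ = ln R, and a gene expressed in a single tissue gives H′ = 0, so the H′ < 1 class used in the text corresponds to expression concentrated in few tissues.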

The ATAC-seq peaks and TAD boundaries of the human primary cells and cultured cell lines were retrieved from the ENCODE 3 repository (accession IDs listed in Table S3; Supplementary file 1b; The ENCODE Project Consortium, 2012). Wavelet-smoothed signals of the ENCODE Repli-seq data were obtained from the UCSC genome browser (Hansen et al., 2010). The 20 kb bin-associated domains of LAD-seq that employed Lamin B1 antibodies (van Schaik et al., 2020) were retrieved from the 4D Nucleome Data Portal.

We also compared expression profiles and ATAC-seq peak densities between the orthologs of the elusive and non-elusive genes in non-mammalian vertebrates in the same way as for the human datasets. Normalized gene expression profiles from RNA-seq data of normal adult tissues and early embryos for chicken, green anole, Western clawed frog, coelacanth, and spotted gar were obtained from the Bgee version 15 database (Bastian et al., 2021; Table S4; Supplementary file 1c). ATAC-seq narrow peak signals of chicken tissues were retrieved from NCBI GEO (Table S4; Supplementary file 1c), and their coordinates were converted to the galGal5 genome assembly with the UCSC liftOver tool (Hinrichs et al., 2006) as needed.

Code availability

The scripts for inferring gene presence and absence from gene trees were deposited in GitHub (https://github.com/yuichiroharajpn/ElusiveGenes, copy archived at Hara, 2022).

Statistical tests

Comparisons of the genomic characteristics between the elusive and non-elusive genes were tested statistically with the nonparametric Mann–Whitney U test and Fisher’s exact test implemented in R. Correction for multiple testing was performed using the Benjamini–Hochberg FDR approach. We considered p < 0.05 to be statistically significant.
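Two of these procedures can be illustrated with a short, standard-library Python sketch. It is not the R code used in the study: the function names are hypothetical, the Fisher test is implemented in its common two-sided form (summing the probabilities of all tables no more likely than the observed one), and whether the reported tests were one- or two-sided is not stated, so the sketch is not expected to reproduce the published p-values exactly. The Mann–Whitney U test is not covered.

```python
from math import comb
from typing import List, Sequence


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x counts in the top-left cell.
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    lo = max(0, col1 - (n - row1))
    hi = min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    total = sum(p for p in map(prob, range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):        # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Counts from the Figure 7a caption: macro- vs microchromosome orthologs.
p_micro = fisher_exact_two_sided(93, 68, 4211, 2078)
```

Here `p_micro` tests whether the chicken orthologs of elusive genes (93 macro, 68 micro) are distributed differently from those of non-elusive genes (4211 macro, 2078 micro). A set of per-cell-line p-values could then be passed to `benjamini_hochberg` and compared against the 0.05 threshold.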

Acknowledgements

We would like to thank Yoichiro Nakatani for providing the information of orthologous regions of ancestral chromosomes in the human genome, and Hideya Kawaji and Ichiro Hiratani for insightful comments. We also would like to thank the peer reviewers for their constructive feedback and insightful comments, which have significantly improved this paper. This work was supported by RIKEN to SK, JSPS KAKENHI under Grant Number 20H03269 to SK and 21K06132 to YH, AMED under Grant Number JP21wm0325050 to YH, and Mochida Memorial Foundation for Medical and Pharmaceutical Research to YH. Computations were partially performed on the NIG supercomputer at the ROIS National Institute of Genetics.

Funding Statement

The funders had no role in study design, data collection and interpretation, or the decision to submit the work for publication.

Contributor Information

Yuichiro Hara, Email: hara-yi@igakuken.or.jp.

Wenfeng Qian, Chinese Academy of Sciences, China.

George H Perry, Pennsylvania State University, United States.

Funding Information

This paper was supported by the following grants:

  • Japan Society for the Promotion of Science KAKENHI to Yuichiro Hara.

  • Japan Society for the Promotion of Science KAKENHI to Shigehiro Kuraku.

  • Mochida Memorial Foundation for Medical and Pharmaceutical Research to Yuichiro Hara.

  • Japan Society for the Promotion of Science 21K06132 to Yuichiro Hara.

  • Japan Society for the Promotion of Science 20H03269 to Shigehiro Kuraku.

  • Japan Agency for Medical Research and Development JP21wm0325050 to Yuichiro Hara.

Additional information

Competing interests

Yuichiro Hara: No competing interests declared.

Shigehiro Kuraku: Reviewing editor, eLife.

Author contributions

Yuichiro Hara: Conceptualization, Data curation, Funding acquisition, Writing – original draft, Writing – review and editing, Investigation, Methodology.

Shigehiro Kuraku: Conceptualization, Supervision, Writing – review and editing, Funding acquisition.

Additional files

Supplementary file 1. Supplementary Tables S1, S3, S4.

(a) Supplementary Table S1. Vertebrate species used for creating gene phylogenies. (b) Supplementary Table S3. ENCODE accession ID list used for epigenomic analyses. (c) Supplementary Table S4. RNA-seq and ATAC-seq samples of non-mammalian vertebrates.

elife-82290-supp1.xlsx (38KB, xlsx)
Supplementary file 2. Supplementary Table S2.

Characteristics of the elusive and non-elusive genes in the human genome.

elife-82290-supp2.zip (1.8MB, zip)
MDAR checklist

Data availability

The current manuscript is a computational study, so no data have been generated for this manuscript. Data from the ENCODE Project were used and are available at https://www.encodeproject.org; the accession IDs are listed in Table S3 in Supplementary file 1.

The following previously published dataset was used:

van Schaik T, Vos M, Peric-Hupkes D, Hn Celie P, van Steensel B. 2020. Data from: Cell cycle dynamics of lamina-associated DNA. 4D Nucleome Data Portal. f1218a92-1f37-4519-85d6-ccedd5f7ad39

References

  1. Albalat R, Cañestro C. Evolution by gene loss. Nature Reviews. Genetics. 2016;17:379–391. doi: 10.1038/nrg.2016.39. [DOI] [PubMed] [Google Scholar]
  2. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research. 1997;25:3389–3402. doi: 10.1093/nar/25.17.3389. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Bartha I, di Iulio J, Venter JC, Telenti A. Human gene essentiality. Nature Reviews. Genetics. 2018;19:51–62. doi: 10.1038/nrg.2017.75. [DOI] [PubMed] [Google Scholar]
  4. Bastian FB, Roux J, Niknejad A, Comte A, Fonseca Costa SS, de Farias TM, Moretti S, Parmentier G, de Laval VR, Rosikiewicz M, Wollbrett J, Echchiki A, Escoriza A, Gharib WH, Gonzales-Porta M, Jarosz Y, Laurenczy B, Moret P, Person E, Roelli P, Sanjeev K, Seppey M, Robinson-Rechavi M. The Bgee suite: integrated curated expression atlas and comparative transcriptomics in animals. Nucleic Acids Research. 2021;49:D831–D847. doi: 10.1093/nar/gkaa793. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Bernardi G, Bernardi G. Compositional constraints and genome evolution. Journal of Molecular Evolution. 1986;24:1–11. doi: 10.1007/BF02099946. [DOI] [PubMed] [Google Scholar]
  6. Blanc G, Agarkova I, Grimwood J, Kuo A, Brueggeman A, Dunigan DD, Gurnon J, Ladunga I, Lindquist E, Lucas S, Pangilinan J, Pröschold T, Salamov A, Schmutz J, Weeks D, Yamada T, Lomsadze A, Borodovsky M, Claverie J-M, Grigoriev IV, Van Etten JL. The genome of the polar eukaryotic microalga Coccomyxa subellipsoidea reveals traits of cold adaptation. Genome Biology. 2012;13:R39. doi: 10.1186/gb-2012-13-5-r39. [DOI] [PMC free article] [PubMed] [Google Scholar]
  7. Blomme T, Vandepoele K, De Bodt S, Simillion C, Maere S, Van de Peer Y. The gain and loss of genes during 600 million years of vertebrate evolution. Genome Biology. 2006;7:R43. doi: 10.1186/gb-2006-7-5-r43. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Botero-Castro F, Figuet E, Tilak MK, Nabholz B, Galtier N. Avian genomes revisited: hidden genes uncovered and the rates versus traits paradox in birds. Molecular Biology and Evolution. 2017;34:3123–3131. doi: 10.1093/molbev/msx236. [DOI] [PubMed] [Google Scholar]
  9. Byrne KP, Wolfe KH. Consistent patterns of rate asymmetry and gene loss indicate widespread neofunctionalization of yeast genes after whole-genome duplication. Genetics. 2007;175:1341–1350. doi: 10.1534/genetics.106.066951. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Campbell CD, Eichler EE. Properties and rates of germline mutations in humans. Trends in Genetics. 2013;29:575–584. doi: 10.1016/j.tig.2013.04.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Cao J, O’Day DR, Pliner HA, Kingsley PD, Deng M, Daza RM, Zager MA, Aldinger KA, Blecher-Gonen R, Zhang F, Spielmann M, Palis J, Doherty D, Steemers FJ, Glass IA, Trapnell C, Shendure J. A human cell atlas of fetal gene expression. Science. 2020;370:eaba7721. doi: 10.1126/science.aba7721. [DOI] [PMC free article] [PubMed] [Google Scholar]
  12. Capella-Gutiérrez S, Silla-Martínez JM, Gabaldón T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 2009;25:1972–1973. doi: 10.1093/bioinformatics/btp348. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Cherry JL. Expression level, evolutionary rate, and the cost of expression. Genome Biology and Evolution. 2010;2:757–769. doi: 10.1093/gbe/evq059. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Choi JY, Lee YCG. Double-edged sword: the evolutionary consequences of the epigenetic silencing of transposable elements. PLOS Genetics. 2020;16:e1008872. doi: 10.1371/journal.pgen.1008872. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Cohen N, Dagan T, Stone L, Graur D. GC composition of the human genome: in search of isochores. Molecular Biology and Evolution. 2005;22:1260–1272. doi: 10.1093/molbev/msi115. [DOI] [PubMed] [Google Scholar]
  16. Cortez D, Marin R, Toledo-Flores D, Froidevaux L, Liechti A, Waters PD, Grützner F, Kaessmann H. Origins and functional evolution of Y chromosomes across mammals. Nature. 2014;508:488–493. doi: 10.1038/nature13151. [DOI] [PubMed] [Google Scholar]
  17. Debatisse M, Le Tallec B, Letessier A, Dutrillaux B, Brison O. Common fragile sites: mechanisms of instability revisited. Trends in Genetics. 2012;28:22–32. doi: 10.1016/j.tig.2011.10.003. [DOI] [PubMed] [Google Scholar]
  18. Deutekom ES, Vosseberg J, van Dam TJP, Snel B. Measuring the impact of gene prediction on gene loss estimates in eukaryotes by quantifying falsely inferred absences. PLOS Computational Biology. 2019;15:e1007301. doi: 10.1371/journal.pcbi.1007301. [DOI] [PMC free article] [PubMed] [Google Scholar]
  19. Emms DM, Kelly S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biology. 2019;20:238. doi: 10.1186/s13059-019-1832-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  20. Fernández R, Gabaldón T. Gene gain and loss across the metazoan tree of life. Nature Ecology & Evolution. 2020;4:524–533. doi: 10.1038/s41559-019-1069-x. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Fiston-Lavier AS, Anxolabehere D, Quesneville H. A model of segmental duplication formation in Drosophila melanogaster. Genome Research. 2007;17:1458–1470. doi: 10.1101/gr.6208307. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28:3150–3152. doi: 10.1093/bioinformatics/bts565. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Giaever G, Nislow C. The yeast deletion collection: a decade of functional genomics. Genetics. 2014;197:451–465. doi: 10.1534/genetics.114.161620. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Gilbert N, Boyle S, Fiegler H, Woodfine K, Carter NP, Bickmore WA. Chromatin architecture of the human genome: gene-rich domains are enriched in open chromatin fibers. Cell. 2004;118:555–566. doi: 10.1016/j.cell.2004.08.011. [DOI] [PubMed] [Google Scholar]
  25. Gout JF, Kahn D, Duret L, Pritchard JK, Paramecium Post-Genomics Consortium. The relationship among gene expression, the evolution of gene dosage, and the rate of protein evolution. PLOS Genetics. 2010;6:e1000944. doi: 10.1371/journal.pgen.1000944. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Grewal SIS, Jia S. Heterochromatin revisited. Nature Reviews. Genetics. 2007;8:35–46. doi: 10.1038/nrg2008. [DOI] [PubMed] [Google Scholar]
  27. Groenen MAM, Wahlberg P, Foglio M, Cheng HH, Megens HJ, Crooijmans R, Besnier F, Lathrop M, Muir WM, Wong GKS, Gut I, Andersson L. A high-density SNP-based linkage map of the chicken genome reveals sequence features correlated with recombination rate. Genome Research. 2009;19:510–519. doi: 10.1101/gr.086538.108. [DOI] [PMC free article] [PubMed] [Google Scholar]
  28. Guijarro-Clarke C, Holland PWH, Paps J. Widespread patterns of gene loss in the evolution of the animal kingdom. Nature Ecology & Evolution. 2020;4:519–523. doi: 10.1038/s41559-020-1129-2. [DOI] [PubMed] [Google Scholar]
  29. Hansen RS, Thomas S, Sandstrom R, Canfield TK, Thurman RE, Weaver M, Dorschner MO, Gartler SM, Stamatoyannopoulos JA. Sequencing newly replicated DNA reveals widespread plasticity in human replication timing. PNAS. 2010;107:139–144. doi: 10.1073/pnas.0912402107. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Hao Z, Lv D, Ge Y, Shi J, Weijers D, Yu G, Chen J. RIdeogram: drawing SVG graphics to visualize and map genome-wide data on the idiograms. PeerJ. Computer Science. 2020;6:e251. doi: 10.7717/peerj-cs.251. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Hara Y, Tatsumi K, Yoshida M, Kajikawa E, Kiyonari H, Kuraku S. Optimizing and benchmarking de novo transcriptome sequencing: from library preparation to assembly evaluation. BMC Genomics. 2015;16:977. doi: 10.1186/s12864-015-2007-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Hara Y, Takeuchi M, Kageyama Y, Tatsumi K, Hibi M, Kiyonari H, Kuraku S. Madagascar ground gecko genome analysis characterizes asymmetric fates of duplicated genes. BMC Biology. 2018a;16:40. doi: 10.1186/s12915-018-0509-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  33. Hara Y, Yamaguchi K, Onimaru K, Kadota M, Koyanagi M, Keeley SD, Tatsumi K, Tanaka K, Motone F, Kageyama Y, Nozu R, Adachi N, Nishimura O, Nakagawa R, Tanegashima C, Kiyatake I, Matsumoto R, Murakumo K, Nishida K, Terakita A, Kuratani S, Sato K, Hyodo S, Kuraku S. Shark genomes provide insights into elasmobranch evolution and the origin of vertebrates. Nature Ecology & Evolution. 2018b;2:1761–1771. doi: 10.1038/s41559-018-0673-5. [DOI] [PubMed] [Google Scholar]
  34. Hara Y. ElusiveGenes. Software Heritage. 2022. swh:1:rev:4c4d279c77c838ec6acb86032bdb51c514c3cc60. https://archive.softwareheritage.org/swh:1:dir:67291d403a94399fcb9479be45d35fee4585becc;origin=https://github.com/yuichiroharajpn/ElusiveGenes;visit=swh:1:snp:d53bb1fab81dbd98069b50871fd11e474fe72156;anchor=swh:1:rev:4c4d279c77c838ec6acb86032bdb51c514c3cc60
  35. Helmrich A, Stout-Weider K, Hermann K, Schrock E, Heiden T. Common fragile sites are conserved features of human and mouse chromosomes and relate to large active genes. Genome Research. 2006;16:1222–1230. doi: 10.1101/gr.5335506. [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Hinrichs AS, Karolchik D, Baertsch R, Barber GP, Bejerano G, Clawson H, Diekhans M, Furey TS, Harte RA, Hsu F, Hillman-Jackson J, Kuhn RM, Pedersen JS, Pohl A, Raney BJ, Rosenbloom KR, Siepel A, Smith KE, Sugnet CW, Sultan-Qurraie A, Thomas DJ, Trumbower H, Weber RJ, Weirauch M, Zweig AS, Haussler D, Kent WJ. The UCSC genome browser database: update 2006. Nucleic Acids Research. 2006;34:D590–D598. doi: 10.1093/nar/gkj144. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Hirsh AE, Fraser HB. Protein dispensability and rate of evolution. Nature. 2001;411:1046–1049. doi: 10.1038/35082561. [DOI] [PubMed] [Google Scholar]
  38. Huerta-Cepas J, Dopazo H, Dopazo J, Gabaldón T. The human phylome. Genome Biology. 2007;8:R109. doi: 10.1186/gb-2007-8-6-r109. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Huerta-Cepas J, Serra F, Bork P. ETE 3: reconstruction, analysis, and visualization of phylogenomic data. Molecular Biology and Evolution. 2016;33:1635–1638. doi: 10.1093/molbev/msw046. [DOI] [PMC free article] [PubMed] [Google Scholar]
  40. Hughes JF, Skaletsky H, Brown LG, Pyntikova T, Graves T, Fulton RS, Dugan S, Ding Y, Buhay CJ, Kremitzki C, Wang Q, Shen H, Holder M, Villasana D, Nazareth LV, Cree A, Courtney L, Veizer J, Kotkiewicz H, Cho TJ, Koutseva N, Rozen S, Muzny DM, Warren WC, Gibbs RA, Wilson RK, Page DC. Strict evolutionary conservation followed rapid gene loss on human and rhesus Y chromosomes. Nature. 2012;483:82–86. doi: 10.1038/nature10843. [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Imbeault M, Helleboid PY, Trono D. KRAB zinc-finger proteins contribute to the evolution of gene regulatory networks. Nature. 2017;543:550–554. doi: 10.1038/nature21683. [DOI] [PubMed] [Google Scholar]
  42. International Chicken Genome Sequencing Consortium. Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature. 2004;432:695–716. doi: 10.1038/nature03154. [DOI] [PubMed] [Google Scholar]
  43. Jordan IK, Rogozin IB, Wolf YI, Koonin EV. Essential genes are more evolutionarily conserved than are nonessential genes in bacteria. Genome Research. 2002;12:962–968. doi: 10.1101/gr.87702. [DOI] [PMC free article] [PubMed] [Google Scholar]
  44. Kalyaanamoorthy S, Minh BQ, Wong TKF, von Haeseler A, Jermiin LS. ModelFinder: fast model selection for accurate phylogenetic estimates. Nature Methods. 2017;14:587–589. doi: 10.1038/nmeth.4285. [DOI] [PMC free article] [PubMed] [Google Scholar]
  45. Karczewski KJ, Francioli LC, Tiao G, Cummings BB, Alföldi J, Wang Q, Collins RL, Laricchia KM, Ganna A, Birnbaum DP, Gauthier LD, Brand H, Solomonson M, Watts NA, Rhodes D, Singer-Berk M, England EM, Seaby EG, Kosmicki JA, Walters RK, Tashman K, Farjoun Y, Banks E, Poterba T, Wang A, Seed C, Whiffin N, Chong JX, Samocha KE, Pierce-Hoffman E, Zappala Z, O’Donnell-Luria AH, Minikel EV, Weisburd B, Lek M, Ware JS, Vittal C, Armean IM, Bergelson L, Cibulskis K, Connolly KM, Covarrubias M, Donnelly S, Ferriera S, Gabriel S, Gentry J, Gupta N, Jeandet T, Kaplan D, Llanwarne C, Munshi R, Novod S, Petrillo N, Roazen D, Ruano-Rubio V, Saltzman A, Schleicher M, Soto J, Tibbetts K, Tolonen C, Wade G, Talkowski ME, Genome Aggregation Database Consortium. Neale BM, Daly MJ, MacArthur DG. Author correction: the mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2021;590:E53. doi: 10.1038/s41586-020-03174-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  46. Katoh K, Standley DM. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution. 2013;30:772–780. doi: 10.1093/molbev/mst010. [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Katzman S, Capra JA, Haussler D, Pollard KS. Ongoing GC-biased evolution is widespread in the human genome and enriched near recombination hot spots. Genome Biology and Evolution. 2011;3:614–626. doi: 10.1093/gbe/evr058. [DOI] [PMC free article] [PubMed] [Google Scholar]
  48. Koren A, Polak P, Nemesh J, Michaelson JJ, Sebat J, Sunyaev SR, McCarroll SA. Differential relationship of DNA replication timing to different forms of human mutation and variation. The American Journal of Human Genetics. 2012;91:1033–1040. doi: 10.1016/j.ajhg.2012.10.018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  49. Korenberg JR, Rykowski MC. Human genome organization: Alu, LINEs, and the molecular structure of metaphase chromosome bands. Cell. 1988;53:391–400. doi: 10.1016/0092-8674(88)90159-6. [DOI] [PubMed] [Google Scholar]
  50. Krylov DM, Wolf YI, Rogozin IB, Koonin EV. Gene loss, protein sequence divergence, gene dispensability, expression level, and interactivity are correlated in eukaryotic evolution. Genome Research. 2003;13:2229–2235. doi: 10.1101/gr.1589103. [DOI] [PMC free article] [PubMed] [Google Scholar]
  51. Lewin TD, Royall AH, Holland PWH. Dynamic molecular evolution of mammalian homeobox genes: duplication, loss, divergence and gene conversion sculpt PRD class repertoires. Journal of Molecular Evolution. 2021;89:396–414. doi: 10.1007/s00239-021-10012-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  52. Liu G, Yong MYJ, Yurieva M, Srinivasan KG, Liu J, Lim JSY, Poidinger M, Wright GD, Zolezzi F, Choi H, Pavelka N, Rancati G. Gene essentiality is a quantitative property linked to cellular evolvability. Cell. 2015;163:1388–1399. doi: 10.1016/j.cell.2015.10.069. [DOI] [PubMed] [Google Scholar]
  53. MacDonald JR, Ziman R, Yuen RKC, Feuk L, Scherer SW. The database of genomic variants: a curated collection of structural variation in the human genome. Nucleic Acids Research. 2014;42:D986–D992. doi: 10.1093/nar/gkt958. [DOI] [PMC free article] [PubMed] [Google Scholar]
  54. Maclean CJ, Metzger BPH, Yang JR, Ho WC, Moyers B, Zhang J. Deciphering the genic basis of yeast fitness variation by simultaneous forward and reverse genetics. Molecular Biology and Evolution. 2017;34:2486–2502. doi: 10.1093/molbev/msx151. [DOI] [PubMed] [Google Scholar]
  55. Maeso I, Dunwell TL, Wyatt CDR, Marlétaz F, Vető B, Bernal JA, Quah S, Irimia M, Holland PWH. Evolutionary origin and functional divergence of totipotent cell homeobox genes in eutherian mammals. BMC Biology. 2016;14:45. doi: 10.1186/s12915-016-0267-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  56. McQueen HA, Siriaco G, Bird AP. Chicken microchromosomes are hyperacetylated, early replicating, and gene rich. Genome Research. 1998;8:621–630. doi: 10.1101/gr.8.6.621. [DOI] [PMC free article] [PubMed] [Google Scholar]
  57. Medstrand P, van de Lagemaat LN, Mager DL. Retroelement distributions in the human genome: variations associated with age and proximity to genes. Genome Research. 2002;12:1483–1495. doi: 10.1101/gr.388902. [DOI] [PMC free article] [PubMed] [Google Scholar]
  58. Miyata T, Yasunaga T, Nishida T. Nucleotide sequence divergence and functional constraint in mRNA evolution. PNAS. 1980;77:7328–7332. doi: 10.1073/pnas.77.12.7328. [DOI] [PMC free article] [PubMed] [Google Scholar]
  59. Monroe JG, Srikant T, Carbonell-Bejerano P, Becker C, Lensink M, Exposito-Alonso M, Klein M, Hildebrandt J, Neumann M, Kliebenstein D, Weng ML, Imbert E, Ågren J, Rutter MT, Fenster CB, Weigel D. Mutation bias reflects natural selection in Arabidopsis thaliana. Nature. 2022;602:101–105. doi: 10.1038/s41586-021-04269-6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  60. Moyers BA, Zhang J. Phylostratigraphic bias creates spurious patterns of genome evolution. Molecular Biology and Evolution. 2015;32:258–267. doi: 10.1093/molbev/msu286. [DOI] [PMC free article] [PubMed] [Google Scholar]
  61. Nakatani Y, Shingate P, Ravi V, Pillai NE, Prasad A, McLysaght A, Venkatesh B. Reconstruction of proto-vertebrate, proto-cyclostome and proto-gnathostome genomes provides new insights into early vertebrate evolution. Nature Communications. 2021;12:4489. doi: 10.1038/s41467-021-24573-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  62. Nei M. Mutation-Driven Evolution. Oxford University Press; 2013. [Google Scholar]
  63. Nguyen L-T, Schmidt HA, von Haeseler A, Minh BQ. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Molecular Biology and Evolution. 2015;32:268–274. doi: 10.1093/molbev/msu300. [DOI] [PMC free article] [PubMed] [Google Scholar]
  64. Nurk S, Koren S, Rhie A, Rautiainen M, Bzikadze AV, Mikheenko A, Vollger MR, Altemose N, Uralsky L, Gershman A, Aganezov S, Hoyt SJ, Diekhans M, Logsdon GA, Alonge M, Antonarakis SE, Borchers M, Bouffard GG, Brooks SY, Caldas GV, Chen N-C, Cheng H, Chin C-S, Chow W, de Lima LG, Dishuck PC, Durbin R, Dvorkina T, Fiddes IT, Formenti G, Fulton RS, Fungtammasan A, Garrison E, Grady PGS, Graves-Lindsay TA, Hall IM, Hansen NF, Hartley GA, Haukness M, Howe K, Hunkapiller MW, Jain C, Jain M, Jarvis ED, Kerpedjiev P, Kirsche M, Kolmogorov M, Korlach J, Kremitzki M, Li H, Maduro VV, Marschall T, McCartney AM, McDaniel J, Miller DE, Mullikin JC, Myers EW, Olson ND, Paten B, Peluso P, Pevzner PA, Porubsky D, Potapova T, Rogaev EI, Rosenfeld JA, Salzberg SL, Schneider VA, Sedlazeck FJ, Shafin K, Shew CJ, Shumate A, Sims Y, Smit AFA, Soto DC, Sović I, Storer JM, Streets A, Sullivan BA, Thibaud-Nissen F, Torrance J, Wagner J, Walenz BP, Wenger A, Wood JMD, Xiao C, Yan SM, Young AC, Zarate S, Surti U, McCoy RC, Dennis MY, Alexandrov IA, Gerton JL, O’Neill RJ, Timp W, Zook JM, Schatz MC, Eichler EE, Miga KH, Phillippy AM. The complete sequence of a human genome. Science. 2022;376:44–53. doi: 10.1126/science.abj6987. [DOI] [PMC free article] [PubMed] [Google Scholar]
  65. O’Geen H, Squazzo SL, Iyengar S, Blahnik K, Rinn JL, Chang HY, Green R, Farnham PJ. Genome-wide analysis of KAP1 binding suggests autoregulation of KRAB-ZNFs. PLOS Genetics. 2007;3:e89. doi: 10.1371/journal.pgen.0030089. [DOI] [PMC free article] [PubMed] [Google Scholar]
  66. Olson MV. When less is more: gene loss as an engine of evolutionary change. American Journal of Human Genetics. 1999;64:18–23. doi: 10.1086/302219. [DOI] [PMC free article] [PubMed] [Google Scholar]
  67. Pál C, Papp B, Lercher MJ. An integrated view of protein evolution. Nature Reviews. Genetics. 2006;7:337–348. doi: 10.1038/nrg1838. [DOI] [PubMed] [Google Scholar]
  68. Paradis E, Schliep K. ape 5.0: an environment for modern phylogenetics and evolutionary analyses in R. Bioinformatics. 2019;35:526–528. doi: 10.1093/bioinformatics/bty633. [DOI] [PubMed] [Google Scholar]
  69. Perry BW, Schield DR, Adams RH, Castoe TA, Arkhipova I. Microchromosomes exhibit distinct features of vertebrate chromosome structure and function with underappreciated ramifications for genome evolution. Molecular Biology and Evolution. 2020;38:904–910. doi: 10.1093/molbev/msaa253. [DOI] [PMC free article] [PubMed] [Google Scholar]
  70. Rangasamy D. Distinctive patterns of epigenetic marks are associated with promoter regions of mouse LINE-1 and LTR retrotransposons. Mobile DNA. 2013;4:27. doi: 10.1186/1759-8753-4-27. [DOI] [PMC free article] [PubMed] [Google Scholar]
  71. Rao SSP, Huntley MH, Durand NC, Stamenova EK, Bochkov ID, Robinson JT, Sanborn AL, Machol I, Omer AD, Lander ES, Aiden EL. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell. 2014;159:1665–1680. doi: 10.1016/j.cell.2014.11.021. [DOI] [PMC free article] [PubMed] [Google Scholar]
  72. Rice AM, McLysaght A. Dosage sensitivity is a major determinant of human copy number variant pathogenicity. Nature Communications. 2017;8:14366. doi: 10.1038/ncomms14366. [DOI] [PMC free article] [PubMed] [Google Scholar]
  73. Roux J, Liu J, Robinson-Rechavi M. Selective constraints on coding sequences of nervous system genes are a major determinant of duplicate gene retention in vertebrates. Molecular Biology and Evolution. 2017;34:2773–2791. doi: 10.1093/molbev/msx199. [DOI] [PMC free article] [PubMed] [Google Scholar]
  74. Schaibley VM, Zawistowski M, Wegmann D, Ehm MG, Nelson MR, St Jean PL, Abecasis GR, Novembre J, Zöllner S, Li JZ. The influence of genomic context on mutation patterns in the human genome inferred from rare variants. Genome Research. 2013;23:1974–1984. doi: 10.1101/gr.154971.113. [DOI] [PMC free article] [PubMed] [Google Scholar]
  75. Schield DR, Card DC, Hales NR, Perry BW, Pasquesi GM, Blackmon H, Adams RH, Corbin AB, Smith CF, Ramesh B, Demuth JP, Betrán E, Tollis M, Meik JM, Mackessy SP, Castoe TA. The origins and evolution of chromosomes, dosage compensation, and mechanisms underlying venom regulation in snakes. Genome Research. 2019;29:590–601. doi: 10.1101/gr.240952.118. [DOI] [PMC free article] [PubMed] [Google Scholar]
  76. Seplyarskiy VB, Sunyaev S. The origin of human mutation in light of genomic data. Nature Reviews Genetics. 2021;22:672–686. doi: 10.1038/s41576-021-00376-2. [DOI] [PubMed] [Google Scholar]
  77. Sharma V, Hecker N, Roscito JG, Foerster L, Langer BE, Hiller M. A genomics approach reveals insights into the importance of gene losses for mammalian adaptations. Nature Communications. 2018;9:1215. doi: 10.1038/s41467-018-03667-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  78. Shen XX, Opulente DA, Kominek J, Zhou X, Steenwyk JL, Buh KV, Haase MAB, Wisecaver JH, Wang M, Doering DT, Boudouris JT, Schneider RM, Langdon QK, Ohkuma M, Endoh R, Takashima M, Manabe RI, Čadež N, Libkind D, Rosa CA, DeVirgilio J, Hulfachor AB, Groenewald M, Kurtzman CP, Hittinger CT, Rokas A. Tempo and mode of genome evolution in the budding yeast subphylum. Cell. 2018;175:1533–1545. doi: 10.1016/j.cell.2018.10.023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  79. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K. dbSNP: the NCBI database of genetic variation. Nucleic Acids Research. 2001;29:308–311. doi: 10.1093/nar/29.1.308. [DOI] [PMC free article] [PubMed] [Google Scholar]
  80. Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31:3210–3212. doi: 10.1093/bioinformatics/btv351. [DOI] [PubMed] [Google Scholar]
  81. Slotkin RK, Martienssen R. Transposable elements and the epigenetic regulation of the genome. Nature Reviews Genetics. 2007;8:272–285. doi: 10.1038/nrg2072. [DOI] [PubMed] [Google Scholar]
  82. Stamatoyannopoulos JA, Adzhubei I, Thurman RE, Kryukov GV, Mirkin SM, Sunyaev SR. Human mutation rate associated with DNA replication timing. Nature Genetics. 2009;41:393–395. doi: 10.1038/ng.363. [DOI] [PMC free article] [PubMed] [Google Scholar]
  83. Steinegger M, Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology. 2017;35:1026–1028. doi: 10.1038/nbt.3988. [DOI] [PubMed] [Google Scholar]
  84. Terekhanova NV, Seplyarskiy VB, Soldatov RA, Bazykin GA. Evolution of local mutation rate and its determinants. Molecular Biology and Evolution. 2017;34:msx060. doi: 10.1093/molbev/msx060. [DOI] [PMC free article] [PubMed] [Google Scholar]
  85. The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489:57–74. doi: 10.1038/nature11247. [DOI] [PMC free article] [PubMed] [Google Scholar]
  86. The GTEx Consortium. Aguet F, Anand S, Ardlie KG, Gabriel S, Getz GA, Graubert A, Hadley K, Handsaker RE, Huang KH, Kashin S, Li X, MacArthur DG, Meier SR, Nedzel JL, Nguyen DT, Segrè AV, Todres E, Balliu B, Barbeira AN, Battle A, Bonazzola R, Brown A, Brown CD, Castel SE, Conrad DF, Cotter DJ, Cox N, Das S, de Goede OM, Dermitzakis ET, Einson J, Engelhardt BE, Eskin E, Eulalio TY, Ferraro NM, Flynn ED, Fresard L, Gamazon ER, Garrido-Martín D, Gay NR, Gloudemans MJ, Guigó R, Hame AR, He Y, Hoffman PJ, Hormozdiari F, Hou L, Im HK, Jo B, Kasela S, Kellis M, Kim-Hellmuth S, Kwong A, Lappalainen T, Li X, Liang Y, Mangul S, Mohammadi P, Montgomery SB, Muñoz-Aguirre M, Nachun DC, Nobel AB, Oliva M, Park Y, Park Y, Parsana P, Rao AS, Reverter F, Rouhana JM, Sabatti C, Saha A, Stephens M, Stranger BE, Strober BJ, Teran NA, Viñuela A, Wang G, Wen X, Wright F, Wucher V, Zou Y, Ferreira PG, Li G, Melé M, Yeger-Lotem E, Barcus ME, Bradbury D, Krubit T, McLean JA, Qi L, Robinson K, Roche NV, Smith AM, Sobin L, Tabor DE, Undale A, Bridge J, Brigham LE, Foster BA, Gillard BM, Hasz R, Hunter M, Johns C, Johnson M, Karasik E, Kopen G, Leinweber WF, McDonald A, Moser MT, Myer K, Ramsey KD, Roe B, Shad S, Thomas JA, Walters G, Washington M, Wheeler J, Jewell SD, Rohrer DC, Valley DR, Davis DA, Mash DC, Branton PA, Barker LK, Gardiner HM, Mosavel M, Siminoff LA, Flicek P, Haeussler M, Juettemann T, Kent WJ, Lee CM, Powell CC, Rosenbloom KR, Ruffier M, Sheppard D, Taylor K, Trevanion SJ, Zerbino DR, Abell NS, Akey J, Chen L, Demanelis K, Doherty JA, Feinberg AP, Hansen KD, Hickey PF, Jasmine F, Jiang L, Kaul R, Kibriya MG, Li JB, Li Q, Lin S, Linder SE, Pierce BL, Rizzardi LF, Skol AD, Smith KS, Snyder M, Stamatoyannopoulos J, Tang H, Wang M, Carithers LJ, Guan P, Koester SE, Little AR, Moore HM, Nierras CR, Rao AK, Vaught JB, Volpi S. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science. 2020;369:1318–1330. doi: 10.1126/science.aaz1776. [DOI] [PMC free article] [PubMed] [Google Scholar]
  87. Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L. Differential gene and transcript expression analysis of RNA-Seq experiments with TopHat and Cufflinks. Nature Protocols. 2012;7:562–578. doi: 10.1038/nprot.2012.016. [DOI] [PMC free article] [PubMed] [Google Scholar]
  88. Underwood CJ, Henderson IR, Martienssen RA. Genetic and epigenetic variation of transposable elements in Arabidopsis. Current Opinion in Plant Biology. 2017;36:135–141. doi: 10.1016/j.pbi.2017.03.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  89. van Schaik T, Vos M, Peric-Hupkes D, Hn Celie P, van Steensel B. Cell cycle dynamics of lamina-associated DNA. EMBO Reports. 2020;21:e50636. doi: 10.15252/embr.202050636. [DOI] [PMC free article] [PubMed] [Google Scholar]
  90. van Steensel B, Belmont AS. Lamina-associated domains: links with chromosome architecture, heterochromatin, and gene repression. Cell. 2017;169:780–791. doi: 10.1016/j.cell.2017.04.022. [DOI] [PMC free article] [PubMed] [Google Scholar]
  91. Vogel MJ, Guelen L, de Wit E, Peric-Hupkes D, Lodén M, Talhout W, Feenstra M, Abbas B, Classen A-K, van Steensel B. Human heterochromatin proteins form large domains containing KRAB-ZNF genes. Genome Research. 2006;16:1493–1504. doi: 10.1101/gr.5391806. [DOI] [PMC free article] [PubMed] [Google Scholar]
  92. Waters PD, Patel HR, Ruiz-Herrera A, Álvarez-González L, Lister NC, Simakov O, Ezaz T, Kaur P, Frere C, Grützner F, Georges A, Graves JAM. Microchromosomes are building blocks of bird, reptile, and mammal chromosomes. PNAS. 2021;118:e2112494118. doi: 10.1073/pnas.2112494118. [DOI] [PMC free article] [PubMed] [Google Scholar]
  93. Xie KT, Wang G, Thompson AC, Wucherpfennig JI, Reimchen TE, MacColl ADC, Schluter D, Bell MA, Vasquez KM, Kingsley DM. DNA fragility in the parallel evolution of pelvic reduction in stickleback fish. Science. 2019;363:81–84. doi: 10.1126/science.aan1425. [DOI] [PMC free article] [PubMed] [Google Scholar]
  94. Yang J, Gu Z, Li WH. Rate of protein evolution versus fitness effect of gene deletion. Molecular Biology and Evolution. 2003;20:772–774. doi: 10.1093/molbev/msg078. [DOI] [PubMed] [Google Scholar]
  95. Yang Z. PAML 4: phylogenetic analysis by maximum likelihood. Molecular Biology and Evolution. 2007;24:1586–1591. doi: 10.1093/molbev/msm088. [DOI] [PubMed] [Google Scholar]
  96. Yoshihara M, Jiang L, Akatsuka S, Suyama M, Toyokuni S. Genome-wide profiling of 8-oxoguanine reveals its association with spatial positioning in nucleus. DNA Research. 2014;21:603–612. doi: 10.1093/dnares/dsu023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  97. Zhang J, Yang JR. Determinants of the rate of protein sequence evolution. Nature Reviews. Genetics. 2015;16:409–420. doi: 10.1038/nrg3950. [DOI] [PMC free article] [PubMed] [Google Scholar]
  98. Zheng X, Hu J, Yue S, Kristiani L, Kim M, Sauria M, Taylor J, Kim Y, Zheng Y. Lamins organize the global three-dimensional genome from the nuclear periphery. Molecular Cell. 2018;71:802–815. doi: 10.1016/j.molcel.2018.05.017. [DOI] [PMC free article] [PubMed] [Google Scholar]

Editor's evaluation

Wenfeng Qian 1

The study provides a fundamental understanding of the driving forces behind gene losses in genome evolution and connects the propensity for gene losses to local genomic features like mutation rate and expression pattern. The methodology is compelling, as it identifies "elusive human genes" through independent gene losses in at least two mammalian lineages. The comparative genomics and statistical analyses are thorough and rigorous, making this study appealing to readers interested in exploring the global patterns and underlying mechanisms of gene fate evolution across the phylogenetic tree.

Decision letter

Editor: Wenfeng Qian1
Reviewed by: Wenfeng Qian2

Our editorial process produces two outputs: (i) public reviews designed to be posted alongside the preprint for the benefit of readers; (ii) feedback on the manuscript for the authors, including requests for revisions, shown below. We also include an acceptance summary that explains what the editors found interesting or important about the work.

Decision letter after peer review:

Thank you for submitting your article "Gene fate spectrum as a reflection of local genomic properties" for consideration by eLife. Your article has been reviewed by 3 peer reviewers, including Wenfeng Qian as the Reviewing Editor and Reviewer #1, and the evaluation has been overseen by George Perry as the Senior Editor.

The reviewers have discussed their reviews with one another, and the Reviewing Editor has drafted this to help you prepare a revised submission.

Essential revisions:

1) The identification of "elusive genes" in the current manuscript requires additional scrutinization, given it is the foundation of the whole study. Please check published pipelines in the identification of gene losses (e.g., TOGA – https://github.com/hillerlab/TOGA) and use additional tools such as BLASTX search to test for known technical artifacts when calling genes (homology detection failure, refer to https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3000862). Please give a reason for the parameters used in the analysis (e.g., CD-Hit clustering) or examine if the conclusions remain supported by various parameters in the computational pipeline. Also, please take a look at the enrichment of elusive genes in human chr19, and use the synteny-based age estimation of the elusive genes (Shao et al. 2019).

2) Please also present the features of other genes (the genomic background other than elusive and non-elusive genes). Do these genes show intermediate patterns between those of elusive and non-elusive genes? Please edit the manuscript accordingly if the definition of non-elusiveness is actually equivalent to the genomic background.

3) It would be informative to test the links between recombination rate / LD and the genomic locations of elusive genes (compared against randomly sampled genes).

4) Please control confounding factors such as gene expression level and confirm whether the proxy of mutation rate (i.e., Ks) is actually confounded by gene importance.

5) Please consider extending the analyses on fish and birds to other genomic features.

6) The authors should reconcile the findings in this study with previous reports about microchromosomes.

7) Please consider improving the clarity/presentation of Figures 5 and 7 and examine whether the pattern remains robust using various parameter sets.

8) Please think of a better term than "elusive gene" to describe the genes that were lost independently in different lineages. Please also clearly define other terms in the manuscript (e.g., functionally indispensable vs. importance, are they the same concept?)

9) Please consider presenting the current study in the framework of mutation-selection balance, and better explain the novelty of the study over previous tremendous studies about gene losses.

Reviewer #1 (Recommendations for the authors):

Line 18. "neutral factor" is better replaced by "factors independent of gene dispensability".

Line 47. "However" should be "on the contrary"?

Lines 113-114. As indicated in the weaknesses part, I am not fully convinced these genomic, epigenomic, and transcriptomic features are completely independent of gene function.

Figure 1b. Define the red and orange crosses in the legend.

Figure 7 appeared first in the Discussion section. Can this part be moved to the Results section?

Reviewer #2 (Recommendations for the authors):

Overall, I believe that this is an interesting study. However, this version of the manuscript could be significantly improved in terms of logical depth and methodological stringency.

1. Authors actually support the concept of mutation-driven evolution, i.e., the high mutation rate in genomic regions harboring the elusive genes would predispose their fate toward death. To increase the significance of their work, I suggest authors cite (Nei 2013; Xie et al. 2019) and put their work in a bigger context.

2. Authors mentioned that elusive genes are less important and thus more prone to loss. In my view, pleiotropy is a better term compared to importance. That is, elusive genes are less pleiotropic [e.g. narrowly expressed, Figure 5] and thus their loss is more tolerable or more easily compensated by other genes. Actually, narrow expression breadth has been observed to be correlated with gene loss in both humans and flies (MacArthur et al. 2012; Yang et al. 2015).

3. I am generally convinced that authors reliably identified elusive genes by identifying gene loss events in the common ancestor of multiple descendant species (to control for errors induced by assembly or annotation, Figure 1). However, Figure 7 shows the enrichment of elusive genes in human chr19. This chromosome is well known to be enriched with the tandemly duplicated Krueppel-associated box C2H2 zinc-finger protein family (KZNF), many of which are primate-specific (Shao et al. 2019). I suspect that the tree-based strategy implemented in Figure 1 may not be able to dissect the evolution of this highly complex gene family. I am proposing two specific analyses: how many elusive genes encoded by chr19 are KZNFs? How many of them have Ensembl one-to-one orthologs across mammals?

4. With the patterns in Figure 3 and 7, authors argued that features of elusive genes are deeply ancient and could be inherited from the microchromosomes of early vertebrates. This statement has multiple problems.

a) Figure 3 only shows genomic-level features (e.g., high GC content) conserved in multiple vertebrates including shark and chicken. Epigenetic features analyzed in Figures 5 to 6 were only based on human data. I suggest that the authors extend these analyses to shark or chicken. Although some epigenetic data may not be available for these species, transcriptome data of the kind analyzed in Figure 5 should be available for at least some species.

b) In Figure 7, the authors propose a concept related to microchromosomes. These chromosomes have been extensively studied, especially in birds. Some features of microchromosomes are consistent with those of elusive genes [e.g., high GC, (Bravo et al. 2021)]. However, microchromosomes are conserved in terms of gene order and their genes generally show high protein-level constraints as shown by low Ka/Ks (Waters et al. 2021; Li et al. 2022). Authors need to reconcile their discovery with the previous rich literature.

c) Line (L) 204, among 982 human elusive genes, only 540~390 are shared by other species (e.g., shark). I suggest taking advantage of genome-level synteny based age data generated in Shao et al. 2019 to examine the age distribution of human elusive genes. If a high proportion of them are dated as being old (e.g., shared by jawed vertebrates), the statement that these genes have an ancient origin could be better supported.

References

Bravo GA, Schmitt CJ, Edwards SV. 2021. What have we learned from the first 500 avian genomes. Annu Rev Ecol Evol Syst 52: 611-639.

Li M, Sun C, Xu N, Bian P, Tian X, Wang X, Wang Y, Jia X, Heller R, Wang M et al. 2022. de novo Assembly of 20 Chicken Genomes Reveals the Undetectable Phenomenon for Thousands of Core Genes on Microchromosomes and Subtelomeric Regions. Mol Biol Evol 39.

MacArthur DG, Balasubramanian S, Frankish A, Huang N, Morris J, Walter K, Jostins L, Habegger L, Pickrell JK, Montgomery SB et al. 2012. A systematic survey of loss-of-function variants in human protein-coding genes. Science 335: 823-828.

Nei M. 2013. Mutation-driven evolution. OUP Oxford.

Shao Y, Chen C, Shen H, He BZ, Yu D, Jiang S, Zhao S, Gao Z, Zhu Z, Chen X et al. 2019. GenTree, an integrated resource for analyzing the evolution and function of primate-specific coding genes. Genome Res 29: 682-696.

Waters PD, Patel HR, Ruiz-Herrera A, Alvarez-Gonzalez L, Lister NC, Simakov O, Ezaz T, Kaur P, Frere C, Grutzner F et al. 2021. Microchromosomes are building blocks of bird, reptile, and mammal chromosomes. Proc Natl Acad Sci U S A 118.

Xie KT, Wang G, Thompson AC, Wucherpfennig JI, Reimchen TE, MacColl AD, Schluter D, Bell MA, Vasquez KM, Kingsley DM. 2019. DNA fragility in the parallel evolution of pelvic reduction in stickleback fish. Science 363: 81-84.

Yang H, He BZ, Ma H, Tsaur SC, Ma C, Wu Y, Ting CT, Zhang YE. 2015. Expression profile and gene age jointly shaped the genome-wide distribution of premature termination codons in a Drosophila melanogaster population. Mol Biol Evol 32: 216-228.

Reviewer #3 (Recommendations for the authors):

– In recent years several gene loss pipelines were already published (e.g. TOGA – https://github.com/hillerlab/TOGA) and it would be highly beneficial for this study to compare their gene loss reports with output obtained from existing pipelines (which also address the false discovery rate issue)

– We did not appreciate Figure 5. We strongly recommend finding a more quantitative approach to visualise these results, since heat maps are misleading and show different x-axis and y-axis ranges (zoom-ins/ zoom-outs)

– All Figures and Supplementary Figures showing violin plots need to report the number of genes that underlie these distributions.

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Thank you for resubmitting your work entitled "Gene fate spectrum as a reflection of local genomic properties" for further consideration by eLife. Your revised article has been evaluated by George Perry (Senior Editor) and a Reviewing Editor. A previous reviewer also read and commented on the revised manuscript.

Apparently, the manuscript has been significantly improved but there are some remaining issues that need to be addressed, as outlined below. In view of these comments, we kindly request that you consider revising the manuscript once more. Our hope is that, through this additional revision, the manuscript will be written more clearly and rigorously. Thank you for your understanding and continued efforts in improving your submission.

Reviewer #2 made the following comment. Please consider it during revision.

"The authors attempt to argue that the elusive status is ancient ("Thus, the heterogeneous genomic features driving gene fates toward loss have been in place since the ancestral vertebrates"). However, in response to my previous suggestion regarding chicken microchromosomes, the authors present mixed results. They observed high GC content, high gene density, and short gene length in chicken, similar to the findings in humans (Figure 3). Yet, the critical functional data between the two species are conflicting: human elusive genes exhibit low expression and fewer ATAC-seq peaks, while their chicken counterparts display the opposite pattern. In other words, chicken elusive genes exhibit higher pleiotropy, which may decrease the likelihood of their loss. Thus, these genes are not elusive, and high GC content, high gene density, and short gene length do not necessarily predict elusiveness. Given that the authors only analyzed the functional data of human and chicken genomes, it is not possible to determine whether the "elusive" status is ancient or derived from the human or mammalian lineage. I suggest that the authors analyze the transcriptome data in shark or spotted gar to provide further phylogenetic context. Otherwise, the authors should significantly tone down their statement."

The reviewing editor also has some comments on the title and abstract.

1. Please consider revising the title to "The Impact of Local Genomic Properties on the Evolutionary Fate of Genes Across Vertebrates".

2. Please consider adding a sentence (or something similar) at the end of the abstract. "This study sheds light on the complex interplay between gene function and local genomic properties in shaping gene evolution across vertebrate lineages."

eLife. 2023 May 24;12:e82290. doi: 10.7554/eLife.82290.sa2

Author response


Essential revisions:

1) The identification of "elusive genes" in the current manuscript requires additional scrutinization, given it is the foundation of the whole study. Please check published pipelines in the identification of gene losses (e.g., TOGA – https://github.com/hillerlab/TOGA) and use additional tools such as BLASTX search to test for known technical artifacts when calling genes (homology detection failure, refer to https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.3000862). Please give a reason for the parameters used in the analysis (e.g., CD-Hit clustering) or examine if the conclusions remain supported by various parameters in the computational pipeline. Also, please take a look at the enrichment of elusive genes in human chr19, and use the synteny-based age estimation of the elusive genes (Shao et al. 2019).

Based on this suggestion, we have repeated the analysis to more rigorously exclude possible false positives and, as a result, have obtained a refined set of 813 elusive genes. The validation of gene loss employed the ortholog annotations implemented in the RefSeq gene prediction pipeline (a synteny-based ortholog clustering platform other than TOGA) as well as sequence similarity searches with MMseqs2, which outperformed NCBI BLAST.

For the revised manuscript, we have refined the elusive gene set as the reviewer suggested. In the genome assemblies, we have searched for the orthologs of the elusive genes in the species from which they were missing. The search was conducted by querying the amino acid sequences of the elusive genes with tblastn as well as with MMseqs2, which outperformed tblastn in sensitivity and computational speed. In addition, in response to a comment by Reviewer 3, we have searched for the orthologs by referring to existing ortholog annotations. We used the ortholog annotations implemented in RefSeq instead of those from the TOGA pipeline: both employ synteny conservation. We have checked the identified orthologs against our gene loss criterion (absence from all the species used in a particular taxon) and excluded 268 genes from the original elusive gene set. These genes include those missing in the previous gene annotations used in the original manuscript but present in the latest ones, as well as those falsely inferred as missing due to incorrect inference of gene trees. Finally, the refined set of 813 elusive genes was compared with the non-elusive genes. Importantly, these comparisons retained the significantly different trends of the particular genomic, transcriptomic, and epigenomic features between them except for very few cases (Author response table 1). This indicates that both the initial and revised sets of the elusive genes reflect the nature of the ‘elusiveness,’ though the initial set contained some noise. We have modified the numbers of elusive genes in the corresponding parts of the manuscript including figures and tables. Additionally, we have added the validation procedures in Methods.
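
The loss criterion described above (a gene counts as lost in a taxon only when no ortholog is found in any species sampled from that taxon, and as elusive when such losses occur in at least two lineages) can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the taxon names, species lists, and function names are hypothetical, and the sketch assumes that the listed taxa are independent lineages rather than checking this on a phylogeny.

```python
from typing import Dict, List, Set

def lost_taxa(present_in: Set[str], taxa: Dict[str, List[str]]) -> List[str]:
    """Return the taxa in which the gene is absent from every sampled species."""
    return [
        taxon
        for taxon, species in taxa.items()
        if species and not any(sp in present_in for sp in species)
    ]

def is_elusive(present_in: Set[str], taxa: Dict[str, List[str]],
               min_lineages: int = 2) -> bool:
    """A gene is called elusive when it is lost in at least `min_lineages` taxa."""
    return len(lost_taxa(present_in, taxa)) >= min_lineages

# Hypothetical sampling scheme (assumption, for illustration only).
taxa = {
    "rodents": ["mouse", "rat"],
    "carnivores": ["dog", "cat"],
    "cetartiodactyls": ["cow", "pig"],
}

# Before validation: orthologs annotated only in cow and pig.
before = is_elusive({"cow", "pig"}, taxa)

# After validation: a similarity search recovers an ortholog in rat,
# which withdraws the loss call for rodents.
after = is_elusive({"cow", "pig", "rat"}, taxa)
```

Under this reading, recovering a single ortholog in one species is enough to cancel a loss call for the whole taxon, which is why re-searching the assemblies can only shrink the elusive gene set.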

Author response table 1. Difference in statistical significances across different elusive gene sets.

Features | Non-significant in the initial gene set (1,081 elusive genes) | Non-significant in the current gene set (813) | Non-significant in the current gene set excluding chr19 (669)
Gene density in the turkey genome
Gene density in the green anole genome
Gene density in the bamboo shark genome
Gene density in the whale shark genome
KS in avians
KA in avians
KA in sharks
ATAC-seq peak density for GM23338
Lamin B1 ChIP-seq peak density for K562

The other features showed significantly different trends between the elusive and non-elusive genes for all of the elusive gene sets and thus are not included in this table.

cd-hit clustering with 100% sequence identity only clusters those with identical (and sometimes truncated) sequences, and, in the cluster, the sequences other than the representative are discarded. This means that the sequences remain if they are not identical to the other ones. If the similarity threshold is lowered, both identical and highly similar sequences are clustered with each other, and more sequences are discarded. Therefore, our approach that employs clustering with 100% similarity may minimize false positive gene loss.
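
The effect of clustering at 100% identity can be illustrated with a toy deduplication. This is not CD-HIT itself but a simplified stand-in, assuming that a sequence is redundant only when it is identical to, or an exact substring of, an already kept representative; the function name and the example sequences are hypothetical.

```python
from typing import Dict, List

def dedupe_identical(seqs: Dict[str, str]) -> List[str]:
    """Keep one representative per group of identical (or truncated) sequences."""
    # Visit the longest sequences first so that a truncated copy is compared
    # against its full-length version; ties are broken by name.
    order = sorted(seqs, key=lambda name: (-len(seqs[name]), name))
    kept: List[str] = []
    for name in order:
        if any(seqs[name] in seqs[rep] for rep in kept):
            continue  # identical to, or contained in, a kept representative
        kept.append(name)
    return kept

# Hypothetical protein sequences (assumption, for illustration only).
proteins = {
    "geneA": "MKTAYIAKQR",
    "geneA_copy": "MKTAYIAKQR",   # identical to geneA: discarded
    "geneA_trunc": "KTAYIAK",     # truncated copy of geneA: discarded
    "geneB": "MKTAYIAKQK",        # differs by one residue: kept
}
representatives = dedupe_identical(proteins)
```

In this toy case `geneB` survives despite differing from `geneA` at a single residue, which is the behavior the paragraph above relies on: a lower identity threshold would merge the two and discard one of them.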

Please refer to the reply to [Rev2 point 3] for the abundance of the elusive genes on chromosome 19. To examine this point, we have performed the comparisons between the elusive and non-elusive genes excluding the genes on chromosome 19, and the characteristics remained unchanged even when chromosome 19 was excluded.

2) Please also present the features of other genes (the genomic background other than elusive and non-elusive genes). Do these genes show intermediate patterns between those of elusive and non-elusive genes? Please edit the manuscript accordingly if the definition of non-elusiveness is actually equivalent to the genomic background.

Our aim in this study is to extract the characteristics of the genes that differentiate their fates from retention to loss. To achieve this, we compared the genes with clearly different phylogenetic signatures for gene loss, namely elusive and non-elusive genes.

The remaining genes, i.e., those classified as neither elusive nor non-elusive, do not necessarily exhibit intermediate features. Because our definitions of the elusive and non-elusive genes were stringent, the remainder may contain a considerable number of genes that were lost in more restricted taxa than our criterion requires. In addition, the remainder contains genes with other particular phylogenetic signals (e.g., frequent duplication). These genes do not necessarily exhibit intermediate features, and at least the former group is rather closer to the elusive genes.

3) It would be informative to test the links between recombination rate / LD and the genomic locations of elusive genes (compared against randomly sampled genes).

We have retrieved fine-scale recombination rate data of males and females from https://www.decode.com/addendum/ (Suppl. Data of Kong, A et al., Nature, 467:1099–1103, 2010) and have compared them between the gene regions of the elusive and non-elusive genes. Both comparisons show no significant differences: average 0.829 and 0.900 recombinations/kb for the elusive and non-elusive genes, respectively, p=0.898, for males; average 0.836 and 0.846 recombinations/kb for the elusive and non-elusive genes, respectively, p=0.256, for females.

4) Please control confounding factors such as gene expression level and confirm whether the proxy of mutation rate (i.e., Ks) is actually confounded by gene importance.

We thank the reviewer for this important comment. We fully agree that transcriptomic and epigenomic features cannot be easily distinguished from gene dispensability, and we do not think that these features of the elusive genes can be explained solely by intrinsic properties of the genomes. Our motivation for investigating the expression profiles of the elusive genes is to understand how they lost their functional indispensability (original manuscript L285-286 in Results). We also discussed the possibility that the sequence composition and genomic location of elusive genes may be associated with epigenetic features that repress expression, which may result in a decrease of functional constraints (original manuscript L470-474 in Discussion). Nevertheless, we think that the original manuscript may have contained misleading wording, and thus we have edited it to better convey our view that gene expression and epigenomic features are related to gene function.

(P.2, Introduction) “This evolutionary fate of a gene can also be affected by factors independent of gene dispensability, including the mutability of genomic positions, but such features have not been examined well.”

(P6, Introduction) “These data assisted us to understand how intrinsic genomic features may affect gene fate, leading to gene loss by decreasing the expression level and eventually relaxing the functional importance of ʻelusiveʼ genes.”

(P33, Discussion) “Another factor is the spatiotemporal suppression of gene expression via epigenetic constraints. Previous studies showed that lowly expressed genes reduce their functional dispensability (Cherry, 2010; Gout et al., 2010), and so do the elusive genes.”

Additionally, in response to the advice from Reviewers 1 and 2, we have added a new section, Elusive gene orthologs in the chicken microchromosomes, in which we describe the relationship between the elusive genes and chicken microchromosomes. In this section, we also discuss the relationship between the genomic features of the elusive genes and their transcriptomic and epigenomic characteristics. In the chicken genome, elusive genes showed neither reduced pleiotropy of gene expression nor the epigenetic features associated with such a reduction, consistent with the moderation of nucleotide substitution rates. This also suggests that the relaxation of the ‘elusiveness’ is associated with the increase of functional indispensability.

(P27, Elusive gene orthologs in the chicken microchromosomes in Results) “Our analyses indicates that the genomic features of the elusive genes such as high GC and high nucleotide substitutions do not always correlate with a reduction in pleiotropy of gene expression that potentially leads to an increase in functional dispensability, although these features have been well conserved across vertebrates. In addition, the avian orthologs of the elusive genes did not show higher KA and KS values than those of the non-elusive genes (Figure 3; Figure 3—figure supplement 1), likely consistent with similar expression levels between them (Figure 5—figure supplement 1) (Cherry, 2010; Zhang and Yang, 2015). With respect to the chicken genome, the sequence features of the elusive genes themselves might have been relaxed during evolution.”

Also, please refer to [Rev1- point2] for the mutability of the elusive genes. To examine this point, we have computed nucleotide sequence differences in introns, namely KI, between the human and chimpanzee genomes. This analysis revealed higher KI values in the elusive genes than in the non-elusive genes, which is in line with our original hypothesis. The results have been added in the revised manuscript.

5) Please consider extending the analyses on fish and birds to other genomic features.

Please refer to the reply to [Rev2 point 4a] for the comparison between the elusive and non-elusive gene orthologs of nonmammalian vertebrates. We analyzed expression profiles and ATAC-seq peak densities of the elusive and non-elusive genes of these animals. We have created a new section, Elusive gene orthologs in the chicken microchromosomes, in Results and described the results in this section and in Discussion.

We thank the reviewer for this meaningful suggestion. In response, we have computed the differences in intron sequences between the human and chimpanzee genomes and compared them between the elusive and non-elusive genes. As expected, we found larger sequence differences in introns for the elusive genes than for the non-elusive genes. In Figure 2c of the revised manuscript, we have included the distribution of KI, the sequence differences in introns between the human and chimpanzee genomes, for the elusive and non-elusive genes. Additionally, we have added the corresponding text to Results and the procedure to Methods as shown below.

(P11, Identification of human ‘elusive’ genes in Results) “In addition, we computed nucleotide substitution rates for introns (KI) between human and chimpanzee (Pan troglodytes) orthologs and compared them between the elusive and non-elusive genes.”

(P11, Identification of human ‘elusive’ genes in Results) “Our analysis further illuminated larger KS and KI values for the elusive genes than in the non-elusive genes (Figure 2b, c; Figure 2—figure supplement 1). Importantly, the higher rate of synonymous and intronic nucleotide substitutions, which may not affect changes in amino acid residues, indicates that the elusive genes are also susceptible to genomic characteristics independent of selective constraints on gene functions.”

(P39, Methods) “To compute nucleotide sequence differences of the individual introns, we extracted 473 elusive and 4,626 non-elusive genes that harbored introns aligned with the chimpanzee genome assembly. The nucleotide differences were calculated via the whole genome alignments of hg38 and panTro6 retrieved from the UCSC genome browser.”
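
As a rough illustration of how a per-intron nucleotide difference could be derived from a pairwise alignment, the sketch below computes the uncorrected proportion of mismatched sites. This is an assumption about the general approach rather than the authors' procedure: the response does not state whether a substitution model or any correction was applied, and the function name and the two aligned sequences are hypothetical.

```python
def intron_p_distance(human: str, chimp: str) -> float:
    """Proportion of differing sites among aligned, unambiguous positions."""
    if len(human) != len(chimp):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatched = 0
    for h, c in zip(human.upper(), chimp.upper()):
        if h not in "ACGT" or c not in "ACGT":
            continue  # skip alignment gaps and ambiguous bases
        compared += 1
        if h != c:
            mismatched += 1
    if compared == 0:
        raise ValueError("no comparable sites in the alignment")
    return mismatched / compared

# Hypothetical aligned intron fragments (assumption, for illustration only).
human_intron = "ACGTACGT-ACGTNACGTAC"
chimp_intron = "ACGTACTTAACGTAACGTGC"
ki = intron_p_distance(human_intron, chimp_intron)
```

Here 18 columns are comparable and 2 of them differ, so `ki` is 2/18. For closely related genomes such as human and chimpanzee the uncorrected proportion is close to a model-corrected distance, but that equivalence is not something the source states.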

6) The authors should reconcile the findings in this study with previous reports about microchromosomes.

Please refer to the reply to [Rev2-point 4b] for the reconciliation of our findings on the elusive genes with the features of microchromosomes. In the human genome, we found a significant overlap between the elusive genes and the genes whose chicken orthologs are located on microchromosomes. The chicken microchromosomes shared some sequence features, such as high GC content and high KS values, with the elusive genes as we define them, but exhibited opposite transcriptomic and epigenetic trends. Although this result does not change the basis of our study in the human genome, it indicates that, in the course of evolution, genomic features of the elusive genes are not always associated with a reduction of pleiotropy of gene expression. The results have been described in the newly created section Elusive gene orthologs in the chicken microchromosomes in Results.

7) Please consider improving the clarity/presentation of Figures 5 and 7 and examine whether the pattern remains robust using various parameter sets.

Please refer to the replies to [Rev3- point 2] for this point. First, following this suggestion, we have conducted a statistical test to see whether the elusive genes contain more genes with restricted expression profiles (H’ < 1) than the non-elusive genes, and this trend was statistically supported. We have added the gene numbers of the individual categories and the result of this statistical test to Figure 5. Therefore, we think the classification of the elusive genes based on the threshold (H’ = 1) is reasonable in Figure 7.

In addition, we have conducted statistical tests in a similar way with different thresholds for H’ (2, 3, and 0.5) and found that the pattern remains robust (p = 3.85 × 10^-67, 1.29 × 10^-50, and 9.40 × 10^-46 with thresholds of H’ = 2, 3, and 0.5, respectively, for the GTEx dataset, and p = 8.35 × 10^-61, 2.01 × 10^-61, and 1.25 × 10^-24 with thresholds of H’ = 2, 3, and 0.5, respectively, for the Descartes dataset).

To use Figure 7 in a new section in Results, we have added an ideogram showing the distribution of the genes whose chicken orthologs are located on microchromosomes. In response to the comment by Reviewer 2 [Rev2- point 4b], we have performed statistical tests and found that the elusive genes were significantly more likely than the non-elusive genes to have orthologs on microchromosomes. Furthermore, the observation that the elusive genes tend to be located in gene-rich regions was already statistically supported (Figure 2f).

As shown in Figure 5, Shannon’s H' ranged from zero to approximately 4 (exact maximum value is 3.97) and 5 (5.11) for the GTEx and Descartes gene expression datasets, respectively. Although the threshold H'=1 was set arbitrarily, we think that it is reasonable for distinguishing the genes with high pleiotropy from those with low pleiotropy.
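
The index and threshold discussed above can be sketched as follows. This is a minimal illustration under stated assumptions: expression values are converted to proportions across tissues, the natural logarithm is used, and a gene with no detected expression is assigned H' = 0. None of these choices is confirmed by the response, and the function names and expression vectors are hypothetical. With natural logarithms the index over n tissues cannot exceed ln n; for instance, ln 53 ≈ 3.97.

```python
import math
from typing import Sequence

def shannon_h(expression: Sequence[float]) -> float:
    """Shannon's diversity index of an expression profile across tissues."""
    total = sum(expression)
    if total <= 0:
        return 0.0  # assumption: no detected expression is treated as H' = 0
    h = 0.0
    for value in expression:
        if value > 0:
            p = value / total  # share of total expression in this tissue
            h -= p * math.log(p)
    return h

def is_less_pleiotropic(expression: Sequence[float], threshold: float = 1.0) -> bool:
    """Classify a gene as less pleiotropic when H' falls below the threshold."""
    return shannon_h(expression) < threshold

# Hypothetical profiles over 53 tissues (assumption, for illustration only).
uniform = [5.0] * 53                  # equally expressed everywhere
single_tissue = [0.0] * 52 + [12.0]   # expressed in one tissue only

h_uniform = shannon_h(uniform)        # reaches the upper bound ln(53)
h_single = shannon_h(single_tissue)   # lower bound of the index
```

A gene expressed evenly across all tissues reaches the upper bound, while one confined to a single tissue scores zero, so a cut-off of 1 sits near the low end of either dataset's range.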

8) Please think of a better term than "elusive gene" to describe the genes that were lost independently in different lineages. Please also clearly define other terms in the manuscript (e.g., functionally indispensable vs. importance, are they the same concept?)

The phrase ‘elusive gene’ was already used in our previous paper, where it was adopted on the advice of peer reviewers, and for consistency we would like to use this term here. We have modified the sentence introducing this term to explain it more carefully.

9) Please consider presenting the current study in the framework of mutation-selection balance, and better explain the novelty of the study over previous tremendous studies about gene losses.

Please refer to the replies below, which explain the novelty of our study. We have cited existing literature in the revised manuscript.

Reviewer #2 (Recommendations for the authors): Overall, I believe that this is an interesting study. However, this version of the manuscript could be significantly improved in terms of logical depth and methodological stringency.

1. Authors actually support the concept of mutation-driven evolution, i.e., the high mutation rate in genomic regions harboring the elusive genes would predispose their fate toward death. To increase the significance of their work, I suggest authors cite (Nei 2013; Xie et al. 2019) and put their work in a bigger context.

We thank the reviewer for considering a way to enhance the significance of our study. We now recognize the importance of the study by Xie et al. (2019) reporting the recurrent loss of an enhancer that modulates fin morphogenesis in stickleback. As suggested, in the revised manuscript, we have cited this paper in Introduction.

(P4, Introduction) “In the stickleback genome, a Pitx1 enhancer was independently lost in multiple lineages inhabiting freshwater due to its genomic location in a structurally fragile site, leading to recurrent loss of pelvic fins (Xie et al., 2019). Genes and genomic elements in such particular regions may be prone to loss in a more neutral manner than the relaxation of functional importance or via functional adaptations.”

Additionally, to broaden the interest of our study, we have cited Dr. Nei’s book in Discussion as shown below.

(P31, Discussion) “Close coordination of the studies on gene evolution with germline mutation repertoires and spectra, which can be approximated from the collection of de novo mutations obtained by trio sequencing, may further facilitate our understanding of gene fates driven by heterogeneous genomic features—this would be viewed as ‘mutation-driven’ evolution (Nei, 2013).”

2. Authors mentioned that elusive genes are less important and thus more prone to loss. In my view, pleiotropy is a better term compared to importance. That is, elusive genes are less pleiotropic [e.g. narrowly expressed, Figure 5] and thus their loss is more tolerable or more easily compensated by other genes. Actually, narrow expression breadth has been observed to be correlated with gene loss in both humans and flies (MacArthur et al. 2012; Yang et al. 2015).

We thank the reviewer for the thoughtful suggestion. We agree that the word pleiotropy is more suitable in our manuscript. We have modified the manuscript as shown below.

(P20, Transcriptomic natures of elusive genes in Results) “Our findings demonstrate that some elusive genes harbor low-level and spatially-restricted expression profiles, i.e., less pleiotropic states, which are rarely observed in the non-elusive genes.”

(P23, Epigenetic nature of elusive genes in Results) “… we classified the elusive genes into two groups based on the pleiotropy in terms of gene expressions: that is, 589 elusive genes with Shannon’s diversity index H′ ≥ 1 were ubiquitously expressed, i.e., more pleiotropic, and 150 of those with H′ < 1 were expressed in only a few or none of the tissues examined, i.e., less pleiotropic (Figure 5).”

(P34, Discussion) “Elusive genes with reduced pleiotropy may have limited opportunities to function, potentially leading to loss of their important roles.”
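
To illustrate the classification quoted above, the following is a minimal sketch of how a Shannon's diversity index H′ could be computed from a gene's expression values across tissues and compared with the threshold of 1. It is not the pipeline used in the manuscript. The function names are hypothetical, the natural logarithm is an assumption (the response does not state the base), and genes with no detectable expression are assigned H′ = 0 so that they fall into the less pleiotropic group.

```python
import math


def shannon_index(expression):
    """Shannon's diversity index H' = -sum(p_i * ln p_i) over tissues.

    `expression` holds non-negative expression values, one per tissue.
    The natural logarithm is an assumption made for this sketch.
    """
    total = sum(expression)
    if total <= 0:
        return 0.0  # no detectable expression in any tissue
    h = 0.0
    for value in expression:
        if value > 0:
            p = value / total  # share of total expression in this tissue
            h -= p * math.log(p)
    return h


def pleiotropy_class(expression, threshold=1.0):
    """Label a gene by comparing H' with the threshold (1 in the text)."""
    if shannon_index(expression) >= threshold:
        return "more pleiotropic"
    return "less pleiotropic"


# Hypothetical expression vectors, for illustration only.
broad = [5.0] * 10                    # evenly expressed in ten tissues
narrow = [0.0, 0.0, 12.0, 0.0, 0.3]   # dominated by a single tissue
silent = [0.0] * 5                    # not expressed in any tissue

for name, profile in [("broad", broad), ("narrow", narrow), ("silent", silent)]:
    print(name, round(shannon_index(profile), 3), pleiotropy_class(profile))
```

With the natural logarithm, a gene expressed evenly across k tissues has H′ = ln k, so under this assumption at least three evenly expressed tissues are needed to reach the threshold.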

3. I am generally convinced that authors reliably identified elusive genes by identifying gene loss events in the common ancestor of multiple descendant species (to control for errors induced by assembly or annotation, Figure 1). However, Figure 7 shows the enrichment of elusive genes in human chr19. This chromosome is well known to be enriched with tandemly duplicated Krueppel-associated box C2H2 zinc-finger protein family (KZNF), many of which are primate-specific (Shao et al. 2019). I suspect that the tree-based strategy implemented in Figure 1 might not be able to dissect the evolution of this super complex gene family. I am proposing two specific analyses: How many elusive genes encoded by chr19 are KZNFs? How many of them have Ensembl one-to-one orthologs across mammals?

Responding to this comment, we investigated the KZNF genes on chromosome 19. We identified 206 such genes on chromosome 19, of which 75 were found to be elusive. Of these, 30 genes retained one-to-one orthologs in the mouse or dog. Although we excluded the nearly identical paralogs in our pipeline in the original manuscript and have refined the elusive gene set with synteny-based ortholog annotation in the current manuscript, the elusive KZNF genes have remained in this refined gene set. The elusive KZNF genes on chromosome 19 were lost in some mammalian lineages and duplicated in early primates.

Motivated by this comment, we considered the possibility that the enrichment of paralogs of the elusive genes on chromosome 19 causes the features relevant to ‘elusiveness’ to be overrepresented. We have therefore created another set of elusive genes that excludes the genes on chromosome 19 and compared the genomic, transcriptomic, and epigenomic features of the elusive and non-elusive genes. The results showed that the significant and non-significant differences were maintained except for a few cases (see Author response table 1), indicating that the characteristics remain unchanged even when chromosome 19 is excluded.

4. With the patterns in Figure 3 and 7, authors argued that features of elusive genes are deeply ancient and could be inherited from the microchromosomes of early vertebrates. This statement has multiple problems.

a) Figure 3 only shows genomic level features (e.g., high GC content) conserved in multiple vertebrates including shark and chicken. Epigenetic features analyzed in Figure 5 to 6 were only based on human data. I suggest that the authors extend these analyses to shark or chicken. Although some epigenetic data may not be available for these species, transcriptome data analyzed in Figure 5 should be available for at least some species.

In response to this reviewer’s comment, we have investigated the transcriptomic and epigenetic characteristics of the orthologs of the elusive genes in non-mammalian vertebrates as we had done for human in the original manuscript. We have retrieved gene expression profiles of normal tissues and early embryos from bulk RNA-seq data of chicken, green anole, tropical clawed frog, coelacanth, and spotted gar from the Bgee database (https://bgee.org/). We have then compared expression profiles between the orthologs of the elusive and non-elusive genes for the individual species using the same procedure as in the manuscript. Only the anole, coelacanth, and gar orthologs of the elusive genes show enrichment of low H’ values (newly created Figure 5—figure supplement 1). This indicates that the low pleiotropy of expression of the elusive genes is not always observed in the non-mammalian species.

Furthermore, we have compared epigenomic properties between the orthologs of the elusive and non-elusive genes in the chicken genome. We have retrieved ATAC-seq narrow peak data for tissues of chicken embryos from NCBI GEO and compared the density of ATAC peaks between the orthologs of the elusive and non-elusive genes. The result indicates that, in five samples out of the eight, the orthologs of the elusive genes retained more ATAC peaks than those of the non-elusive genes, and that the remainder did not show this difference (newly created Figure 6—figure supplement 5). This observation may point to a link between the reduction of the ‘elusiveness’ and the decrease of functional dispensability in gene evolution. However, it should be interpreted carefully, as different sets of tissues were used for the transcriptomic and epigenomic analyses between human and chicken. As described above, the chicken ATAC-seq experiments were mainly performed with developing embryos, while the human ATAC-seq experiments used in our study were performed with cell lines. Nevertheless, the cross-species comparison of the epigenetic features suggests that the sequence features relevant to the elusive genes do not always induce the epigenetic conditions for gene expression depletion.

We have newly created Figure 5—figure supplement 1 and Figure 6—figure supplement 5 as shown below, described the results in the newly created section Elusive gene orthologs in the chicken microchromosomes in Results, and discussed them in Discussion. Also, we described the procedures in Methods.

(P26-27, Elusive gene orthologs in the chicken microchromosomes in Results)

“Elusive gene orthologs in the chicken microchromosomes

The heterogeneous locations of the elusive genes can also be examined from a chromosome-scale viewpoint (Figure 7; Figure 7—figure supplement 1). The visualization via chromosome ideograms indicated an overlap of the elusive genes with the genomic regions enriched for the genes whose chicken orthologs are on the microchromosomes (chromosomes 11-38 and W), and this trend was statistically supported (p=0.0175; Figure 7a). Indeed, microchromosomes of the chicken and other vertebrates exhibit genomic features including high GC-content, high gene density, and rapid nucleotide substitutions in comparison with their macrochromosomes (Groenen et al., 2009; International Chicken Genome Sequencing Consortium, 2004; Schield et al., 2019; Waters et al., 2021), which also characterize genomic regions containing elusive genes. On the contrary, previous studies revealed that the chicken microchromosomes are preferentially located in the A compartments of the nucleus (Perry et al., 2020) and are early replicating (McQueen et al., 1998). These characteristics of the microchromosomes are opposite to those of the human genomic regions preferentially containing the elusive genes.

We further analyzed the ATAC-seq peaks in the chicken genome and found more peaks in the genomic regions including the elusive gene orthologs than in those containing non-elusive gene orthologs in four samples out of eight and no significant differences in the peak density in the four remaining samples (Figure 6; Figure 6—figure supplement 5). These observations indicate that, in an epigenetic manner, the chicken orthologs of the elusive genes are not regulated to reduce their expression level. This idea was further supported by a comparison of the expression profiles between the chicken orthologs of the elusive and non-elusive genes, showing no significant differences between them (Figure 5; Figure 5—figure supplement 1). Our analyses indicate that the genomic features of the elusive genes such as high GC content and high nucleotide substitution rates do not always correlate with a reduction in pleiotropy of gene expression that potentially leads to an increase in functional dispensability in the course of vertebrate evolution. In addition, avian orthologs of the elusive genes did not show higher KA and KS values than those of the non-elusive genes (Figure 3; Figure 3—figure supplement 1), likely consistent with the absence of a significant difference in gene expression levels between them in the species (Figure 5—figure supplement 1) (Cherry, 2010; Zhang and Yang, 2015). With respect to the chicken genome, the sequence features of the elusive genes might have been relaxed during evolution.”

(P32-33, Discussion) “A chromosomal-scale view of the distribution of elusive genes illuminated their significant correlation with the genes whose chicken orthologs are located on microchromosomes (Figure 7a). More importantly, genomic regions rich in elusive genes were traced back to the microchromosomes of the ancestral gnathostomes by reconstructing chromosomes of the ancestral genomes (Figure 7b). This inference of ancestral karyotypes augments our observations that some elusive natures of genomic sequences have been retained for hundreds of millions of years (Figure 3). In other words, the result suggests that the disparity of genomic regions which allows the ‘elusiveness’ for the genes has been retained during vertebrate evolution. On the other hand, comparisons of the expression profiles between the orthologs of the elusive and non-elusive genes for non-mammalian vertebrates indicate that the elusive genes are not always associated with the restricted expression profiles (Figure 5; Figure 5—figure supplement 1). Additionally, in the chicken genome, this trend in gene expression may be correlated with the abundance of ATAC-seq peaks in the elusive genes (Figure 6—figure supplement 5). These findings again suggest that the elusive genes are not always associated with a reduction in pleiotropy of gene expression, which may lead to an increase of functional dispensability during evolution. It should be noted that the choices of tissues used in these analyses were largely different between the human and non-mammalian vertebrates (Tables S3, S4). The chicken ATAC-seq data could be obtained only from developing embryos, while the human ATAC-seq in ENCODE were performed with cell lines. Therefore, the aforementioned interpretation should be treated carefully.”

(P40, Methods) “We also compared expression profiles and ATAC-seq peak densities between the orthologs of the elusive and non-elusive genes in nonmammalian vertebrates in a similar way as we did with the human datasets.

Normalized gene expression profiles from RNA-seq data of normal adult tissues and early embryos for chicken, green anole, Western clawed frog, coelacanth, and spotted gar were obtained from the Bgee version 15 database (Bastian et al., 2021) (Table S4). ATAC-seq narrow peak signals of chicken tissues were retrieved from NCBI GEO (Table S4) followed by coordination of the genome assembly with galGal5 with the UCSC liftOver tool (Hinrichs et al., 2006) as needed.”

b) In Figure 7, authors propose the concept related to microchromosomes. These chromosomes have been extensively studied, especially in birds. Some features of microchromosomes are consistent with those of elusive genes [e.g., high GC, (Bravo et al. 2021)]. However, microchromosomes are conserved in terms of gene order and their genes generally show high protein-level constraints as shown by low Ka/Ks (Waters et al. 2021; Li et al. 2022). Authors need to reconcile their discovery with the previous rich literature.

Responding to this reviewer’s comment, we have first examined whether the chicken orthologs of the elusive genes tend to be located on microchromosomes. We have classified the chicken orthologs of the elusive and non-elusive genes into those on macro- and microchromosomes. Fisher’s exact test has indicated that elusive genes are enriched in the chicken microchromosomes (p=0.0175, odds ratio=1.48; Author response table 2). In addition, the elusive genes are enriched in the genomic regions corresponding to the ancestral microchromosomes in early vertebrates (p=9.50×10⁻²⁴, odds ratio=2.31; Author response table 3).

Author response table 2. Number of chicken orthologs of elusive and non-elusive genes located on macro- and microchromosomes.

Chromosome class                 Elusive    Non-elusive
Macrochromosome (chr1-10, Z)     93         4211
Microchromosome (chr11-, W)      68         2078

Genes on non-chromosome scaffolds and in the mitochondrial genome were excluded. Fisher's exact test p=0.0175, odds ratio=1.48.

Author response table 3. Number of elusive and non-elusive genes located in the genomic regions derived from ancestral macro- and microchromosomes.

Ancestral chromosome class       Elusive    Non-elusive
Ancestral macrochromosome        395        5950
Ancestral microchromosome        296        1929

Genes in the genomic regions that did not correspond to the ancestral macro/microchromosomes were excluded. Fisher's exact test p=9.50×10⁻²⁴, odds ratio=2.31.
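
As a worked illustration of how the counts in Author response tables 2 and 3 relate to the reported statistics, the following Python sketch computes the sample odds ratio and a two-sided Fisher's exact test p-value for a 2×2 table. It is not the authors' code, and the response does not say which software was used. The function names are hypothetical, and the two-sided p-value is defined here as the sum of the probabilities of all tables that are no more likely than the observed one, which is one common convention and an assumption of this sketch.

```python
import math


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def log_hypergeom(a, row1, row2, col1):
    """Log probability of `a` in the top-left cell with all margins fixed."""
    return (log_comb(row1, a) + log_comb(row2, col1 - a)
            - log_comb(row1 + row2, col1))


def fisher_exact_two_sided(a, b, c, d):
    """Return (sample odds ratio, two-sided p) for the table [[a, b], [c, d]].

    The p-value sums the probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    log_obs = log_hypergeom(a, row1, row2, col1)
    p_value = 0.0
    for k in range(lowest, highest + 1):
        log_p = log_hypergeom(k, row1, row2, col1)
        if log_p <= log_obs + 1e-7:  # small tolerance for rounding error
            p_value += math.exp(log_p)
    odds_ratio = (a * d) / (b * c)
    return odds_ratio, min(p_value, 1.0)


# Rows: micro- then macrochromosome; columns: elusive then non-elusive.
table2 = (68, 2078, 93, 4211)    # Author response table 2
table3 = (296, 1929, 395, 5950)  # Author response table 3

for label, counts in [("table 2", table2), ("table 3", table3)]:
    odds_ratio, p_value = fisher_exact_two_sided(*counts)
    print(f"{label}: odds ratio = {odds_ratio:.2f}, p = {p_value:.3g}")
```

The odds ratio here is the simple cross-product ratio. Some statistical packages report a conditional maximum-likelihood estimate instead, which can differ slightly from this value.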

As the reviewer noted, the genomic regions including the elusive genes share several genomic characteristics with microchromosomes: high GC content, high gene density, and high KA and KS values (although KA/KS values are lower for newly identified genes in the chicken microchromosomes in Li et al., 2022). On the contrary, previous analyses showed that the chicken microchromosomes tend to be early replicating and located in the A compartment in the nucleus, where the genes are actively transcribed. These observations are concordant with our finding that the chicken orthologs of the elusive genes are rich in ATAC peaks (see the previous reply). We could not determine which epigenomic states are ancestral, but the genomic features of high GC-content and high nucleotide substitution rates are not always correlated with reduced pleiotropy of gene expression, which potentially leads to an increase in functional dispensability.

In the revised manuscript, we have created a new section, Elusive gene orthologs in the chicken microchromosomes, in Results to describe the enrichment of the elusive genes in the chicken microchromosomes and the association of their genomic characters with those of the chicken orthologs, citing the literature mentioned by the reviewer. In addition, we have mentioned the relevance of the elusive genes to the ancestral microchromosomes in early vertebrates in Discussion. The corresponding sentences in the revised manuscript are included above in the previous reply.

c) Line (L) 204, among 982 human elusive genes, only 540~390 are shared by other species (e.g., shark). I suggest taking advantage of genome-level synteny based age data generated in Shao et al. 2019 to examine the age distribution of human elusive genes. If a high proportion of them are dated as being old (e.g., shared by jawed vertebrates), the statement that these genes have an ancient origin could be better supported.

We thank the reviewer for an opportunity to revisit the age of the elusive genes. We have examined this for the revised manuscript. Though the reviewer suggested using GenTree (http://gentree.ioz.ac.cn/) for this purpose, this database consists of euteleostome animals and does not include chondrichthyans and more distantly related animals. Instead, we have used the ortholog groups of jawed vertebrates built in this study and further added older branching dates of their orthologs and paralogs from the Ensembl Gene Tree. Of the 813 elusive genes in the revised dataset, 152 retain only mammalian orthologs, and this proportion is higher than that of the non-elusive genes (65 out of 8,050, p=2.50×10⁻¹¹⁰). We have then extracted the 517 elusive and 7,900 non-elusive genes whose ancestors were dated at the early evolution of jawed vertebrates or older. This subset of old genes allowed us to examine whether the elusive genes also retain a loss-prone nature in non-mammalian vertebrates. On average, 40% of the 517 elusive genes are found to be retained in the non-mammalian vertebrates. On the other hand, more than 90% of the 7,900 non-elusive genes are retained by these species. These results indicate that while relatively young genes are frequently found in the elusive genes, the loss-prone nature of the elusive genes in the non-mammalian vertebrates is recapitulated by using this gene set of old origin. We have replaced Figure 3—figure supplement 1 with a revised version.

(P14, Tracing elusiveness back along the vertebrate evolutionary tree in Results) “We found that 152 out of 813 elusive genes originated in mammalian lineages, and this proportion was larger than that of the non-elusive genes (65 out of 8,050, p = 2.50 × 10⁻¹¹⁰), indicating that recently born genes are more abundant among the elusive genes than among the non-elusive genes. We then selected 517 elusive and 7,900 non-elusive genes that originated in the common ancestors of jawed vertebrates or earlier. These subsets allowed us to examine the degree of retention of non-mammalian vertebrate orthologs in the elusive and non-elusive genes. On average, approximately 40% of these elusive genes were found to be retained by non-mammalian vertebrates, while this proportion increased up to 90% for the non-elusive genes (Figure 3—figure supplement 1a). In the coelacanth, gar, and shark, the orthologs of the elusive genes were less frequently retained by all the species than those of the non-elusive ones (Figure 3—figure supplement 1b). The results suggest that the origins of the loss-prone propensity of the elusive genes potentially date back to the period long before the emergence of the Mammalia.”

Reviewer #3 (Recommendations for the authors):

– In recent years several gene loss pipelines were already published (e.g. TOGA – https://github.com/hillerlab/TOGA) and it would be highly beneficial for this study to compare their gene loss reports with output obtained from existing pipelines (which also address the false discovery rate issue)

Please refer to the reply to Essential revisions 1. We understand that TOGA is an ortholog search pipeline that employs synteny information and whose output includes gene loss information. TOGA takes as input genome alignments between the reference and all the target species, which would require tremendous computational time in our case involving a number of species. Instead, we have incorporated ortholog annotations implemented in RefSeq, another synteny-based ortholog detection platform, into our validation of gene loss.

– We did not appreciate Figure 5. We strongly recommend finding a more quantitative approach to visualise these results, since heat maps are misleading and show different x-axis and y-axis ranges (zoom-ins/ zoom-outs)

To improve the readability of Figure 5, we have added the percentages of the genes with H'≥1 and H'<1 to the plots and performed a statistical test of whether the fractions of genes with H'≥1 and H'<1 are equal between the elusive and non-elusive genes. Additionally, we have matched the x-axis and y-axis ranges between the elusive and non-elusive genes. We have replaced Figure 5 with a revised version containing these modified figure panels.

– All Figures and Supplementary Figures showing violin plots need to report the number of genes that underlie these distributions.

We have added the sample numbers of the plots to all figures or figure legends.

[Editors' note: further revisions were suggested prior to acceptance, as described below.]

Apparently, the manuscript has been significantly improved but there are some remaining issues that need to be addressed, as outlined below. In view of these comments, we kindly request that you consider revising the manuscript once more. Our hope is that, through this additional revision, the manuscript will be written more clearly and rigorously. Thank you for your understanding and continued efforts in improving your submission.

Reviewer #2 made the following comment. Please consider it during revision.

"The authors attempt to argue that the elusive status is ancient ("Thus, the heterogeneous genomic features driving gene fates toward loss have been in place since the ancestral vertebrates"). However, in response to my previous suggestion regarding chicken microchromosomes, the authors present mixed results. They observed high GC content, high gene density, and short gene length in chicken, similar to the findings in humans (Figure 3). Yet, the critical functional data between the two species are conflicting: human elusive genes exhibit low expression and fewer ATAC-seq peaks, while their chicken counterparts display the opposite pattern. In other words, chicken elusive genes exhibit higher pleiotropy, which may decrease the likelihood of their loss. Thus, these genes are not elusive, and high GC content, high gene density, and short gene length do not necessarily predict elusiveness. Given that the authors only analyzed the functional data of human and chicken genomes, it is not possible to determine whether the "elusive" status is ancient or derived from the human or mammalian lineage. I suggest that the authors analyze the transcriptome data in shark or spotted gar to provide further phylogenetic context. Otherwise, the authors should significantly tone down their statement."

We are aware of your concern about the inconsistency of the results between human and chicken, and therefore already included comparisons of the gene expression profiles between the orthologs of the elusive and non-elusive genes with several non-mammalian vertebrates in the previous revision (Figure 5—figure supplement 1). This comparison indicated that the orthologs of the elusive genes were rich in those with lower pleiotropy in anole, coelacanth, and gar, while this tendency was not found in chicken and frog. The result suggests that the orthologs of the elusive genes are likely to have retained the “elusive” features in the common ancestor of bony vertebrates.

In addition, regarding the reviewer’s comment, we should have clearly stated that in non-mammalians, the genes we focused on are not necessarily ‘elusive genes’ but rather ‘the orthologs of the human elusive genes’ in the previous manuscript. These orthologs do not always retain the ‘elusiveness’ as the human genes do. Interestingly, the orthologs of the elusive genes in the chicken genome do not exhibit significant differences in KA and KS values from those of the non-elusive genes, which is potentially associated with increased pleiotropy of gene expression in these genes. These observations suggest that the orthologs of the human elusive genes we identified have increased functional importance in the lineage leading to chicken.

We have revised the following parts of Results and Discussion to describe transcriptomic characteristics of the orthologs of the human elusive genes in non-mammalians more clearly in the current manuscript as included below.

Elusive gene orthologs in the chicken microchromosomes in Results

“We further compared expression profiles between the orthologs of the human elusive and non-elusive genes in several non-mammalian vertebrates and found that the orthologs of the elusive genes tend to exhibit low pleiotropy in green anole, coelacanth, and gar but not in Western clawed frog. The result suggests that the low pleiotropy of the elusive genes has persisted at least since the bony vertebrate ancestors. With respect to the chicken genome, the ‘elusive’ features of the genes orthologous to human elusive genes might have been relaxed (that is, the functional importance of the orthologs has increased) during evolution leading to chicken.”

Discussion

“On the other hand, comparisons of the expression profiles between the orthologs of the elusive and non-elusive genes for non-mammalian vertebrates suggest that the orthologs of the elusive genes have been associated with a reduction in pleiotropy of gene expression since vertebrate ancestors but acquired the diverse expressions in chicken and frog (Figure 5; Figure 5—figure supplement 1). Additionally, in the chicken genome, the diverse expressions of the chicken orthologs of the human elusive genes may be correlated with the abundance of ATAC-seq peaks (Figure 6—figure supplement 5). These findings again suggest that the chicken orthologs of the human elusive genes have increased pleiotropy of gene expression, which may lead to a lineage-specific acquisition of functional indispensability.”

The reviewing editor also has some comments on the title and abstract.

1. Please consider revising the title to "The Impact of Local Genomic Properties on the Evolutionary Fate of Genes Across Vertebrates".

Thank you for your suggestion. We would like to use the title that you suggested without the last two words (‘The Impact of Local Genomic Properties on the Evolutionary Fate of Genes’). This title is more likely to increase the accessibility of our paper to a broad readership beyond those who are primarily investigating vertebrates, potentially prompting them to investigate genomic properties associated with the fate of genes in diverse organisms such as invertebrates, fungi, and plants.

2. Please consider adding a sentence (or something similar) at the end of the abstract. "This study sheds light on the complex interplay between gene function and local genomic properties in shaping gene evolution across vertebrate lineages."

Thank you for your suggestion that strengthens our focus and conclusion. We have revised the abstract as follows.

Thus, the heterogeneous genomic features driving gene fates toward loss have been in place and may sometimes have relaxed the functional indispensability of such genes. This study sheds light on the complex interplay between gene function and local genomic properties in shaping gene evolution that has persisted since the vertebrate ancestor.

Associated Data


    Data Citations

    1. van Schaik T, Vos M, Peric-Hupkes D, Hn Celie P, van Steensel B. 2020. Data from: Cell cycle dynamics of lamina-associated DNA. 4D Nucleome Data Portal. f1218a92-1f37-4519-85d6-ccedd5f7ad39

    Supplementary Materials

    Figure 3—figure supplement 1—source data 1. A 2×2 contingency table in Figure 3—figure supplement 1.
    Supplementary file 1. Supplementary Tables S1, S3, S4.

    (a) Supplementary Table S1. Vertebrate species used for creating gene phylogenies. (b) Supplementary Table S3. ENCODE accession ID list used for epigenomic analyses. (c) Supplementary Table S4. RNA-seq and ATAC-seq samples of non-mammalian vertebrates.

    elife-82290-supp1.xlsx (38KB, xlsx)
    Supplementary file 2. Supplementary Table S2.

    Characteristics of the elusive and non-elusive genes in the human genome.

    elife-82290-supp2.zip (1.8MB, zip)
    MDAR checklist

    Data Availability Statement

    The current manuscript is a computational study, so no data have been generated for this manuscript. Data from the ENCODE Project were used and are available at https://www.encodeproject.org; the accession IDs are listed in Table S3 in Supplementary File 1.

    The following previously published dataset was used:

    van Schaik T, Vos M, Peric-Hupkes D, Hn Celie P, van Steensel B. 2020. Data from: Cell cycle dynamics of lamina-associated DNA. 4D Nucleome Data Portal. f1218a92-1f37-4519-85d6-ccedd5f7ad39


    Articles from eLife are provided here courtesy of eLife Sciences Publications, Ltd
