PLOS Biology. 2020 Nov 2;18(11):e3000862. doi: 10.1371/journal.pbio.3000862

Many, but not all, lineage-specific genes can be explained by homology detection failure

Caroline M Weisman 1, Andrew W Murray 1, Sean R Eddy 1,2,3,*
Editor: Harmit S Malik 4
PMCID: PMC7660931  PMID: 33137085

Abstract

Genes for which homologs can be detected only in a limited group of evolutionarily related species, called “lineage-specific genes,” are pervasive: Essentially every lineage has them, and they often comprise a sizable fraction of the group’s total genes. Lineage-specific genes are often interpreted as “novel” genes, representing genetic novelty born anew within that lineage. Here, we develop a simple method to test an alternative null hypothesis: that lineage-specific genes do have homologs outside of the lineage that, even while evolving at a constant rate in a novelty-free manner, have merely become undetectable by search algorithms used to infer homology. We show that this null hypothesis is sufficient to explain the lack of detected homologs of a large number of lineage-specific genes in fungi and insects. However, we also find that a minority of lineage-specific genes in both clades are not well explained by this novelty-free model. The method provides a simple way of identifying which lineage-specific genes call for special explanations beyond homology detection failure, highlighting them as interesting candidates for further study.


Lineage-specific gene families may arise from evolutionary innovations such as de novo gene origination, or may simply mean that a similarity search program failed to identify more distant homologs. A new computational method for modeling the expected decay of similarity search scores with evolutionary distance allows distinction between the two explanations.

Introduction

Homologs are genes that descend from a common evolutionary origin. “Lineage-specific genes” are defined operationally as genes that lack detectable homologs in all species outside of a monophyletic group [1]. Also referred to as “taxonomically-restricted genes” [2, 3], and as “orphan genes” when found only in a single species [4, 5], they are ubiquitous in the genomes of sequenced organisms. For example, by previous reports, 23% of Caenorhabditis elegans genes are specific to the Caenorhabditis genus [6]; 6% of honey bee genes are specific to insects [7]; 25% of ash tree genes are specific to the species [8]; and 1% of human genes are specific to primates [9].

Where do lineage-specific genes come from? A common interpretation is that they are “novel” genes. Various proposals for the molecular nature of this novelty have been advanced. For example, lineage-specific genes have been interpreted as having evolved from previously noncoding sequence (“de novo originated” genes) [5, 10, 11] and as duplicated genes that have gained a novel function, diverging radically and beyond recognition from their homologs in the process (“neofunctionalized” genes) [5, 12]. Though different in detail, these proposals share the key assumption that a lack of detectable homologs indicates some kind of biological novelty: Lineage-specific genes either have no evolutionary homologs or no longer perform the same function as their homologs outside the lineage [13–16]. We refer to these interpretations collectively as the “novelty hypothesis” of lineage-specific genes. The novelty hypothesis has informed work on the evolution of new features at molecular, cellular, and organismal scales [16–20].

An alternative explanation for a lineage-specific gene is that nothing particularly special has happened in the gene’s evolutionary history. The gene has homologs outside of the lineage (no de novo origination), and no novel function has emerged (no neofunctionalization), but despite this lack of novelty, computational similarity searches (e.g., BLAST) have failed to detect the out-of-lineage homologs. We refer to such unsuccessful searches as homology detection failure. As homologs diverge in sequence from one another, the statistical significance of their similarity declines. Over evolutionary time, with a constant rate of sequence evolution, the degree of similarity may fall below the chosen significance threshold, resulting in a failure to detect the homolog. Some lineage-specific genes may just be those for which this happens to have occurred relatively quickly, even in the absence of any novelty-generating evolutionary mechanisms.

The possibility of homology detection failure has long been recognized, but the key questions of how many and which lineage-specific genes are best explained by it remain unclear. Previous work has aimed to explicitly simulate the evolution of each gene. These approaches depend on many evolutionary parameters, to which results have proven sensitive, ranging widely within the same taxon [21–27]. A recent study elegantly avoids this problem, estimating the overall rate of detection failure from a subset of genes for which direct comparison of syntenic orthologous coding sequences is possible, allowing highly diverged orthologs to be identified. Extrapolating this rate genome-wide, it proposes that around 40% and 25% of lineage-specific genes in fungal and fly lineages are due to detection failure [28]. This estimate relies on this subset of genes being a representative sample. Additionally, the underlying method cannot assess whether a particular lineage-specific gene is due to detection failure unless the syntenic orthologous region is identifiable, which is typically not the case.

Here, we describe a simple method for evaluating whether homology detection failure is sufficient to account for a particular lineage-specific gene. We develop a mathematical model that estimates the probability that a homolog would be detected at a specified evolutionary distance if it was evolving at a constant rate under standard, novelty-free evolutionary processes. The resulting method can be used on any gene with detected homologs in at least 2 taxa. We apply the method to all such lineage-specific genes in insects and yeasts and find that many, but not all, lineage-specific genes in these taxa can be explained by homology detection failure. This method should be easily applicable to other taxa, where it can be used to determine which lineage-specific genes are unlikely to be due to detection failure, highlighting them as candidates for true evolutionary novelty.

Results

A null model of homolog detectability decline as a function of evolutionary distance

We developed a formal test of the null hypothesis that homology detection failure is sufficient to explain the lineage specificity of a gene. Specifically, we model the scenario in which the gene actually existed in a deeper common ancestor, evolved at a constant rate, and has homologs outside the clade in which its homologs are detected that appear to be absent solely due to homology detection failure. This is an evolutionary null model: It invokes no processes beyond the simple scenario of orthologs diverging from a common ancestor and evolving at a constant rate.

Because of its use in previous work on lineage-specific genes and in sequence analysis more broadly, we use BLASTP (available from the National Center for Biotechnology Information, https://blast.ncbi.nlm.nih.gov/Blast.cgi) as the search program used to detect homologs here. In search programs like BLAST, sequence similarity is used to infer homology between 2 genes. Such programs report a similarity score (referred to as “bitscore” by BLAST) between a pair of sequences, as well as the number of sequences that would be expected to achieve that similarity score by chance (an E-value). When this number falls below a significance threshold (e.g., E < 0.001), statistically significant similarity is interpreted as evidence that the 2 genes are homologous. The similarity score therefore directly determines whether a homolog is successfully detected in a search.

The key idea in our method is to predict how the similarity score between 2 homologs evolving according to our null model is expected to decline as a function of the evolutionary distance between them. We can then ask whether a given gene’s lack of detectable homologs outside of the lineage is expected under this null evolutionary model.

We analytically modeled how the similarity score between 2 homologs decays with the evolutionary distance between them. A simple argument shows that similarity scores are expected to decline roughly exponentially with evolutionary time. Suppose we assume that the similarity score between 2 homologs is proportional to the percent identity between them and that every position in the protein changes at the same protein-specific rate, which is constant over evolutionary time. Then the expected similarity score S between 2 homologs separated by an evolutionary divergence time t is given by $S(t) = L e^{-Rt}$, and the variance of this similarity score is given by $\sigma^2 = L (1 - e^{-Rt}) e^{-Rt}$ (S1 Supplemental Information). The protein-specific parameter L is related to the protein’s length. The protein-specific parameter R is related to the rate at which the protein accumulates substitutions over evolution and so incorporates protein-specific factors that contribute to that rate, including mutation rate at the locus and the effects of selection on the protein.
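As a minimal sketch of the two formulas above, the following Python functions evaluate the expected score and its predicted standard deviation for a given L, R, and t. The function names and the parameter values in the loop are hypothetical and chosen only for illustration; they are not taken from the authors' software.

```python
import math


def expected_score(L, R, t):
    """Expected similarity score under the null model: S(t) = L * exp(-R * t)."""
    return L * math.exp(-R * t)


def score_sd(L, R, t):
    """Predicted standard deviation: sqrt(L * (1 - exp(-R*t)) * exp(-R*t))."""
    p = math.exp(-R * t)
    return math.sqrt(L * (1.0 - p) * p)


# Hypothetical parameter values, for illustration only.
L, R = 500.0, 2.0
for t in (0.0, 0.1, 0.5, 1.0):
    print(t, round(expected_score(L, R, t), 1), round(score_sd(L, R, t), 1))
```

At t = 0 the expected score equals L and the predicted variance is zero, which matches the interpretation of L as the score of the protein compared with itself.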

Although an exact derivation of this function assumes a substitution-only process with a single constant position-independent substitution rate, the same functional form will approximate the effects of site-specific rates (S1 Supplemental Information) and position-specific insertion/deletion. Whatever the detailed position-specific selection pressure on the protein, if it is constant over time, we expect similarity scores to decline roughly exponentially. We can empirically estimate this exponential by fitting L and R to observed scores at different divergence times. By subsuming the complex effects of selective pressure into a 2-parameter empirical model of similarity score decline, we avoid the need for a parameter-heavy model of sequence evolution. Minimizing the number of parameters in the model allows us to apply it to genes with a limited number of identified homologs (genes specific to very young lineages) while minimizing the problems of results being sensitive to parameter estimation that occur in complex models. The assumption that rate R is the same across evolutionary time and in all lineages is also clearly a simplification, but this is the null hypothesis that we aim to test: We aim to identify genes in which a lack of detected homologs is consistent with an expected constant decay of similarity scores with time, without any need to invoke lineage-specific appearance or rate shifts.

We can predict similarity scores for a given gene if we have 3 inputs: a gene from a chosen focal species, the similarity scores of successfully identified homologs of the gene in at least 2 other taxa (S), and the relative evolutionary distances between the focal species and these other species (t). (As described in the following section and Methods, we precalculate these evolutionary distances t from an aggregate of many genes from the set of species under consideration, and therefore, they do not depend on the particular gene under consideration.) We use these inputs to find the gene-specific values of the parameters L and R that produce the best fit to our equation describing how similarity scores decline within the species in which homologs were detected. We then use these parameters to extrapolate and predict the expected similarity score of hypothetical homologs of the gene at evolutionary distances beyond those of the species whose homologs were used in the parameter fitting. Given an E-value threshold, this predicted similarity score, and the expected variance of the similarity score, we can estimate the probability that a homolog will be undetected at these longer evolutionary distances. In the analyses that follow, we use a relatively permissive E-value threshold of 0.001.
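One simple way to obtain L and R from observed scores is an ordinary least-squares fit of ln S against t, since the model implies ln S(t) = ln L − Rt. The sketch below takes that route using only the Python standard library. It is an assumption made for illustration: the authors' implementation may fit the exponential differently (for example, by nonlinear least squares), and the function name and synthetic data are hypothetical.

```python
import math


def fit_L_R(distances, scores):
    """Least-squares fit of ln S = ln L - R * t; returns (L, R)."""
    n = len(distances)
    log_scores = [math.log(s) for s in scores]
    mean_t = sum(distances) / n
    mean_y = sum(log_scores) / n
    sxx = sum((t - mean_t) ** 2 for t in distances)
    sxy = sum((t - mean_t) * (y - mean_y)
              for t, y in zip(distances, log_scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    return math.exp(intercept), -slope


# Synthetic example: focal species (t = 0) plus 2 close relatives,
# with scores generated from hypothetical values L = 400, R = 3.
distances = [0.0, 0.02, 0.09]
scores = [400.0 * math.exp(-3.0 * t) for t in distances]
L_hat, R_hat = fit_L_R(distances, scores)

# Extrapolate to a more distant, hypothetical outgroup at t = 0.5.
print(L_hat, R_hat, L_hat * math.exp(-R_hat * 0.5))
```

The fitted parameters can then be used to extrapolate the expected score at any larger distance, which is the step that the detectability prediction depends on.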

This key idea is illustrated in Fig 1, which shows examples of fitting similarity scores versus evolutionary distance for several different yeast and insect genes.

Fig 1. Fits of the null model of similarity score decline with evolutionary distance, as defined in the text, for 3 representative proteins from Saccharomyces cerevisiae (a) and Drosophila melanogaster (b).

Colored points represent the BLASTP score between the protein and its ortholog in the species that is at the evolutionary distance indicated on the x-axis. Tick marks on the x-axis represent each of the species used here. For visual clarity, only some species names and evolutionary distances are included, indicated with black tick marks; gray tick marks represent the other unlabeled species. The dashed line represents the detectability threshold, the score below which an ortholog would be undetected at our chosen E-value of 0.001. The best-fit values of a and b are shown for each protein. The r2 value is also shown and was calculated from a linear regression of the log of the similarity score versus evolutionary distance. All data in these figures are available at https://github.com/caraweisman/abSENSE, under Fungi_Data (panel a) and Insect_Data (panel b).

The null model adequately describes the decay of ortholog detectability with evolutionary distance

We applied our model to genes of the yeast S. cerevisiae and the fly D. melanogaster and their orthologs in several fungal and insect outgroups, respectively. We focus on the fungi and insects because their genomes are well-annotated, they have closely related and well-annotated sister species, and they have been the focus of previous work on lineage-specific genes [5, 11, 29–32]. For S. cerevisiae, we included 11 fungal species spanning a divergence time of approximately 600 million years [33, 34]; for D. melanogaster, we included 21 insect species spanning a divergence time of approximately 400 million years [35]. These species are listed in Fig 2.

Fig 2. Inferred evolutionary distances between each fungal species and S. cerevisiae (a) and each insect species and D. melanogaster (b).

The tree topologies for these taxa are based on previously published studies [34, 35] and were not calculated here; branch lengths are not to scale. The fungal sensu stricto lineage, referenced frequently in the text, is shaded in yellow.

Before using our null model to ask whether it explains the lack of detected homologs of lineage-specific genes, we confirmed that it is a good approximation of how similarity scores decay with evolutionary distance. To do this, we tested how well the model represents the decay of similarity scores of general S. cerevisiae and D. melanogaster genes in increasingly distant species. If the model fits this decay well for most genes, it is likely a good representation of the minimal evolutionary process in the null hypothesis and can therefore detect deviations from that process.

To obtain evolutionary distances from the focal species (t values), represented by the x-axis in Fig 1, we used 102 genes from the Benchmarking Universal Single Copy Ortholog (“BUSCO”) [36] database to calculate evolutionary distances in substitutions/site between S. cerevisiae and each of the 11 other fungi and 125 BUSCO genes to calculate evolutionary distances between D. melanogaster and each of the 21 other insects (Methods). We note that, because this approach directly calculates the pairwise distance in substitutions/site between the focal species and each other species, it incorporates changes in evolutionary rates across lineages without assuming a lineage-invariant molecular clock. In both taxa, to show that distances can be reliably computed using a small number of genes, we also re-calculated these distances using 2 random subsets of 15 BUSCO genes. Distances computed from these different gene sets were similar (S1 Table). Fig 2 shows evolutionary distances inferred from one of the 15 gene sets between the focal organism S. cerevisiae and the 11 other fungi and between the focal organism D. melanogaster and the 21 other insects. For reference, Fig 2 depicts these distances along with a topology taken from previous phylogenetic studies of these taxa [34, 35]; branch lengths are not to scale. We use these distances, computed from one of the 15 gene subsets, in all results presented in the main text beyond this point.

We next took all annotated S. cerevisiae and D. melanogaster proteins (S2 Table) and identified the similarity scores of their detectable orthologs in each of the 11 other fungal and 21 other insect outgroup species, respectively. (For S. cerevisiae and D. melanogaster, the score is the comparison of the protein with itself.) We identified orthologs using reciprocal best BLASTP search with a threshold of E < 0.001 (Methods). Reciprocal best BLASTP is not a perfect means of distinguishing orthologs from paralogs, and results in some genes failing to be assigned to orthologs in some species, but it suffices for the purpose and is easy to do at scale.
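The reciprocal best hit criterion can be sketched as follows. The code assumes that the best-scoring significant hit (E < 0.001) for each protein has already been extracted from BLASTP output in both search directions; the dictionaries, function name, and gene identifiers are hypothetical placeholders, not data from the study.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return {a: b} for pairs where b is a's best hit and a is b's best hit.

    Each argument maps a query protein ID to the ID of its best-scoring
    significant hit in the other proteome.
    """
    return {
        a: b
        for a, b in best_a_to_b.items()
        if best_b_to_a.get(b) == a
    }


# Hypothetical identifiers, for illustration only.
focal_to_other = {"geneA": "x1", "geneB": "x2", "geneC": "x2"}
other_to_focal = {"x1": "geneA", "x2": "geneB"}

print(reciprocal_best_hits(focal_to_other, other_to_focal))
```

In this toy input, geneC is not assigned an ortholog because its best hit, x2, points back to geneB. This is the kind of case in which the procedure leaves a gene without an assigned ortholog in a species.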

With these similarity scores (S) and evolutionary distances (t) in hand, we tested how well our model explains the observed decline in similarity scores with increasing evolutionary distance in fungal and insect genes. Our model predicts a linear relationship between the log of ortholog similarity scores and evolutionary distance. We therefore assessed the fit of the model by performing a linear regression of the log of each protein’s similarity score, ln S(t), against the inferred evolutionary distance to the focal species, t, and computing the square of the Pearson correlation coefficient (r2), which measures how much of the variance in ln S(t) is explained by t.
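The goodness-of-fit measure described above can be sketched as the squared Pearson correlation between t and ln S. The function name and the example scores below are hypothetical and serve only to show the calculation.

```python
import math


def r_squared_log_linear(distances, scores):
    """Squared Pearson correlation between distance t and ln S(t)."""
    n = len(distances)
    log_scores = [math.log(s) for s in scores]
    mean_t = sum(distances) / n
    mean_y = sum(log_scores) / n
    sxx = sum((t - mean_t) ** 2 for t in distances)
    syy = sum((y - mean_y) ** 2 for y in log_scores)
    sxy = sum((t - mean_t) * (y - mean_y)
              for t, y in zip(distances, log_scores))
    return (sxy * sxy) / (sxx * syy)


# Hypothetical scores for one protein at 5 evolutionary distances.
distances = [0.0, 0.1, 0.3, 0.6, 0.9]
scores = [820.0, 640.0, 410.0, 190.0, 95.0]
print(r_squared_log_linear(distances, scores))
```

A value near 1 indicates that the protein's scores decline close to exponentially with distance, as the null model expects.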

The model predicts similarity scores reasonably well. For S. cerevisiae genes, the mean and median r2 were 0.92 and 0.95, respectively (S1 Fig). For D. melanogaster proteins and their orthologs in the other insects, the mean and median r2 were 0.84 and 0.91, respectively (S1 Fig). Results were similar using the 2 other sets of estimated distances (S1 Fig).

As well as considering the fit of each gene to the expected value of the model, we tested how well our estimate for the variance of the similarity score captured the observed scatter around this expected value. To do this, for the ortholog of each S. cerevisiae gene in each species, we calculated the difference between the actual and expected similarity score and expressed it as a multiple of the predicted standard deviation $\sigma = \sqrt{L (1 - e^{-Rt}) e^{-Rt}}$ of the similarity score (a Z-score). We expect these Z-scores to follow a normal distribution if our model’s estimated variance is correct, which is roughly what we observe (S2 Fig). Approximately 92% of S. cerevisiae orthologs have observed scores within 3 SD of the prediction; for a standard normal distribution, 99% are expected. Of the remaining 8% of scores, 7% are below 3 SD, and 1% are above 3 SD. Results in D. melanogaster are similar: 88% have observed scores within 3 SD, 8% below, and 4% above. There is some skew toward predicted scores that are higher than observed scores. We attribute this to the fact that our model neglects how insertions and deletions may disrupt the length of a local alignment. Results were similar when using the 2 other sets of estimated distances (S2 Fig).
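The Z-score described above can be computed as in the following sketch, which uses the model's expected score and predicted standard deviation. The function name and numerical values are hypothetical; they are chosen so that the arithmetic is easy to check by hand.

```python
import math


def z_score(observed, L, R, t):
    """(observed - expected) / predicted SD under the null model."""
    p = math.exp(-R * t)
    expected = L * p
    sd = math.sqrt(L * (1.0 - p) * p)
    return (observed - expected) / sd


# Hypothetical values: with L = 400, R = 1, and t = ln 2, exp(-R*t) = 0.5,
# so the expected score is 200 and the predicted SD is sqrt(400 * 0.25) = 10.
t = math.log(2.0)
print(z_score(170.0, 400.0, 1.0, t))
```

Note that the predicted SD is zero at t = 0, so a Z-score is only defined for comparisons with species other than the focal species itself.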

Our method strictly requires only 2 orthologs, possibly covering only a very short evolutionary distance, to estimate L and R. To assess how well these parameters can be estimated in this worst-case scenario, we next asked how similar the values of L and R inferred from only 2 species very closely related to S. cerevisiae (S. paradoxus and S. mikatae, at evolutionary distances of 0.02 and 0.09 substitutions/site, respectively) were to those inferred above, using all orthologs available in the full set of 11 fungal species, including the most distant species Schizosaccharomyces pombe at a distance of 0.92 substitutions/site. We find that concordance between these estimates is good. The r2 and average percent difference between these estimates are 0.99 and 0.6%, respectively, for the L parameter and 0.78 and 8% for the R parameter.

Finally, we asked whether the best-fit values of the parameters L and R found for the fungal proteins are correlated with the interpretation of these parameters in our model. We expect a protein’s value of L to be related to its length and R to be related to evolutionary rate. For all S. cerevisiae genes, we plotted L versus protein length and R versus evolutionary distance in substitutions/site from multiple alignments of each protein from S. cerevisiae and the 4 most closely related species (Methods). The L parameter is indeed highly correlated with gene length (r2 = 0.99), and the R parameter is more weakly correlated with gene evolutionary rate (r2 = 0.51) (S3 Fig). We attribute some of this lower correlation to the fact that R, which describes how quickly score declines, includes the effects of insertions and deletions as well as substitutions, whereas standard measures of evolutionary rate derived from alignments (like the gene evolutionary rates we calculated to compare with R) only consider substitutions. The distributions of the estimated L and R parameters across all genes are long-tailed and approximately log-normal (S4 Fig), consistent with other analyses of distributions of gene length [37] and evolutionary rate [38].

Many lineage-specific genes can be explained by homology detection failure

Having validated our null model for similarity score decline, we then focused on lineage-specific genes and used the model to ask our central question: How often is homology detection failure alone enough to explain a lineage-specific gene?

We first considered annotated S. cerevisiae proteins that are lineage-specific to the sensu stricto yeasts, a young lineage sharing a common ancestor approximately 20 million years ago (Mya) containing the 5 species S. cerevisiae, S. paradoxus, S. mikatae, S. bayanus, and S. uvarum (Fig 2a), which has been the focus of previous work on lineage-specific genes [11, 31]. We identified 375 such sensu stricto–specific genes, defined as having homologs detectable by BLASTP in at least one of these species but lacking detectable homologs in the nearest outgroup S. castellii or in any other outgroups according to a permissive E-value threshold of 0.001 (Methods). Between 40% and 70% of sensu stricto–specific genes identified in 2 previous studies are included in this set [11, 31]. The remainder were excluded either because they are open reading frames (ORFs) marked as dubious in both the Saccharomyces Genome Database and Refseq, and so have been removed from the S. cerevisiae Refseq annotation and were not used in our initial search, or because we detected homologs outside of the sensu strictos, likely due to our permissive E-value threshold. Because our detectability model is regression-based, a minimum of 3 observed homologs (including the gene in the focal species) are required; for example, we could not perform this computation on the S. cerevisiae gene BSC4 [39], proposed to have a very recent de novo origin and thus only found in S. cerevisiae. We applied our model to the 155 sensu stricto–specific proteins that met this requirement.

For each of these 155 lineage-specific genes, we used the best-fit values of the L and R parameters found here previously to extrapolate and predict the score of an ortholog at the evolutionary distance of S. castellii under the null model. Using parameters from the sensu stricto lineage to extrapolate to more distant species corresponds to assuming that these 2 groups of orthologs have evolved in the same manner since their divergence from their common ancestor. Finally, we calculated the probability that a homolog at the evolutionary distance of S. castellii would be detected, P(detected | null model, tcastellii), by using our model for similarity score variance to generate a probability distribution for the score and computing the percentage of the probability mass in this distribution above our chosen detectability threshold (corresponding to an E-value of 0.001).
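The final step can be sketched as follows, under the assumption that the score at the outgroup distance is normally distributed around the model's expected value with the model's predicted standard deviation. The normal approximation, the function name, the parameter values, and the threshold value of 37 are all assumptions made for illustration and are not taken from the authors' code.

```python
import math
from statistics import NormalDist


def prob_detected(L, R, t, threshold):
    """P(score >= threshold) at distance t, using a normal approximation."""
    p = math.exp(-R * t)
    mean = L * p
    sd = math.sqrt(L * (1.0 - p) * p)
    if sd == 0.0:
        # Degenerate case (t = 0): the score is exactly L.
        return 1.0 if mean >= threshold else 0.0
    return 1.0 - NormalDist(mean, sd).cdf(threshold)


# Hypothetical short, fast-evolving protein versus a long, slow-evolving one,
# both evaluated at a hypothetical outgroup distance with a threshold of 37.
print(prob_detected(L=50.0, R=5.0, t=1.0, threshold=37.0))
print(prob_detected(L=1000.0, R=0.5, t=0.5, threshold=37.0))
```

With these inputs the short, fast-evolving protein has a detection probability close to 0 and the long, slow-evolving protein has a probability close to 1, which mirrors the contrast the analysis is designed to draw.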

This analysis is illustrated for one example of a sensu stricto–restricted S. cerevisiae protein, Uli1, in Fig 3. Uli1 has been implicated in the unfolded protein response [40], making it one of only a few sensu stricto–specific genes with experimental evidence of function, and its lineage specificity has prompted previous studies to propose that it originated de novo [11, 31]. However, we find that the probability that an ortholog of this gene would be detectable in S. castellii, P(detected | null model, tcastellii), is approximately 0, indicating that a null evolutionary model is sufficient to explain the lineage specificity of this short and rapidly evolving gene.

Fig 3. Illustration of the prediction of detectability decline for the S. cerevisiae protein Uli1, displayed as in Fig 1.

At the evolutionary distance of the nearest outgroup S. castellii, the entire prediction interval lies below the detectability threshold, indicating an approximately 0% probability that an ortholog would be detected under the null model even if an S. castellii ortholog were present. Data in this figure are available at https://github.com/caraweisman/abSENSE/tree/master/Fungi_Data.

The result of performing this test on all of the 155 sensu stricto–specific genes amenable to our analysis is shown in Fig 4a, which depicts the distribution of probabilities of detecting a homolog in the outgroup S. castellii given the null model and the evolutionary distance between S. cerevisiae and S. castellii, P(detected | null model, tcastellii). Many genes have a very high probability of being undetected, and a majority are more likely to be undetected than detected: 55% have P(detected | null model, tcastellii) below 0.05, and 73% are below 0.5. This implies that homology detection failure is sufficient to explain a large number, potentially a majority, of these lineage-specific genes. The fact that homologs of these genes are detected only in sensu stricto species therefore does not require invoking evolutionary novelty.

Fig 4. Distributions of detectability prediction results for 3 yeast lineages (a, b, c).

Top: results for all lineage-specific genes. Middle: results of the same analysis for all non-lineage-specific genes, which serve as a positive control. These genes, which have detectable orthologs outside of the lineage, should be predicted to be detected, which they largely are. Bottom: depiction of the lineage (yellow) and closest outgroup (blue) considered in the analyses in the corresponding column. In c), note that Yarrowia lipolytica is the topological outgroup to the shaded lineage but is not the closest species by evolutionary distance (branch lengths are not to scale). Data in this figure are available at https://github.com/caraweisman/abSENSE/tree/master/Fungi_Data.

We repeated this procedure for D. melanogaster genes restricted to the Drosophila genus. This young lineage shared a common ancestor approximately 70 Mya, with the housefly Musca domestica as the nearest outgroup in our analyses (Fig 5a). We identified 1,611 Drosophila-restricted genes (Methods), of which 1,278 had the 2 identified orthologs in the Drosophila lineage required for our analysis. Again, many of these Drosophila-restricted genes are very likely to be undetected: 46% have values of P(detected | null model, tdomestica) below 0.05, and 76% are below 0.5 (Fig 5a). Homology detection failure is therefore also sufficient to explain many lineage-specific genes in this group.

Fig 5. Distributions of detectability prediction results for 3 insect lineages (a, b, c).

Top: results for all lineage-specific genes. Middle: results of the same analysis for all non-lineage-specific genes, which serve as a positive control. These genes, which have detectable orthologs outside of the lineage, should be predicted to be detected, which they largely are. Bottom: Depiction of the lineage (yellow) and closest outgroup (blue) considered in the analyses in the corresponding column. In a), note that Ceratitis capitata is the topological outgroup to the shaded lineage, but is not the closest species by evolutionary distance (branch lengths are not to scale). Data in this figure are available at https://github.com/caraweisman/abSENSE/tree/master/Insect_Data.

As both the sensu stricto yeasts and the drosophilid flies are relatively young lineages, we asked whether these results generalize to older lineages. In fungi, we tested 2 additional lineages with divergence times of approximately 70 Mya (Fig 4b) and approximately 250 Mya [33] (Fig 4c). In insects, we also tested 2 additional lineages, with divergence times of approximately 150 Mya (Fig 5b) and approximately 350 Mya [35] (Fig 5c). We identified all genes specific to each of these 4 additional lineages and then calculated P(detected | null model, toutgroup) for all genes with the required 2 identified orthologs, exactly as described for the 2 aforementioned lineages. Results in all of these comparisons are very similar to those in the younger lineages tested here: we predict that a large number of lineage-specific genes have very low probabilities of being detected, with a majority more likely to be undetected than detected (Figs 4b, 4c, 5b and 5c). Homology detection failure is thus sufficient to explain a large number of lineage-specific genes in these older lineages as well. All genes specific to the 6 lineages considered here and their values of P(detected | null model, toutgroup) can be found in S3 Table.

As a control, we asked our model to predict the probability of detecting homologs of genes that are not lineage-specific, meaning that these genes have homologs that are detected both inside and outside of the lineage. We repeated the same procedure on all non-lineage-specific genes in the 6 lineages tested here. As we did for the lineage-restricted genes, we used only similarity scores from orthologs within the given lineage to calculate the probability of detecting homologs in the nearest outgroup to the lineage, P(detected | null model, toutgroup). If our model operates correctly, it should predict high values of P(detected | null model, toutgroup) for these genes, because their homologs are, in fact, detected. In accordance with this expectation, our model predicts that the vast majority (>97% in all lineages) of these genes have a very high probability of being detected, P(detected) > 0.95 (Figs 4 and 5). This analysis, like earlier analyses, was robust to the use of different sets of genes for calculating evolutionary distances (S4 Table).

We separately considered the set of 784 S. cerevisiae genes marked as “dubious” in the Saccharomyces Genome Database [41]. Although they have been deemed unlikely to encode functional proteins, many of them are lineage-specific and so have been included in previous studies as potentially novel genes [11]. We analyzed the 167 of these dubious genes that met our analysis requirement of having detected orthologs in at least 2 other species (many are unique to S. cerevisiae). We find that homologs of these genes would be undetected at an even higher rate than for validated genes; in all 3 fungal lineages, at least 99% of these dubious ORFs have P(detected | null model, toutgroup) below 0.5, and at least 80% are below 0.05 (S5 Table).

Another recent paper used a different approach to estimate the fraction of lineage-specific genes that are attributable to homology detection failure [28]. Vakirlis and colleagues used a small set of genes in “microsyntenic blocks” to count how often a gene is not recognizably similar to its presumptive homolog in the syntenic position in a comparative genome. Assuming that this sample is representative, and so approximates the frequency at which homologous genes in general diverge beyond recognition, they conclude that 20%–45% of lineage-specific genes in fungal, insect, and vertebrate phylogenies are attributable to homology detection failure. We consider this result to be in qualitative agreement with ours, concluding that homology detection failure generates a substantial fraction of lineage-specific genes. However, we produce somewhat larger estimates of the rate of detection failure. We investigated the cause of this discrepancy in additional analyses described in detail in S1 Supplemental Info. Briefly, we find that genes in microsyntenic regions evolve more slowly than those outside of such regions, so using this subset of genes leads to a lower inferred rate of detection failure. Consistent with this observation, when we restrict our analysis in fungi to consider only those lineage-specific genes that are within microsyntenic regions, we find a lower rate of homology detection failure that is approximately consistent with the estimate of Vakirlis and colleagues.

More sensitive homology searches detect beyond-lineage homologs for many lineage-specific genes well-explained by homology detection failure

If a gene being lineage-specific is due to the failure of BLASTP to detect homologs that are in fact present, we would expect that a more sensitive search will sometimes succeed in finding homologs where BLASTP did not. We asked whether this was the case for genes whose lineage specificity was consistent with the hypothesis of detection failure: Can we use a more sensitive method to find previously undetected homologs for these genes? We refer to such homologs, detected using a different method in species outside of the originally defined lineage, as “beyond-lineage homologs.”

We used sensu stricto yeast–specific genes as a case study to ask this question. These yeasts and several of their nearest outgroups have a high degree of conservation of chromosomal gene order (synteny), presenting the opportunity for a more sensitive search. A standard similarity search tests all proteins in a large database of sequences, such as a complete proteome. The resulting multiple testing burden requires a higher score to achieve statistical significance than would be required for a search over a smaller number of sequences. In these yeasts, synteny allows us to restrict a similarity search to 1 candidate gene at the orthologous chromosomal locus, reducing the multiple testing burden and enabling ortholog identification with a lower score. For the fungal species used here, a proteome-wide search would need a BLASTP score of approximately 37 to achieve an E-value of 0.001, but a single-protein search would only require a score of approximately 24. Orthologs with scores between these 2 values would be missed in our initial search but successfully detected with synteny-guided similarity searches.
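
The dependence of the score threshold on search-space size can be illustrated with the standard Karlin-Altschul relation between bit score and E-value, E = m × n × 2^(−S), where m and n are the query and database lengths. The following is a minimal sketch under that assumption only. The function name and the sequence lengths are hypothetical, and BLASTP's effective-length corrections are ignored, so the sketch is not expected to reproduce the exact thresholds of 37 and 24 bits quoted above.

```python
import math

def bits_needed(query_len, db_len, evalue=0.001):
    """Bit score S at which E = query_len * db_len * 2**(-S) equals `evalue`.

    Assumes the standard Karlin-Altschul bit-score relation and ignores
    BLASTP's effective-length corrections.
    """
    return math.log2(query_len * db_len / evalue)

# Hypothetical lengths, chosen only for illustration.
query_len = 300      # residues in the query protein
protein_len = 300    # residues in a single candidate protein
n_proteins = 6000    # rough size of a yeast proteome (assumption)

single = bits_needed(query_len, protein_len)
proteome = bits_needed(query_len, protein_len * n_proteins)

# Searching N equally sized proteins instead of 1 raises the threshold
# by log2(N) bits, whatever the individual lengths are.
extra_bits = proteome - single
print(f"single protein: {single:.1f} bits")
print(f"whole proteome: {proteome:.1f} bits")
print(f"difference:     {extra_bits:.1f} bits")
```

Under these assumptions the gap between the two thresholds depends only on the number of database proteins, which is why restricting the search to one syntenic candidate lowers the score required for significance.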

We used this strategy to search for beyond-lineage orthologs for all sensu stricto–specific genes for which the null model of detection failure is a reasonable explanation. We use a threshold of P(detected | null model) < 0.95 to define these genes. This choice is a conservative threshold that corresponds to genes that are insignificant according to a traditional significance test threshold of P(undetected | null model) = 1 − P(detected | null model) > 0.05. There are 126 sensu stricto–specific genes that pass this threshold.
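
As a minimal sketch of how this threshold partitions lineage-specific genes, the snippet below labels genes by their detection probability. The function name, gene names, and probabilities are hypothetical. Treating a value of exactly 0.95 as "poorly explained" is an arbitrary choice made here for illustration.

```python
def classify(p_detected, threshold=0.95):
    """Label a lineage-specific gene by P(detected | null model)."""
    if not 0.0 <= p_detected <= 1.0:
        raise ValueError("p_detected must be a probability")
    if p_detected < threshold:
        # Detection failure is a reasonable explanation for the missing homolog.
        return "consistent with detection failure"
    # The null model predicts a homolog should have been detected.
    return "poorly explained by detection failure"

# Hypothetical genes and probabilities, for illustration only.
example = {"geneA": 0.12, "geneB": 0.80, "geneC": 0.999}
labels = {gene: classify(p) for gene, p in example.items()}
for gene, label in labels.items():
    print(gene, label)
```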

To identify the orthologous locus in outgroup yeasts for these 126 S. cerevisiae genes, we used the Yeast Gene Order Browser (YGOB), an online resource that curates the chromosomal orthology relationships between species including the sensu stricto yeasts, S. castellii, Kluyveromyces waltii, Ashbya gossypii, and K. lactis [42]. Of these 126 sensu stricto–specific genes, 19 are included in YGOB and have an orthologous locus in at least one of these outgroup yeasts. For all of these genes, the upper bound of the 99% prediction interval for the similarity score predicted by our model is above the detectability threshold of 24 bits, indicating that they are potentially detectable by this analysis. Of these 19 genes, 17 had an annotated gene at the orthologous locus in at least one outgroup species. For 11 of these, at least one of these genes at an outgroup orthologous locus had significant detectable similarity (E < 0.001) to the S. cerevisiae gene. In all but 2 of these cases, the similarity score fell within our prediction interval (in those 2 cases, the similarity score was slightly higher than predicted). These 11 genes and their proposed orthologs are listed in S5 Table.

In total, we found beyond-lineage homologs for 46% of genes for which we were able to perform a synteny analysis. We note that this is a conservative estimate. We only considered ORFs that are already annotated in outgroup species, although unannotated orthologs may be present. Additionally, the lower bound of the 99% similarity score prediction interval for all remaining 54% of these genes is lower than the threshold required for detection via synteny, so that all have some probability of orthologs still being missed in this analysis.

Some lineage-specific genes are poorly explained by homology detection failure

In all lineages studied here, there are also lineage-specific genes that are poorly explained by the null hypothesis: Their similarity scores decline too slowly to make homology detection failure alone a good explanation for their lineage specificity. These are the genes with high values of P(detected | null model). In all 6 lineages we studied, 10%–20% of lineage-specific genes have detection probabilities of 0.95 or greater (Figs 4 and 5).

This result is illustrated by one such sensu stricto–specific protein, Spo13, in Fig 6. Spo13 has been proposed as a candidate de novo gene [31] by virtue of its lineage specificity, and this analysis highlights it as a particularly promising novel gene candidate among the large number of other lineage-specific genes in the sensu stricto lineage.

Fig 6. Detectability prediction results for the S. cerevisiae protein Spo13, displayed as described in Fig 3.

At the evolutionary distance of the nearest outgroup S. castellii, the entire prediction interval lies well above the detectability threshold, indicating an approximately 100% probability that an ortholog should be detected in this species under the null model. Data in this figure are available at https://github.com/caraweisman/abSENSE/tree/master/Fungi_Data.

The existence of lineage-specific genes like Spo13, which our null model predicts should have detectable homologs outside of the lineage, indicates that evolutionary mechanisms beyond those included in the null model may be operating. Among such mechanisms are those postulated by the novelty hypothesis, like de novo origination and duplication-induced neofunctionalization. However, other known mechanisms could also explain such genes. These include processes that cause the gene tree to deviate from the species tree, like horizontal gene transfer, and any mechanisms that change the evolutionary rate of a protein on a restricted part of the tree.

Characterization of yeast lineage-specific genes that are poorly explained by homology detection failure

We next aimed to characterize genes whose lineage specificity is poorly explained by homology detection failure. We again used sensu stricto–specific genes as a case study, allowing for synteny analysis and the biological insight provided by many genes in S. cerevisiae being comparatively well-studied. We selected the subset of sensu stricto–specific genes, including Spo13, whose lineage-specificity is poorly explained by homology detection failure, i.e. for which P(detected | null model) > 0.95. These are genes on the other side of the threshold applied above: the null hypothesis strongly predicts that homologs should be detected, making their lineage specificity incompatible with the null hypothesis. There are 25 sensu stricto–specific genes that satisfy this threshold. Although a thorough study of these genes is beyond our scope, we report a few initial observations.

“De novo origination,” the process of a new gene emerging from previously noncoding sequence, is a commonly proposed origin of lineage-specific genes [12]. We asked how many of these 25 lineage-specific genes could plausibly be such de novo genes. By definition, genes that have emerged de novo in the sensu stricto lineage should have no out-of-lineage homologs, and so a more sensitive synteny-based homology search strategy should fail to find such homologs. We performed a synteny-based search for out-of-lineage homologs for these 25 genes in the same way as described for genes that are well-explained by detection failure. For 20 of these 25 genes, an orthologous locus is listed in YGOB. Of these 20, 12 have annotated genes with significant similarity (E < 0.001) at the orthologous locus in at least 1 outgroup species. Thus, 12 of the 25 genes that are not well explained by homology detection failure, or just under half, did not originate de novo in the sensu stricto lineage. This is a conservative estimate of the total number of genes that have out-of-lineage homologs, because, as described above, even this synteny-based homology search has finite sensitivity. Spo13, the gene shown in Fig 6, is one example of these lineage-specific genes that nonetheless did not originate de novo: It has out-of-lineage orthologs identifiable by synteny in S. castellii, K. waltii, K. lactis, and A. gossypii.

Genes that acquire a new function following duplication and divergence (“neofunctionalization”) are another proposed source of lineage-specific genes [12]. We therefore asked how many of our sensu stricto–specific genes have a paralog, consistent with the hypothesis that they emerged through duplication and divergence. Based on BLASTP searches within the S. cerevisiae genome, we find that 4 of the 25 lineage-specific genes have annotated paralogs specific to some subset of the sensu stricto yeasts, which therefore likely emerged after their divergence from S. castellii. We also find using YGOB that another 4 of these 25 genes have annotated paralogs resulting from the yeast whole-genome duplication, which occurred before the divergence of S. castellii from the sensu stricto yeasts. In total, 8/25, or fewer than one third, of these genes show evidence of having been the result of duplication events. However, we note that this estimate for the number of genes with paralogs is again conservative because of the finite sensitivity of the homology searches.

Finally, we performed a gene ontology enrichment test (Methods) to determine if certain biological processes were statistically overrepresented among these 25 genes. We find significant enrichment of genes involved in several GO categories relating to spore formation and meiosis, including “ascospore-type prospore membrane assembly” (p = 7 × 10^−5; 3 observed versus 0.7 expected) and “meiotic cell cycle process” (p = 5 × 10^−5; 7 observed versus 1 expected). Spo13, involved in meiotic cell cycle regulation through its roles in maintaining sister chromatid cohesion during meiosis I and promoting kinetochore attachment [43], is one such example. By contrast, no biological processes were overrepresented among lineage-specific genes that are consistent with homology detection failure (although these genes are much less likely to have GO annotations at all: 92% have no annotation, compared with 12% of all S. cerevisiae genes and 36% of lineage-specific genes that are inconsistent with homology detection failure).

A table of these 25 genes and the features discussed in this section can be found in S6 Table.

Discussion

The widespread interpretation of lineage-specific genes as evolutionarily novel assumes that absence of evidence for detectable homologs in outgroups is evidence that homologs are absent. The model we have presented here allows us to formally test the alternative, null hypothesis: Homologs do exist outside the specified lineage, but they have diverged, at a constant novelty-free evolutionary rate, beyond the ability of a similarity search program to detect them. We find that this hypothesis is sufficient to explain a large number of lineage-specific genes in 2 taxa in which lineage-specific genes have been interpreted as exhibiting some kind of evolutionary novelty. These results caution against automatically assuming that lineage-specific genes are novel.

Two important caveats should be kept in mind. First, this method cannot exclude the possibility that a gene is truly novel, but also short enough and evolving fast enough that its ortholog would not be detected if present, such that homology detection failure can also explain its lineage specificity. For this reason, it may be difficult for de novo genes in particular, which have been hypothesized to be short and fast-evolving, to reject the null model. However, in this case, where 2 hypotheses can both explain a lineage-specific gene, we argue that additional evidence should be required to prefer the comparatively exotic hypothesis of novelty to the more conservative one of detection failure. Our case study in the sensu stricto yeasts finds that more sensitive synteny-based homology searches successfully find previously undetected homologs for many lineage-specific genes, supporting this argument. Second, these results may or may not generalize to classes of lineage-specific genes that we have not considered here. Because our method requires at least 2 observed orthologs, we have only applied it here to genes found in at least 2 species in the lineages in question. Moreover, like other studies, we focused on genes in existing annotations, which are prone to biases that may exclude novel genes. However, when we analyze cerevisiae ORFs with the requisite 2 orthologs that are marked as of “dubious” coding status in the Saccharomyces Genome Database, we find that an even larger proportion—nearly all—are unlikely to be detected. Although we have not done so here, we note that this method could be extended to these classes of genes. In principle, individual conspecifics with sufficient genetic differentiation could be used as discrete taxa in our method to analyze genes found in only 1 or 2 species. Additionally, the method is readily applicable to any protein annotation, which could be custom-made as desired.

Although we find that many lineage-specific genes can be adequately explained by homology detection failure, we also find a minority of lineage-specific genes in fungi and insects that cannot. This leaves open the possibility that these genes are biologically novel. However, the reason that these genes reject our null model is not addressed by our present work. Our initial analyses do show that many of these genes are neither de novo genes nor have detectable paralogs, suggesting that processes other than the commonly proposed hypotheses of de novo origination and duplication-divergence may be at play. There are many possible processes that could cause genes to deviate from our null model, but one speculative example lies in the observed enrichment in yeast of genes involved in meiotic processes, exemplified by Spo13. This strikes us as suggestive of meiotic drive phenomena, which have been observed in yeast [44] and have been shown to cause rapid protein divergence [45], producing clade-specific rate accelerations leading to lineage-specific genes. More detailed characterization of these genes is required to understand if and in what way they are evolutionarily novel.

There is increasing consensus that homology detection failure is frequent [28]. It should be taken into account in studies that aim to use lineage-specific genes to identify candidates for genetic novelty. To this end, our approach allows us to determine whether a particular lineage-specific gene is attributable to homology detection failure. Synteny analyses of the kind used here can sometimes be used to determine whether out-of-lineage orthologs are present [46, 47] and can provide strong evidence of de novo origination [48], but syntenic analyses are only possible in the limited taxa where sequenced species are related closely enough that synteny is conserved. By contrast, our method can be used in any set of species for which relative evolutionary distances are known. We expect it to be useful in the wide variety of studies that aim to identify novel genes that may underlie the evolution of morphological, behavioral, and other novel traits [7, 49–53]. An implementation of our method, with all raw data and results presented here, is freely available as source code at github.com/caraweisman/abSENSE and as a web server at eddylab.org/abSENSE.

Methods

Identification of S. cerevisiae and D. melanogaster orthologs

We downloaded previously annotated proteomes of all species used here from several sources, largely Refseq and GenBank. Accession IDs for Refseq and GenBank proteomes and download links for those from other sources are listed in S2 Table. We performed a BLASTP (version 2.8.0) search [54] with an E-value threshold of 0.001 using the S. cerevisiae proteome as the query against each of the 11 other yeast proteomes independently. We also performed the reciprocal of each of these searches, using each of the 11 other yeast proteomes as the query against the S. cerevisiae proteome. We used a custom Python script to identify reciprocal best BLAST hits for each S. cerevisiae protein in each of the other yeast proteomes. A protein in another yeast’s proteome was considered a reciprocal best hit to the S. cerevisiae protein if (1) the E-value of the S. cerevisiae protein against that protein was the lowest of any in that species’ proteome and (2) the E-value of that protein against the S. cerevisiae protein was the lowest of any protein in the S. cerevisiae proteome. Proteins in the other yeast species satisfying this reciprocal best hit criterion were considered orthologs of the S. cerevisiae protein. When no significant homology to a S. cerevisiae protein was detected in another species, or when the reciprocal best hit criterion was not met by any protein in that species, no ortholog was assigned in that species. To identify orthologs for D. melanogaster proteins, we repeated this same procedure for all D. melanogaster proteins and each of the 21 other insect species’ proteomes.
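
The reciprocal best hit criterion described above can be sketched as follows. This is an illustrative reimplementation, not the custom script used in the study. It assumes hits have already been parsed into (query, subject, E-value) tuples, and all gene names and E-values are hypothetical.

```python
def best_hits(hits):
    """Map each query to its lowest-E-value subject.

    `hits` is an iterable of (query, subject, evalue) tuples, e.g. parsed
    from BLASTP tabular output; ties keep the first hit seen.
    """
    best = {}
    for query, subject, evalue in hits:
        if query not in best or evalue < best[query][1]:
            best[query] = (subject, evalue)
    return {q: s for q, (s, _) in best.items()}

def reciprocal_best_hits(forward, reverse, max_evalue=0.001):
    """Return {query: subject} pairs that are each other's best hit."""
    fwd = best_hits(h for h in forward if h[2] < max_evalue)
    rev = best_hits(h for h in reverse if h[2] < max_evalue)
    return {q: s for q, s in fwd.items() if rev.get(s) == q}

# Hypothetical hits: S. cerevisiae queries against one other proteome, and back.
forward = [("YFG1", "other_A", 1e-50), ("YFG1", "other_B", 1e-10),
           ("YFG2", "other_B", 1e-20)]
reverse = [("other_A", "YFG1", 1e-48), ("other_B", "YFG1", 1e-12),
           ("other_B", "YFG2", 1e-8)]
orthologs = reciprocal_best_hits(forward, reverse)
print(orthologs)
```

In this toy example YFG2 receives no ortholog: its best hit, other_B, has a better reverse hit to YFG1, so the reciprocal criterion is not met.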

Calculation of evolutionary distances

Because evolutionary distance t only appears in our model as a product with the gene-specific rate parameter R, we can use a subset of genes in the species group to infer these relative distances. Each gene’s value of R will scale these relative distances appropriately when fit to the model: Genes that evolve faster than these relative distances imply will have values of R above 1, and genes that evolve more slowly, values below 1. We used BUSCO genes as the subset of genes from which to estimate distances, as they are generally well-conserved, facilitating ortholog identification and alignment. This enables our desired result of a species tree with correct relative evolutionary distances (in substitutions/site across aligned BUSCO genes), which is the only feature needed by our downstream inference. We downloaded a list of eukaryotic BUSCO genes [36] from the BUSCO web server (https://busco.ezlab.org/) and identified all of these genes for which we were able to identify an ortholog of the corresponding S. cerevisiae gene in all 11 other yeast species (“Identification of S. cerevisiae and D. melanogaster orthologs”). We found 102 such BUSCO genes. We used the alignment software MUSCLE (version 3.8.31) [55] with default parameters to create a multiple sequence alignment of the orthologs from all 12 yeast species for each of these 102 genes. We then concatenated these alignments and used the Protdist program from the PHYLIP software package (version 3.696) [56] with default parameters to find pairwise evolutionary distances for all 12 yeast species in substitutions per site. To test the effect of using a smaller number of genes to infer these distances, we then randomly and independently selected 2 subsets of 15 of these 102 genes and performed the same alignment and distance calculation procedure on each of these 2 subsets. We then performed the same procedure using D. melanogaster genes and the 21 other insect species. Here, there were 125 BUSCOs for which we were able to identify orthologs in all species, and the 2 random subsets of 15 genes were selected from among these 125. RefSeq accessions for genes in the 3 sets of BUSCOs in both taxa are listed in S7 Table.
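
The concatenation and random subsetting steps can be sketched as below. This is a toy illustration with hypothetical species names and sequences. The alignment step (MUSCLE) and the distance calculation (Protdist) are not shown, and the data structures are assumptions rather than those of the original scripts.

```python
import random

def concatenate(alignments, species):
    """Join per-gene alignments end to end, one row per species.

    `alignments` is a list of {species: aligned_sequence} dicts; every gene
    must contain every species, and rows within a gene must be equally long.
    """
    rows = {sp: [] for sp in species}
    for aln in alignments:
        lengths = {len(aln[sp]) for sp in species}
        if len(lengths) != 1:
            raise ValueError("rows of an alignment differ in length")
        for sp in species:
            rows[sp].append(aln[sp])
    return {sp: "".join(parts) for sp, parts in rows.items()}

# Toy alignments for three hypothetical genes and species.
species = ["sp1", "sp2", "sp3"]
genes = [
    {"sp1": "MKT-A", "sp2": "MKTLA", "sp3": "MRT-A"},
    {"sp1": "GG", "sp2": "GA", "sp3": "G-"},
    {"sp1": "LLV", "sp2": "LIV", "sp3": "LLV"},
]
full = concatenate(genes, species)

# A random subset of genes, analogous to drawing 15 of the 102 BUSCOs.
rng = random.Random(0)  # fixed seed so the draw is repeatable
subset = rng.sample(genes, 2)
partial = concatenate(subset, species)
print(full["sp1"], len(partial["sp1"]))
```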

Correlation of R parameter with evolutionary rate

To determine the correlation between each gene’s best-fit value of the R parameter in our model and the substitution rate, we used alignments of 5,261 S. cerevisiae genes and their orthologs in all 4 other sensu stricto yeast species generated by a previous study [57]. We opted not to include more distantly related species in these alignments for the sake of more reliable ortholog identification and alignment construction. We used the Protdist program of the PHYLIP package (version 3.696) [56] on these alignments to infer the number of substitutions per site between the S. cerevisiae gene and its ortholog in the distant sensu stricto yeast S. kudriavzevii (we chose a fairly distant representative of these species to minimize sampling error from low substitution counts) and correlated this value with the R parameter inferred from the regression analysis.

Identification of lineage-specific genes

To identify S. cerevisiae genes specific to the 3 yeast lineages tested here, we performed a BLASTP search [54] with an E-value threshold of 0.001 for each gene in the S. cerevisiae proteome as the query against each of the 11 other yeast proteomes independently, using the same proteomes listed in S2 Table. If the BLASTP search detected no homologs of the S. cerevisiae gene in the proteomes of any of these species outside of the specified lineage, we considered it lineage-specific. We applied the same criterion using the 21 other insect proteomes to identify D. melanogaster genes specific to the 3 insect lineages tested here.
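
This criterion amounts to a set comparison, sketched below. The function name, gene names, and hit sets are hypothetical, and the lineage shown is a toy subset of species chosen only for illustration.

```python
def is_lineage_specific(species_with_hits, lineage):
    """True if no homolog was detected in any species outside `lineage`."""
    outside = set(species_with_hits) - set(lineage)
    return not outside

# Toy lineage (not a complete species list) and hypothetical BLASTP results:
# for each gene, the species in which a hit with E < 0.001 was found.
lineage = {"S. cerevisiae", "S. kudriavzevii", "S. bayanus"}
hits = {
    "geneA": {"S. kudriavzevii", "S. bayanus"},
    "geneB": {"S. bayanus", "S. castellii"},
    "geneC": {"S. kudriavzevii", "K. lactis"},
}
specific = sorted(g for g, sp in hits.items() if is_lineage_specific(sp, lineage))
print(specific)
```

Only geneA is called lineage-specific here, because geneB and geneC each have a detected homolog in a species outside the toy lineage.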

Synteny-based homology searches

We used version 7 of the YGOB’s online web tool (http://ygob.ucd.ie/) [42]. For tested S. cerevisiae genes, if the gene was included in this YGOB version, we determined whether an orthologous chromosomal region in any of the outgroup yeast species used here had been identified in the browser. If so, we searched for any genes in these outgroup species at the locus that were annotated in the browser. We considered genes to be within the outgroup orthologous locus if they were between the outgroup’s orthologs of the closest S. cerevisiae genes up- and downstream of the query gene. If annotated genes existed at the orthologous locus, we performed a BLASTP search of the S. cerevisiae sequence against the sequences of all outgroup genes at that locus as listed in YGOB and called orthology in cases where this single-search E-value was <0.001.
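
The definition of the orthologous locus used here, the genes lying between the outgroup orthologs of the flanking S. cerevisiae genes, can be sketched as follows. The representation of a chromosome as an ordered list of gene names is an assumption made for illustration and does not reflect YGOB's own data format. All gene names are hypothetical.

```python
def genes_at_locus(chromosome, upstream_ortholog, downstream_ortholog):
    """Genes lying strictly between two flanking genes on one chromosome.

    `chromosome` is a list of gene names in chromosomal order for the outgroup.
    Returns an empty list if either flank is missing from the chromosome.
    """
    if upstream_ortholog not in chromosome or downstream_ortholog not in chromosome:
        return []
    i = chromosome.index(upstream_ortholog)
    j = chromosome.index(downstream_ortholog)
    lo, hi = sorted((i, j))  # the locus may be inverted in the outgroup
    return chromosome[lo + 1:hi]

# Hypothetical outgroup chromosome and flanking orthologs.
outgroup_chromosome = ["g1", "g2", "g3", "g4", "g5"]
candidates = genes_at_locus(outgroup_chromosome, "g2", "g5")
print(candidates)
```

Each candidate returned this way would then be compared to the S. cerevisiae query in a single BLASTP search, with orthology called at E < 0.001 as described above.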

Gene ontology analysis

We used the Gene Ontology Consortium’s online web server (http://geneontology.org/) [58] to test whether or not certain biological functions were enriched in the set of sensu stricto–specific genes that we found to be poorly explained by detection failure. We performed a Fisher’s exact test using the “GO biological process complete” annotation data set for all S. cerevisiae genes.
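
For readers unfamiliar with the test, a one-sided Fisher's exact test for enrichment can be written directly from the hypergeometric distribution, as in the sketch below. This is not the web server's implementation, which may differ in sidedness and multiple-testing correction. The counts used are hypothetical and are not the values underlying the p-values reported in the Results.

```python
from math import comb

def enrichment_p(k, n, K, N):
    """One-sided Fisher's exact test: P(X >= k) for a hypergeometric X.

    N genes in the genome, K of them annotated with the term,
    n genes in the test set, k of those annotated with the term.
    """
    total = comb(N, n)
    upper = min(n, K)
    # Sum the probability of drawing k or more annotated genes.
    tail = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return tail / total

# Hypothetical counts: 240 of 6,000 genes carry the term, so about 1 is
# expected in a set of 25 genes; suppose 7 are observed.
p = enrichment_p(k=7, n=25, K=240, N=6000)
print(f"p = {p:.2e}")
```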

Calculation of substitutions/site and gaps for sensu stricto alignments

Although Vakirlis and colleagues (2020) [28] provide dN values computed from alignments of S. cerevisiae genes and their sensu stricto orthologs, these underlying alignments are restricted to the 5,261 genes for which orthologs were identified in all 5 sensu stricto species in a previous study [57]. Because we worried that this might introduce a bias against quickly evolving genes, for which orthologs are less readily identifiable, we opted to make our own alignments, including all 5,586 genes for which we could identify an ortholog in at least one of the 4 other sensu stricto species. We used both MUSCLE [55] and ClustalOmega [59] with default settings to produce a multiple alignment of each S. cerevisiae gene and its orthologs in at least one other sensu stricto species. We then used these alignments to compute substitutions/site between S. cerevisiae and S. bayanus with the PHYLIP Protdist program as in our other evolutionary distance calculations (“Calculation of evolutionary distances” section). We chose S. bayanus because it is the most distant species from S. cerevisiae according to our analysis (Fig 2). We then used all genes with an ortholog present in S. bayanus, regardless of its status in the 3 other yeast species, in the subsequent analysis. Results from the 2 alignment programs were extremely similar, as were results using distances to the slightly closer S. kudriavzevii.

For the analysis shown in S1 Supporting Information, we then used these same alignments to count the total number of gaps in each alignment and divide by the number of columns and number of sequences in the alignment to calculate the gaps per column per sequence.
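
The gap statistic described above is simple to compute, as the following sketch shows. The alignment is assumed to be a list of equal-length strings with "-" as the gap character. The function name and the toy sequences are hypothetical.

```python
def gaps_per_column_per_sequence(alignment, gap="-"):
    """Total gap characters divided by (columns x sequences)."""
    n_seqs = len(alignment)
    n_cols = len(alignment[0])
    if any(len(row) != n_cols for row in alignment):
        raise ValueError("all rows of an alignment must have the same length")
    total_gaps = sum(row.count(gap) for row in alignment)
    return total_gaps / (n_cols * n_seqs)

# Toy alignment: 3 sequences, 5 columns, 3 gap characters in total.
toy_alignment = ["MK-TA", "MKLTA", "M--TA"]
gap_rate = gaps_per_column_per_sequence(toy_alignment)
print(gap_rate)
```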

Detectability prediction analysis of microsyntenic lineage-specific genes

For the analysis shown in S1 Supporting Information, we aimed to repeat our original analysis of fungal lineage-specific genes, restricted to genes determined to be in microsyntenic regions by Vakirlis and colleagues (2020) [28]. We included genes in this analysis as follows. We started with the same list of genes specific to the lineages for which S. castellii and K. waltii are the closest outgroups (Fig 4) as in our original analysis. From these, we selected genes that Vakirlis and colleagues determined to be in a microsyntenic region in at least one of (1) a species within that lineage; (2) that species itself (S. castellii or K. waltii); or (3) another outgroup species to the lineage of very similar divergence time to (2). We chose to include genes in microsyntenic regions in species within the lineage and not just in its closest outgroup to be maximally conservative and because the number of genes in microsyntenic regions only in the outgroup species was low. We chose to allow for another outgroup species of similar divergence time to be substituted for the species that we used as outgroup because the set of species for which Vakirlis and colleagues performed a synteny analysis did not overlap exactly with the set of species used here (for example, K. waltii was not included), such that this was the closest approximation possible using those data. In the case of S. castellii, these species included S. arboricola, S. kudriavzevii, and S. castellii itself. In the case of K. waltii, these species included K. lactis, A. gossypii, L. thermotolerans, E. cymbalariae, A. aceri, S. arboricola, S. kudriavzevii, and S. castellii.

Supporting information

S1 Table. Inferred distances in substitutions/site from S. cerevisiae to each yeast species (top) and from D. melanogaster to each insect species (bottom).

Distances were inferred from all BUSCOs with orthologs identifiable in each species group, as well as from 15 genes randomly selected from these BUSCOs. The “15 BUSCOs subset 1” distances were used for all main figures in the text. BUSCO, Benchmarking Universal Single Copy Ortholog.

(XLSX)

S2 Table. Sources of species protein annotations used in this study.

(XLSX)

S3 Table. All lineage-specific genes and their values of P(detected | null model, t_outgroup) for the 6 lineages (3 fungi, 3 insect) considered here.

(XLSX)

S4 Table. Correlation coefficients for gene detectability prediction results based on evolutionary distance estimates derived from 3 different sets of genes (the same as those shown in S1 Fig).

(XLSX)

S5 Table. List of 11 S. cerevisiae genes for which synteny-based searches in YGOB revealed candidate out-of-lineage orthologs, the YGOB IDs of those orthologs, and their synteny search E-values.

YGOB, Yeast Gene Order Browser.

(XLSX)

S6 Table. List of sensu stricto–specific S. cerevisiae genes that are poorly explained by the hypothesis of detection failure and their features as described in summary in the text.

(XLSX)

S7 Table. List of RefSeq accession IDs for BUSCOs used in evolutionary distance calculations.

BUSCO, Benchmarking Universal Single Copy Ortholog.

(XLSX)

S1 Fig. r^2 distributions for the fit to the model of S. cerevisiae and D. melanogaster genes using evolutionary distances derived from 3 sets of genes.

a: S. cerevisiae genes with distances derived from 102 BUSCOs. b: S. cerevisiae genes with distances derived from a randomly selected subset of 15 of the BUSCOs used in a. c: S. cerevisiae genes with distances derived from a second randomly selected subset of 15 of the BUSCOs used in a. d: D. melanogaster genes with distances derived from 125 BUSCOs. e: D. melanogaster genes with distances derived from a randomly selected subset of 15 of the BUSCOs used in d. f: D. melanogaster genes with distances derived from a second randomly selected subset of 15 of the BUSCOs used in d. In d-f, the peak near r^2 = 0 consists of genes with orthologs identifiable only in a subset of the closely related Drosophilid flies, such that their sequences are identical or nearly identical in all species, except 1 or 2 in which a large chunk of the melanogaster protein is absent from the annotation, resulting in almost none of the variance in score (of which there is none, save this large event) being explained by divergence time. We consider this an artifact of the method, as it only appears in the limited cases where the sequences in question are almost totally identical. Data used to generate these figures are available at https://github.com/caraweisman/abSENSE/tree/master/Data_for_supplemental_figures. BUSCO, Benchmarking Universal Single Copy Ortholog.

(EPS)

S2 Fig. Distribution of position of BLASTP scores between S. cerevisiae and outgroup yeast (top) and D. melanogaster and outgroup insects (bottom) relative to the predicted confidence interval.

0 indicates that the score has the same value as the best fit to the model; multiples of sigma indicate that the score is that many standard deviations above or below the best-fit value. a: S. cerevisiae genes with distances derived from 102 BUSCOs. b: S. cerevisiae genes with distances derived from a randomly selected subset of 15 of the BUSCOs used in a. c: S. cerevisiae genes with distances derived from a second randomly selected subset of 15 of the BUSCOs used in a. d: D. melanogaster genes with distances derived from 125 BUSCOs. e: D. melanogaster genes with distances derived from a randomly selected subset of 15 of the BUSCOs used in d. f: D. melanogaster genes with distances derived from a second randomly selected subset of 15 of the BUSCOs used in d. Data used to generate these figures are available at https://github.com/caraweisman/abSENSE/tree/master/Data_for_supplemental_figures. BUSCO, Benchmarking Universal Single Copy Ortholog.

(EPS)

S3 Fig. Correlation between best-fit parameters and gene properties in yeast.

a: Correlation between each S. cerevisiae protein’s best-fit value of a and its length in amino acids. The a parameter is consistently larger than the length due to most identical alignment positions contributing a score larger than 1 according to the scoring scheme used here (BLOSUM62). b: Correlation between each S. cerevisiae protein’s best-fit value of b and its relative evolutionary rate in substitutions per site from sensu stricto protein alignments (Methods). Data used to generate these figures are available at https://github.com/caraweisman/abSENSE/tree/master/Data_for_supplemental_figures.

(EPS)

S4 Fig. Distribution of best-fit parameter values for all S. cerevisiae proteins.

a: Distribution of the best-fit a values for all S. cerevisiae proteins. b: Distribution of the best-fit b values for all S. cerevisiae proteins. Data used to generate these figures are available at https://github.com/caraweisman/abSENSE/tree/master/Data_for_supplemental_figures.

(EPS)

S5 Fig. Results of “dubious” ORF analysis in S. cerevisiae.

Top: Distributions of detectability prediction results for all S. cerevisiae lineage-specific genes annotated as of “dubious” coding status in the Saccharomyces Genome Database [41] in 3 yeast lineages (a, b, c). Bottom: Depiction of the lineage (yellow) and closest outgroup (blue) considered in the analyses in the corresponding column. In c), note that Y. lipolytica is the topological outgroup to the shaded lineage, but is not the closest species by evolutionary distance (branch lengths are not to scale). Data used to generate these figures are available at https://github.com/caraweisman/abSENSE/tree/master/Data_for_supplemental_figures. ORF, open reading frame.

(EPS)

S1 Supporting Information. Supplemental information.

Justification for the functional form of the model; effect of site-specific selection pressure; analysis of data in Vakirlis and colleagues (2020).

(DOCX)

Abbreviations

BUSCO

Benchmarking Universal Single Copy Ortholog

Mya

million years ago

ORF

open reading frame

YGOB

Yeast Gene Order Browser

Data Availability

All data used in these analyses and the scripts necessary to reproduce them are available in the Supporting information and on our code repository at http://www.github.com/caraweisman/abSENSE.

Funding Statement

This work was primarily funded by a Howard Hughes Medical Institute investigator award to SRE. SRE is also supported in part by the NIH (R01-HG009116), and AWM is supported in part by grants from the NIH (R01-GM43987) and the NSF-Simons Center for the Mathematical and Statistical Analysis of Biology (NSF #1764269, Simons #594596). Computations were done on the Cannon cluster supported by the FAS Division of Science, Research Computing Group at Harvard University. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Cai JJ, Woo PC, Lau SK, Smith DK, Yuen K-Y. Accelerated evolutionary rate may be responsible for the emergence of lineage-specific genes in ascomycota. Journal of Molecular Evolution. 2006;63:1–11. doi:10.1007/s00239-004-0372-5
  • 2. Wilson G, Bertrand N, Patel Y, Hughes J, Feil E, Field D. Orphans as taxonomically restricted and ecologically important genes. Microbiology. 2005;151:2499–501. doi:10.1099/mic.0.28146-0
  • 3. Khalturin K, Hemmrich G, Fraune S, Augustin R, Bosch TC. More than just orphans: are taxonomically-restricted genes important in evolution? Trends in Genetics. 2009;25:404–13. doi:10.1016/j.tig.2009.07.006
  • 4. Dujon B. The yeast genome project: what did we learn? Trends in Genetics. 1996;12:263–70. doi:10.1016/0168-9525(96)10027-5
  • 5. Domazet-Loso T, Tautz D. An evolutionary analysis of orphan genes in Drosophila. Genome Research. 2003;13:2213–9. doi:10.1101/gr.1311003
  • 6. Zhou K, Huang B, Zou M, Lu D, He S, Wang G. Genome-wide identification of lineage-specific genes within Caenorhabditis elegans. Genomics. 2015;106:242–8. doi:10.1016/j.ygeno.2015.07.002
  • 7. Johnson BR, Tsutsui ND. Taxonomically restricted genes are associated with the evolution of sociality in the honey bee. BMC Genomics. 2011;12:164. doi:10.1186/1471-2164-12-164
  • 8. Sollars ES, Harper AL, Kelly LJ, Sambles CM, Ramirez-Gonzalez RH, Swarbreck D, et al. Genome sequence and genetic diversity of European ash trees. Nature. 2017;541:212. doi:10.1038/nature20786
  • 9. Toll-Riera M, Castelo R, Bellora N, Albà M. Evolution of primate orphan proteins. Biochemical Society Transactions. 2009;37:778–82. doi:10.1042/BST0370778
  • 10. Neme R, Tautz D. Phylogenetic patterns of emergence of new genes support a model of frequent de novo evolution. BMC Genomics. 2013;14:117. doi:10.1186/1471-2164-14-117
  • 11. Carvunis A-R, Rolland T, Wapinski I, Calderwood MA, Yildirim MA, Simonis N, et al. Proto-genes and de novo gene birth. Nature. 2012;487:370. doi:10.1038/nature11184
  • 12. Tautz D, Domazet-Lošo T. The evolutionary origin of orphan genes. Nature Reviews Genetics. 2011;12:692. doi:10.1038/nrg3053
  • 13. Domazet-Lošo T, Brajković J, Tautz D. A phylostratigraphy approach to uncover the genomic history of major adaptations in metazoan lineages. Trends in Genetics. 2007;23:533–9. doi:10.1016/j.tig.2007.08.014
  • 14. Arendsee ZW, Li L, Wurtele ES. Coming of age: orphan genes in plants. Trends in Plant Science. 2014;19:698–708. doi:10.1016/j.tplants.2014.07.003
  • 15. Luis Villanueva-Cañas J, Ruiz-Orera J, Agea MI, Gallo M, Andreu D, Albà MM. New genes and functional innovation in mammals. Genome Biology and Evolution. 2017;9:1886–900. doi:10.1093/gbe/evx136
  • 16. Khalturin K, Anton-Erxleben F, Sassmann S, Wittlieb J, Hemmrich G, Bosch TC. A novel gene family controls species-specific morphological traits in Hydra. PLoS Biol. 2008;6:e278. doi:10.1371/journal.pbio.0060278
  • 17. Bowles AM, Bechtold U, Paps J. The Origin of Land Plants Is Rooted in Two Bursts of Genomic Novelty. Current Biology. 2020. doi:10.1016/j.cub.2019.11.090
  • 18. Thomas GW, Dohmen E, Hughes DS, Murali SC, Poelchau M, Glastad K, et al. Gene content evolution in the arthropods. Genome Biology. 2020;21:1–14. doi:10.1186/s13059-019-1925-7
  • 19. Šestak MS, Domazet-Lošo T. Phylostratigraphic profiles in zebrafish uncover chordate origins of the vertebrate brain. Molecular Biology and Evolution. 2015;32:299–312. doi:10.1093/molbev/msu319
  • 20. Richter DJ, Fozouni P, Eisen MB, King N. Gene family innovation, conservation and loss on the animal stem lineage. eLife. 2018;7:e34226. doi:10.7554/eLife.34226
  • 21. Moyers BA, Zhang J. Evaluating phylostratigraphic evidence for widespread de novo gene birth in genome evolution. Molecular Biology and Evolution. 2016;33:1245–56. Epub 2016/01/14. doi:10.1093/molbev/msw008
  • 22. Moyers BA, Zhang J. Further simulations and analyses demonstrate open problems of phylostratigraphy. Genome Biology and Evolution. 2017;9:1519–27. doi:10.1093/gbe/evx109
  • 23. Moyers BA, Zhang J. Phylostratigraphic bias creates spurious patterns of genome evolution. Molecular Biology and Evolution. 2014;32:258–67. doi:10.1093/molbev/msu286
  • 24. Moyers BA, Zhang J. Toward reducing phylostratigraphic errors and biases. Genome Biology and Evolution. 2018;10:2037–48. doi:10.1093/gbe/evy161
  • 25. Elhaik E, Sabath N, Graur D. The “inverse relationship between evolutionary rate and age of mammalian genes” is an artifact of increased genetic distance with rate of evolution and time of divergence. Molecular Biology and Evolution. 2005;23:1–3. doi:10.1093/molbev/msj006
  • 26. Albà MM, Castresana J. On homology searches by protein BLAST and the characterization of the age of genes. BMC Evolutionary Biology. 2007;7:53. doi:10.1186/1471-2148-7-53
  • 27. Domazet-Lošo T, Carvunis A-R, Albà M, Šestak MS, Bakarić R, Neme R, et al. No evidence for phylostratigraphic bias impacting inferences on patterns of gene emergence and evolution. Molecular Biology and Evolution. 2017;34:843–56. doi:10.1093/molbev/msw284
  • 28. Vakirlis N, Carvunis AR, McLysaght A. Synteny-based analyses indicate that sequence divergence is not the main source of orphan genes. eLife. 2020;9. Epub 2020/02/19. doi:10.7554/eLife.53500
  • 29. Wissler L, Gadau J, Simola DF, Helmkampf M, Bornberg-Bauer E. Mechanisms and dynamics of orphan gene emergence in insect genomes. Genome Biology and Evolution. 2013;5:439–55. doi:10.1093/gbe/evt009
  • 30. Palmieri N, Kosiol C, Schlötterer C. The life cycle of Drosophila orphan genes. eLife. 2014;3:e01311. doi:10.7554/eLife.01311
  • 31. Vakirlis N, Hebert AS, Opulente DA, Achaz G, Hittinger CT, Fischer G, et al. A molecular portrait of de novo genes in yeasts. Molecular Biology and Evolution. 2017;35:631–45.
  • 32. Ekman D, Elofsson A. Identifying and quantifying orphan protein sequences in fungi. Journal of Molecular Biology. 2010;396:396–405. doi:10.1016/j.jmb.2009.11.053
  • 33. Beimforde C, Feldberg K, Nylinder S, Rikkinen J, Tuovila H, Dörfelt H, et al. Estimating the Phanerozoic history of the Ascomycota lineages: combining fossil and molecular data. Molecular Phylogenetics and Evolution. 2014;78:386–98. doi:10.1016/j.ympev.2014.04.024
  • 34. Fitzpatrick DA, Logue ME, Stajich JE, Butler G. A fungal phylogeny based on 42 complete genomes derived from supertree and combined gene analysis. BMC Evolutionary Biology. 2006;6:99. doi:10.1186/1471-2148-6-99
  • 35. Misof B, Liu S, Meusemann K, Peters RS, Donath A, Mayer C, et al. Phylogenomics resolves the timing and pattern of insect evolution. Science. 2014;346:763–7. doi:10.1126/science.1257570
  • 36. Waterhouse RM, Seppey M, Simão FA, Manni M, Ioannidis P, Klioutchnikov G, et al. BUSCO applications from quality assessments to gene prediction and phylogenomics. Molecular Biology and Evolution. 2017;35:543–8.
  • 37. Zhang J. Protein-length distributions for the 3 domains of life. Trends in Genetics. 2000;16:107–9. doi:10.1016/s0168-9525(99)01922-8
  • 38. Koonin EV. Are there laws of genome evolution? PLoS Comput Biol. 2011;7:e1002173. doi:10.1371/journal.pcbi.1002173
  • 39. Cai J, Zhao R, Jiang H, Wang W. De novo origination of a new protein-coding gene in Saccharomyces cerevisiae. Genetics. 2008;179:487–96. doi:10.1534/genetics.107.084491
  • 40. Metzger MB, Michaelis S. Analysis of quality control substrates in distinct cellular compartments reveals a unique role for Rpn4p in tolerating misfolded membrane proteins. Molecular Biology of the Cell. 2009;20:1006–19. doi:10.1091/mbc.e08-02-0140
  • 41. Ng PC, Wong ED, MacPherson KA, Aleksander S, Argasinska J, Dunn B, et al. Transcriptome visualization and data availability at the Saccharomyces Genome Database. Nucleic Acids Research. 2020;48:D743–D8. doi:10.1093/nar/gkz892
  • 42. Byrne KP, Wolfe KH. The Yeast Gene Order Browser: combining curated homology and syntenic context reveals gene fate in polyploid species. Genome Research. 2005;15:1456–61. doi:10.1101/gr.3672305
  • 43. Shonn MA, McCarroll R, Murray AW. Spo13 protects meiotic cohesin at centromeres in meiosis I. Genes & Development. 2002;16:1659–71. doi:10.1101/gad.975802
  • 44. Nuckolls NL, Núñez MAB, Eickbush MT, Young JM, Lange JJ, Jonathan SY, et al. wtf genes are prolific dual poison-antidote meiotic drivers. eLife. 2017;6:e26033. doi:10.7554/eLife.26033
  • 45. Malik HS, Henikoff S. Adaptive evolution of Cid, a centromere-specific histone in Drosophila. Genetics. 2001;157:1293–8.
  • 46. McLysaght A, Hurst LD. Open questions in the study of de novo genes: what, how and why. Nature Reviews Genetics. 2016;17:567. doi:10.1038/nrg.2016.78
  • 47. Guerzoni D, McLysaght A. De novo genes arise at a slow but steady rate along the primate lineage and have been subject to incomplete lineage sorting. Genome Biology and Evolution. 2016;8:1222–32. doi:10.1093/gbe/evw074
  • 48. Zhuang X, Yang C, Murphy KR, Cheng CC. Molecular mechanism and history of non-sense to sense evolution of antifreeze glycoprotein gene in northern gadids. Proceedings of the National Academy of Sciences. 2019;116:4400–5. doi:10.1073/pnas.1817138116
  • 49. Aguilera F, McDougall C, Degnan BM. Co-option and de novo gene evolution underlie molluscan shell diversity. Molecular Biology and Evolution. 2017;34:779–92. doi:10.1093/molbev/msw294
  • 50. Surm JM, Stewart ZK, Papanicolaou A, Pavasovic A, Prentis PJ. The draft genome of Actinia tenebrosa reveals insights into toxin evolution. Ecology and Evolution. 2019;9(19):11314–11328. doi:10.1002/ece3.5633
  • 51. Milde S, Hemmrich G, Anton-Erxleben F, Khalturin K, Wittlieb J, Bosch TC. Characterization of taxonomically restricted genes in a phylum-restricted cell type. Genome Biology. 2009;10:R8. doi:10.1186/gb-2009-10-1-r8
  • 52. Behl S, Wu T, Chernyshova A, Thompson G. Caste-biased genes in a subterranean termite are taxonomically restricted: implications for novel gene recruitment during termite caste evolution. Insectes Sociaux. 2018;65:593–9.
  • 53. Shigenobu S, Stern DL. Aphids evolved novel secreted proteins for symbiosis with bacterial endosymbiont. Proceedings of the Royal Society B: Biological Sciences. 2013;280:20121952. doi:10.1098/rspb.2012.1952
  • 54. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research. 1997;25:3389–402. doi:10.1093/nar/25.17.3389
  • 55. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Research. 2004;32:1792–7.
  • 56. Felsenstein J. PHYLIP (phylogeny inference package), version 3.5c. Joseph Felsenstein; 1993.
  • 57. Scannell DR, Zill OA, Rokas A, Payen C, Dunham MJ, Eisen MB, et al. The awesome power of yeast evolutionary genetics: new genome sequences and strain resources for the Saccharomyces sensu stricto genus. G3: Genes, Genomes, Genetics. 2011;1:11–25. doi:10.1534/g3.111.000273
  • 58. Ontology CG. The gene ontology resource: 20 years and still GOing strong. Nucleic Acids Research. 2018;47:D330–D8.
  • 59. Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology. 2011;7:539. doi:10.1038/msb.2011.75

Decision Letter 0

Roland G Roberts

2 Mar 2020

Dear Dr Eddy,

Thank you for submitting your manuscript entitled "Many but not all lineage-specific genes can be explained by homology detection failure" for consideration as a Research Article by PLOS Biology.

Your manuscript has now been evaluated by the PLOS Biology editorial staff, as well as by an academic editor with relevant expertise, and I'm writing to let you know that we would like to send your submission out for external peer review. Many thanks for being upfront about the recent eLife paper from the McLysaght group. As you may know, PLOS Biology has a strong "anti-scooping" policy (https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.2005203), which means that this will not impact consideration of your study. Indeed, the concordant findings from orthogonal methods are, if anything, a positive.

However, before we can send your manuscript to reviewers, we need you to complete your submission by providing the metadata that is required for full assessment. To this end, please login to Editorial Manager where you will find the paper in the 'Submissions Needing Revisions' folder on your homepage. Please click 'Revise Submission' from the Action Links and complete all additional questions in the submission questionnaire.

Please re-submit your manuscript within two working days, i.e. by Mar 04 2020 11:59PM.

Login to Editorial Manager here: https://www.editorialmanager.com/pbiology

During resubmission, you will be invited to opt-in to posting your pre-review manuscript as a bioRxiv preprint. Visit http://journals.plos.org/plosbiology/s/preprints for full details. If you consent to posting your current manuscript as a preprint, please upload a single Preprint PDF when you re-submit.

Once your full submission is complete, your paper will undergo a series of checks in preparation for peer review. Once your manuscript has passed all checks it will be sent out for review.

Feel free to email us at plosbiology@plos.org if you have any queries relating to your submission.

Kind regards,

Roli Roberts

Roland G Roberts, PhD,

Senior Editor

PLOS Biology

Decision Letter 1

Roland G Roberts

6 Apr 2020

Dear Dr Eddy,

Thank you very much for submitting your manuscript "Many but not all lineage-specific genes can be explained by homology detection failure" for consideration as a Research Article at PLOS Biology. Your manuscript has been evaluated by the PLOS Biology editors, an Academic Editor with relevant expertise, and by five independent reviewers. I'm very sorry about this unusually high number of reviewers (we usually aim for 3 or 4), but we wanted to secure some key expertise, and several delivered in very quick succession.

The Academic Editor has asked me to make it clear that given the number of reviewers and the very unusual circumstances surrounding the Covid-19 crisis, we are open to discussion about which of the reviewers' points are essential to address for further consideration, so do feel free to run a revision plan past us.

You'll also see that several of the reviewers wonder whether you should focus your manuscript more around the method. If you choose to go down that route, you might want to consider changing the article type to "Methods and Resources," which has a lesser requirement for novel biological insight. We leave it up to you to decide whether to do this or to keep it as a regular Research Article.

In light of the reviews (below), we will not be able to accept the current version of the manuscript, but we would welcome re-submission of a much-revised version that takes into account the reviewers' comments. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent for further evaluation by the reviewers.

We expect to receive your revised manuscript within 2 months.

Please email us (plosbiology@plos.org) if you have any questions or concerns, or would like to request an extension. At this stage, your manuscript remains formally under active consideration at our journal; please notify us by email if you do not intend to submit a revision so that we may end consideration of the manuscript at PLOS Biology.

**IMPORTANT - SUBMITTING YOUR REVISION**

Your revisions should address the specific points made by each reviewer. Please submit the following files along with your revised manuscript:

1. A 'Response to Reviewers' file - this should detail your responses to the editorial requests, present a point-by-point response to all of the reviewers' comments, and indicate the changes made to the manuscript.

*NOTE: In your point by point response to the reviewers, please provide the full context of each review. Do not selectively quote paragraphs or sentences to reply to. The entire set of reviewer comments should be present in full and each specific point should be responded to individually, point by point.

You should also cite any additional relevant literature that has been published since the original submission and mention any additional citations in your response.

2. In addition to a clean copy of the manuscript, please also upload a 'track-changes' version of your manuscript that specifies the edits made. This should be uploaded as a "Related" file type.

*Re-submission Checklist*

When you are ready to resubmit your revised manuscript, please refer to this re-submission checklist: https://plos.io/Biology_Checklist

To submit a revised version of your manuscript, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' where you will find your submission record.

Please make sure to read the following important policies and guidelines while preparing your revision:

*Published Peer Review*

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:

https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

*PLOS Data Policy*

Please note that as a condition of publication PLOS' data policy (http://journals.plos.org/plosbiology/s/data-availability) requires that you make available all data used to draw the conclusions arrived at in your manuscript. If you have not already done so, you must include any data used in your manuscript either in appropriate repositories, within the body of the manuscript, or as supporting information (N.B. this includes any numerical values that were used to generate graphs, histograms etc.). For an example see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5

*Blot and Gel Data Policy*

We require the original, uncropped and minimally adjusted images supporting all blot and gel results reported in an article's figures or Supporting Information files. We will require these files before a manuscript can be accepted so please prepare them now, if you have not already uploaded them. Please carefully read our guidelines for how to prepare and upload this data: https://journals.plos.org/plosbiology/s/figures#loc-blot-and-gel-reporting-requirements

*Protocols deposition*

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosbiology/s/submission-guidelines#loc-materials-and-methods

Thank you again for your submission to our journal. We hope that our editorial process has been constructive thus far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Roli Roberts

Roland G Roberts, PhD,

Senior Editor

PLOS Biology

*****************************************************

REVIEWERS' COMMENTS:

Reviewer #1:

Caroline Weisman and colleagues present a superbly written and reasoned analysis that helps answer an important and timely question in evolutionary genetics: how often do genes evolve from non-protein coding sequence? This question has recently received much theoretical, computational and experimental attention in similarly high-profile journals (e.g., Nature Communications, eLife), so it is an entirely appropriate one for PLOS Biology and should interest many readers. This manuscript represents an important contribution to the field and should result in future researchers exercising considerably increased caution before they claim that a lineage-specific gene has evolved de novo or through an otherwise "exotic" evolutionary process.

Typically, lineage-specific genes (which can include de novo genes, but also other classes) are identified by a lack of BLASTP hits in outgroup species. While such evidence is consistent with genetic novelty, novelty is not the only hypothesis that can explain lineage restriction. Here, the authors present a confident, yet careful, analysis of an important alternative: genes that appear to be lineage specific may actually have homologs in outgroup species that are simply too divergent to detect by algorithms such as BLASTP. Elegant in its simplicity (and explained well enough for even non-computational biologists to understand), the method essentially calculates the rate of sequence similarity decay among the species in which homologs can be found. Then, based on this rate, the method determines in which, if any, more distantly related species a homolog should be "findable." If a protein's sequence similarity to its orthologs decays slowly, yet orthologs are not detected, this provides potential evidence for genetic novelty. The authors find that this is sometimes the case in two groups that have been well-studied for novel genes, yeast and insects. However, the authors show that it is much more common for orthologs to simply diverge so much that they become undetectable by BLASTP. This key result, and supporting analyses asking critical questions such as whether considerations of synteny can clarify matters, will be important for anyone working in this area to grapple with if they attempt to claim a gene is novel. Helpfully to the research community, the authors also include both the code used for the analysis and an easily searchable website where yeast and fly researchers can enter their favorite genes to see in which species they should be detectable.
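As a rough illustration of the two steps described in the paragraph above (estimate how fast similarity decays, then ask whether a homolog should still be findable farther away), here is a minimal sketch. It assumes that the similarity score decays exponentially with evolutionary distance, S(t) = a·exp(−b·t), where a and b stand for the two best-fit parameters named in the S3 and S4 figure captions. The log-linear least-squares fit, the function names, the synthetic data, and the score threshold are all assumptions made for illustration; this is not the authors' abSENSE code, which may fit the parameters and assess detectability differently.

```python
import math

def fit_decay(distances, bitscores):
    """Fit ln(S) = ln(a) - b*t by ordinary least squares; return (a, b).

    A simplified stand-in for a proper fitting procedure (assumption).
    """
    n = len(distances)
    logs = [math.log(s) for s in bitscores]
    mean_t = sum(distances) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in distances)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(distances, logs))
    slope = sxy / sxx
    a = math.exp(mean_y - slope * mean_t)  # score extrapolated to distance 0
    b = -slope                             # decay rate per unit distance
    return a, b

def predicted_score(a, b, t):
    """Score expected at evolutionary distance t under the assumed model."""
    return a * math.exp(-b * t)

# Hypothetical, noise-free data: distances (substitutions per site) to species
# where a homolog was found, and the bit scores of those hits.
distances = [0.1, 0.3, 0.6, 1.0]
bitscores = [400 * math.exp(-1.5 * t) for t in distances]

a, b = fit_decay(distances, bitscores)

# Extrapolate to a more distant outgroup and compare with a hypothetical
# detectability threshold (in bits).
outgroup_distance = 2.0
threshold = 40.0
expected = predicted_score(a, b, outgroup_distance)
should_be_detectable = expected > threshold
```

With real data the scores scatter around the fitted curve, so a single yes/no comparison like this one would normally be replaced by a probability that the score exceeds the threshold.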

I have no major issues with the manuscript as written, as both the text and the figures are clear, well-reasoned, and easy to follow. The authors are also upfront about some minor limitations to their analyses, which is appreciated. The minor points below are simply for the authors' consideration as potential additional details to consider discussing:

1. I have been trying to think through whether there are any limitations to using the BUSCO set of proteins to draw inferences about potential lineage-specific proteins, since these classes may differ in some notable parameters. For example, proteins in the BUSCO set are likely to be much longer than a hypothetical, recently evolved de novo protein, to evolve more slowly (since wide conservation is a criterion for inclusion in BUSCO), to have more complex structures, and to perhaps be less prone to duplication. Do the authors think that any of these differences place any limitations on their analyses? I suspect not, since shorter lengths and faster evolution are likely to exacerbate the issue of non-detectability... which just supports the authors' conclusion that these are major limitations to making a claim about novel evolution.

2. On a related note, one issue that the manuscript highlights for me is the special difficulty posed by protein length in assessing lineage specificity. Even the shortest proteins shown in the examples of Figure 1 are long compared to most of the proteins discussed in studies of novel/de novo genes (e.g., fly eIF4B is 459 a.a.). The graphs in the figure illustrate that the shorter the length, the lower a protein starts on the similarity score y-axis, and thus the less room the score has to fall before an ortholog at a given distance becomes undetectable at e = 0.001. If the authors see any use in adding a bit in the discussion about how protein length affects their results and conclusions (e.g., are shorter proteins more likely to fall into the non-detectable category?), it could be interesting, but I defer to their judgement here.

3. Did the authors look into how much changing the BLASTP e-value cut-off would affect their results?

4. Another criterion sometimes used to support the de novo emergence hypothesis is that the protein should lack structural similarity to other proteins, as would be expected for a protein that recently emerged from random sequence. This might be useful to consider here, since structural conservation tends to decay less quickly than primary sequence conservation. If there is a good way of assessing this, do proteins that fall into the "should be detectable, but aren't" category (i.e., probability values close to 1 in the top row of graphs in Figs. 4-5) have any major patterns of structural differences from proteins that fall into the "too divergent to be detected" category (probability values close to 0 in these figures)? (If it is easier to look into this specifically for, say, the 25 yeast genes described on p. 9 whose absence outside of the sensu stricto group is poorly explained by homology failure, that would be fine.)

5. While the web server for checking individual genes is already very useful, it would also be helpful for the authors to list in a supplemental table the supposedly lineage-specific individual S. cerevisiae and D. melanogaster genes that were used in the analyses for Figs. 4-5, along with their probabilities of detection at each outgroup species "t." This could make it easier for other groups to contribute to the further characterization the manuscript calls for, since the genes whose lineage-specificity cannot be easily explained by homology detection failure could be prioritized.
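Points 2 and 3 above both concern how far a score can fall before a hit drops below the E-value cutoff. A minimal sketch of that arithmetic follows. It uses the standard relation between bit score S' and E-value, E = m·n·2^(−S'), with m the query length and n the database length, together with the same assumed exponential decay S(t) = a·exp(−b·t) as above. The database size, the decay rate, and the rule of thumb a = 2 × length are hypothetical values chosen only for illustration, and real BLAST searches use effective (edge-corrected) lengths that this sketch ignores.

```python
import math

def bitscore_threshold(query_len, db_len, evalue):
    """Bit score S' at which E = m * n * 2**(-S') equals the E-value cutoff."""
    return math.log2(query_len * db_len / evalue)

def max_detectable_distance(a, b, threshold):
    """Distance at which a * exp(-b * t) falls to the threshold.

    Returns 0.0 when the starting score a is already at or below the threshold.
    """
    if a <= threshold:
        return 0.0
    return math.log(a / threshold) / b

DB_LEN = 3_000_000   # hypothetical proteome size in residues
DECAY_RATE = 1.0     # hypothetical b, per substitution per site

def reach(length, evalue):
    """Farthest detectable distance for a protein of the given length."""
    a = 2.0 * length  # hypothetical: starting score proportional to length
    return max_detectable_distance(
        a, DECAY_RATE, bitscore_threshold(length, DB_LEN, evalue)
    )

short_reach = reach(100, 1e-3)   # a short protein
long_reach = reach(459, 1e-3)    # the length cited above for fly eIF4B
loose_reach = reach(100, 10.0)   # same short protein, permissive cutoff
```

Under these assumptions the longer protein stays detectable over a greater distance, and relaxing the E-value cutoff extends the reach of the short protein. Both effects enter only through the logarithm of the ratio a/threshold.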

Thanks to the authors and PLOS Biology for the opportunity to review such a nice, interesting, and important manuscript!

Reviewer #2:

The authors ask whether high evolutionary rates could explain the occurrence of orphans, i.e. genes that are only found in specific lineages. This is an old question and the authors address it by proposing a null-model for the evolutionary divergence of proteins and comparisons of yeast and Drosophila gene sets in the framework of this model. The paper touches on rather general concepts of evolution of proteins that have come in cycles in the past decades. The question of detectability by BLASTP has also been a subject of discussion for quite some time. Accordingly, the issues addressed, as well as the conclusions, have been discussed before in various combinations and at various times. Of course, with today's genomic data, the conclusions could potentially be better supported. But the curse of the genomic era is that algorithms and statistics dominate the outcome, without carefully looking at underlying assumptions and possible artefacts. The present paper is no exception in this respect. In addition, it is presented in an odd way, starting from the manuscript organization to the failure of providing the relevant data. The latter compromises proper reviewing and I therefore recommend rejection at this stage, simply because a full evaluation of the specific claims is not possible. But even if properly updated, it would likely better fit into a journal specialized on molecular evolution or algorithm development.

General:

The idea of a null model for the evolution of proteins with a constant substitution rate over time goes back to Zuckerkandl after doing the first globin sequence comparisons (if I remember right). It sparked the question of the possible existence of a molecular clock, which was extremely controversial for some time. The discussions resolved into the conclusion that there is apparently indeed a set of proteins that follow clock-like patterns, but that there are also others that behave rather differently, with lineage-specific changes in substitution rates. Further, it has become clear that whole lineages of taxa can show accelerated substitution rates. In molecular phylogeny studies, it had therefore become standard to test clock assumptions before using a gene for a phylogeny, since one would otherwise deal with artefacts. Hence, the assumption of the present paper that a constant decay rate, evenly spaced across a protein, would be a suitable null-model is an idealistic concept, which is only partially supported by the available data. There were therefore good reasons why previous papers on this question have taken more complex parameters into account (as cited). Why one would want to fall back behind these standards is not clear.

In the present paper, a lineage specific acceleration has not been considered. Diptera (e.g. including Drosophila) show lineage specific acceleration rates, while beetles (e.g. including Tribolium) show particularly low rates (Savard et al. 2006, BMC Evol Biol. 25;6:7). Hence, rate calculations obtained from fly comparisons cannot be directly projected to other insects without correcting for this effect. The yeast dataset would also have to be checked for this potential problem.

Second, it is also important to check whether there are even or uneven substitution rates along the length of a given gene, i.e. whether it includes a domain with high conservation in a background of low conservation (for example, most transcription factors and receptors fall into this class). Since BLAST requires only a small seed sequence for detecting a homologue, a short conserved domain is often sufficient to find it, even though the E-value may be pretty low, because of the divergence around it. This was the key argument of Alba and Castresana (ref 24) and it is rather unclear why the current authors throw it overboard. The claim that it is better to make simpler assumptions may be appropriate in physics, but is seldom correct in biology. The few examples they show in the paper cannot convince, since these are picked examples, while the primary data for all analyses are not provided. I expect that one could also pick different examples from them that would show a different pattern. Further, the use of BUSCO genes for calibration (without providing the relevant primary data) is rather one-sided, since this is a very select group of genes (and anyway usually used for other purposes, i.e. it is unclear why rate validation could be based on it).

Another problem that has been extensively discussed in the past is whether the routines for gene annotation generate biases on their own, which in turn bias conclusions drawn from them. There are actually many papers that show this now, starting from the insight that a filter on minimum ORF length does not make much sense, to the realization that quite a few coding regions include more than one functional ORF. Also, annotators consider an annotation as less reliable when no homologs are found in other species and tend therefore to remove them. Hence, when one relies on the annotated list of proteins only, especially on secondarily "curated" lists as it is done here, one loses many of the novel genes. Accordingly, suggesting general fraction-numbers for the relative detectability (or non-detectability) of novel genes does not seem appropriate when one has a biased dataset from start.

Yet another old discussion is the question of proper alignments to calculate substitution rates. In the early times when people have started to do this, there were clear recommendations that every alignment had to be manually inspected, that indels had to be treated in a consistent way (usually removed throughout the alignment together with some flanking regions) and that length changes needed to be considered etc. In fact, producing an appropriate alignment was a major achievement on its own. The present paper just uses a single algorithm (MUSCLE) and runs it under default parameters only, apparently without further manual inspection. However, it is well known that different parameters need to be used for different proteins, especially the gap penalty parameter can make a huge difference for the substitution rate calculation derived from such an automatic alignment.

If one removes the quantitative message from the paper, there is still a credible qualitative message, namely that a set of genes diverges beyond recognition because of fast evolution, while other genes evolve more slowly, yet show no matches in distant species. This is an important conclusion, but merely confirms what has been shown before (ref 5). These previous authors had basically asked the same question, came to the same qualitative conclusions and even the discussion is similar.

Finally - and this has also been an active discussion in the past years - the authors fail to distinguish between novel genes and de novo evolved genes. The papers and reviews on de novo genes from recent years make it very clear that proof of de novo evolution can only come from comparisons between very closely related species, where synteny with ancestral non-coding sequences can be shown. The question of detectability by BLAST is not an issue in these cases, since the genomes to be compared should be >90% similar anyway. Hence, the present paper adds at best speculations to this discussion. The discussion on novelty is also problematic, since it is based on the assumption that two proteins that have diverged beyond recognizability would still have the same function since they are homologs (i.e. have a common evolutionary descent). But, for example, the fore limbs of all vertebrates are homologous and have nonetheless developed different lineage-specific functions. Homology alone does not imply function and this should apply also to proteins. Hence, the identification of remote homologues by itself does not yield insight into the evolutionary novelty question.

Actually, it would be much more appropriate if this paper would be framed within the context of the evolutionary origin question of proteins that goes back to the famous 1977 paper of Francois Jacob on "Evolution and Tinkering". A main conclusion in the present paper seems to be that the majority of genes has remote homologs, which implies that they would have arisen already at the times of the origin of life, i.e. the authors would probably support the Jacob scenario. It is up to them how they want to discuss this in the light of proven cases for de novo evolution, but this is actually the real question that they are addressing in their study.

Overall, with its simplified assumptions, this paper has more the character of an academic exercise than of a large step forward. It uses an idealistic concept of protein evolution, develops a corresponding algorithm for testing it and comes up with some interesting observations in detail. However, the overall conclusions are not really new and the conclusions on percentages of detectability are based on ignoring the necessity for parameterization, as well as a limited dataset that has excluded the more interesting genes for this question from the start.

Specific comments:

- the separation of figures from the text with a further separation of figure legends from the figures makes the paper very difficult to review unless one prints it out; this is a bit outdated

- no line numbers are included, which makes it difficult to list specific comments; hence, although I would have many, I do not add them here, since the paper would anyway require a major revision

Reviewer #3:

I found this paper thoughtful and very well written, presenting a novel and useful new methodology, and am happy to suggest that it be accepted by plos biology with some minor revisions. I have just a few general comments and specific suggestions for the manuscript:

General comments:

1) It strikes me that readers may be surprised by how well the proposed neutral model, which simply accounts for length and divergence time/evolutionary rate, fits the data, without accounting for any other forces, e.g. selection to retain function, etc. Some discussion of neutral divergence and the evolutionary forces acting at the protein level could be included in the introduction or discussion.

On a similar note, possibly outside the scope of this work, I wonder if different categories of genes, or genes general to many lineages as opposed to being lineage specific, fit the null model equally well?

2) I would like to see more discussion of the effect of differences in evolutionary rate over time, and over different parts of the protein- especially since the model fit for b is worse to the real gene-specific evolutionary rate.

3) Results page 16- I find the gene ontology enrichment results difficult to interpret, given that there is no comparison to GO enrichment searches for genes that are not poorly explained by homology detection failure.

4) Discussion - Vakirlis et al. find evidence that considerably fewer genes are missed due to homology detection failure than the authors do. Can the authors expand on this discrepancy further? I found it quite surprising, especially since Vakirlis et al. also use yeast and Drosophila.

Suggestions for clarity:

- Notation: the authors use a and b for their model parameters. These seem somewhat arbitrary choices, I suggest replacing them throughout: for example, replace 'a' with 'l', for length and 'b' with something like 'r', for rate, in order to help readers follow and distinguish between them.

- Notation formatting inconsistency: e is not consistently italicised throughout the manuscript.

- Fig 1: the legend could include an overview of what is meant by similarity score, for completeness. The dashed line also appears to be a different width in a than in b, the line in 1A should be made finer, so it is more clearly not at 0.

- Figure 4 and 5- the authors could consider rearrangement? Following the text as currently written makes the figure layout quite confusing. Rearrange to appear in same order as referenced in text, and/or include slightly longer, more descriptive figure legends to explain the middle row as a kind of positive control, and bottom to more fully explain that the highlighting is for the lineage specificity of genes being included. The focal outgroup species name could also be highlighted in the phylogeny for additional clarity.

-Introduction page 9 - inclusion of large divergence in introductory paragraph- this is potentially confusing and could be clarified, because the meaningful distinction between this kind of divergence and genes for which we fail to find homologs is not obvious. If we can't find a homolog, is that sufficient divergence for the evolution of new function?

- Results paragraph beginning 'As both the sensu stricto yeasts and the drosophilid flies are relatively young lineages': the authors could more clearly lay out their methodology, of including genes specific to lineages with older divergence times in their analysis. The current language of 'testing two additional lineages', seems a bit too brief and may be confusing.

- Method page 12 :

'Proteins in the other yeast and insect species' - this makes it sound like insects were included in the yeast analysis, which is not the case?

'Proteins in the other yeast and insect species satisfying this reciprocal best hit criterion were considered orthologs of the S. cerevisiae and protein' - This is quite unclear, 'protein' meaning the protein in the other yeast proteomes? I think generally this methods paragraph could be made clearer- possibly by breaking this section up into two paragraphs, with one describing data collection and ortholog searches, and one describing the reciprocal best BLAST hits methodology.
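The reciprocal best hit criterion discussed in this comment can be illustrated with a minimal sketch. It assumes the best hit in each search direction has already been extracted into a dictionary; the function name `reciprocal_best_hits` and the gene identifiers are hypothetical and are not taken from the manuscript.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return sorted (a, b) pairs where b is the best hit of a AND a is the best hit of b.

    Both arguments are dicts mapping a query protein to its single best hit
    in the other proteome (hypothetical, precomputed inputs).
    """
    pairs = []
    for a, b in best_a_to_b.items():
        # Keep the pair only if the reverse search points back to the same protein.
        if best_b_to_a.get(b) == a:
            pairs.append((a, b))
    return sorted(pairs)


# Hypothetical best-hit tables for a focal proteome and one other proteome.
focal_to_other = {"geneA": "orth1", "geneB": "orth2", "geneC": "orth2"}
other_to_focal = {"orth1": "geneA", "orth2": "geneB"}

# geneC is dropped: its best hit (orth2) points back to geneB, not geneC.
print(reciprocal_best_hits(focal_to_other, other_to_focal))
```

Separating the two steps this way (collecting best hits per direction, then applying the reciprocity filter) mirrors the reviewer's suggestion to describe data collection and the reciprocal best hit methodology in separate paragraphs.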

Reviewer #4:

[identifies himself as Arne Elofsson]

In this paper, the authors propose a novel method to estimate the probability that homologs in related species are missed because of rapid sequence evolution. This is then applied to argue that a number of (but not all) earlier proposed de novo created genes most likely are not de novo created. This is a very nice mathematical model and it provides some valuable insights into non-optimal assumptions made in earlier papers. However, the results are not that surprising for most people, as the problem of fast-evolving genes has always been discussed in the context of de novo creation. Anyhow, I think this is a valuable method that should be published, in particular if it could be used to stop still-appearing papers assuming that homology detection is very good and we can trace back all protein domain family relationships to LUCA.

Major:

One underlying assumption in this study is that an entire gene has a uniform evolutionary rate. This is certainly not a correct assumption, even if for normal genes it might be an acceptable one. For globular proteins, the general trend for variation in evolutionary rate between sites is dominated by whether residues are buried or exposed. However, here the authors focus on genes that for some reason appear to evolve very fast, and these are certainly quite different. Many of these proteins are intrinsically disordered, and this (in addition to a faster evolutionary rate) means that the ratio between insertions/deletions and mutations is different and that the amino acid preferences are different. I would assume that taking site-specific evolutionary rate into account would not shift the results significantly (likely a small number of the potentially de novo created genes would remain).

Reviewer #5:

[please also see downloadable Word doc]

Virtually all genomes contain genes that lack detectable homologues beyond a certain evolutionary distance. There is high interest in understanding the evolutionary mechanisms that underlie the emergence of these lineage-specific genes. In this manuscript, the authors propose to contrast two classes of evolutionary mechanisms: mechanisms that involve "novelty", such as de novo gene origination or rapid divergence following a duplication event; versus "uneventful" mechanisms, where a steady evolutionary rate eventually leads to the lack of detectability of ancient homologues. The authors present an analytical approach to identify lineage-specific genes that could be explained by a "null model" of homology detection failure due to uneventful evolution at a steady rate. The authors define a simple and elegant mathematical model based on such a scenario and apply it to two lineages, focused on S. cerevisiae and D. melanogaster. They find that potentially a majority of lineage-specific genes could be explained by homology detection failure. The manuscript is very clearly written and has the potential to be a solid contribution to our understanding of the origins of lineage-specific genes and to the wider field of evolutionary genomics. The method itself, freely available on github, could prove very useful for the research community.

Major concerns

The issue of homology detection failure is an important one and has been posed before in the literature. While the authors cite the relevant papers, the manuscript does not sufficiently address how this simulation approach differs from previous ones, why it is better, and to what extent it finds the same or different results. In particular, the simulation approach by Moyers and Zhang was based on a similar premise and was applied to the same two focus species. It would be important to fully consider the similarities and differences between the approaches and compare the results.

Relatedly, the introduction is surprisingly short and does not paint a full picture of the state of the art in the field. Key studies are only mentioned in the results and in the discussion. The manuscript would be greatly improved if the introduction were more complete. What previous studies have addressed the same question? What did they find and how? What evidence is there, apart from absence of similarity, for de novo gene emergence? Furthermore, the dichotomy presented by the authors in the second and third paragraphs can be misleading, since rapidly evolving duplicates do have homologues, while de novo genes do not. It should be explained more clearly, and put in the context of previous literature which grouped genes with homologues together and contrasted them to de novo genes (e.g., Vakirlis et al. eLife 2020).

Another issue with context concerns what is known about "new" genes. For instance, multiple reports have suggested that they have a higher evolutionary rate than other genes. This is a key piece of context that could drastically impact the validity of the model. Indeed, with the authors' approach, such genes will very often, possibly almost always, be explained by the null hypothesis when they are in fact truly "novel". This issue is hard to overcome and should at minimum be acknowledged and discussed in detail, as it appears to be an important limitation.

The authors need to provide solid evidence that their model can truly distinguish between an evolutionary rate that is fast, but constant (null model), and an accelerated rate (novelty model). This would also be helpful to address the above.

The model that the authors propose makes an array of assumptions. For example, we understand that since the predicted bitscores correlate well with the real ones, it's safe to use them. However in the insect dataset there is plenty of variability and a peak seemingly at 0 (supp fig 1). This is important because it shows that, at least in the insects dataset, there are many cases where the model doesn't fit. This of course is expected; after all no model is perfect. But the authors should make sure that there isn't anything particular about these genes that makes them deviate from the model (and whether this could be relevant to lineage-specific genes). This caveat should also be appropriately discussed. The source data for each gene should be made available. Another assumption made is that evolutionary rates are stable across sites; but this is widely known not to be the case. This is another important limitation and should be acknowledged.

Synteny guided similarity searches are presented in two parts in the manuscript. First, for those genes for which homology detection failure is a very likely explanation. In this part, we read that out of the 126 S. cerevisiae genes considered, only 24, so 1/5 have an orthologous locus in at least one outgroup, based on YGOB. First, it should be clear whether that means simply that the syntenic region is identifiable or that there is also a gene present. Given the number of comparisons and the level of conserved synteny in yeasts the former would be surprising. If the latter, the number of cases where the orthologous region can be identified but no gene exists should be provided and commented upon. This is an important control for the authors' main claim: given that these genes are the best candidates for having simply diverged beyond detectability at the level of the clade in question, we would expect that for many of them a candidate homologue would be present in the predicted genomic region. If this isn't the case the authors should discuss why. Second, for those genes for which homology detection failure seems very unlikely. In this part, the out of lineage orthologues are not provided in a supplementary table, as is the case previously. The authors should make these data available.

By design the analyses only focus on the most conserved subset of lineage-specific genes: those with sufficient homology to be able to fit model parameters. Moreover, yeast dubious ORFs are not included although they have been proposed to be enriched for true evolutionary novelty. This should be discussed as it is context for interpreting the main findings of the manuscript, which in several places reads as if it could provide an estimate of what proportion of lineage specific genes are truly novel. In fact, the manuscript estimates proportions based on a restricted search space (that which is conserved enough for the model to be applied). More careful attention to the writing of these conclusions, including in the title, is requested. Note that this limitation to the search space could potentially also explain why the estimates are higher than in previous literature such as Vakirlis et al. 2020, a question raised in the discussion.

Additional questions and minor concerns

- Supp. Tables 2 and 5 are the same file. The legend of Supp. Table 3 does not correspond to what the table actually is. The .eps format for supplementary figures is in general problematic.

- Why include a bitscore data point of the alignment of the protein against itself? This is not justified in the manuscript. How does removing this affect the results?

- Why is the fungi dataset only half as big as the insect one? Since this is a substantial difference it should be appropriately justified.

- Why were orthologues predicted using a method that the authors themselves admit is less than perfect, instead of using pre-computed ones from one of the many available resources of orthologues identified using more sophisticated approaches than RBH (https://questfororthologs.org/orthology_databases) ?

- Some data did not appear to have been made available. For example, the lists of the lineage-specific genes at the various clade levels, the probabilities for individual genes etc. All such data should be made available.

- In the supp tables, species names should be italicized.

Attachment

Submitted filename: review_uploaded.docx

Decision Letter 2

Roland G Roberts

23 Jul 2020

Dear Sean,

Thank you for submitting your revised Research Article entitled "Many but not all lineage-specific genes can be explained by homology detection failure" for publication in PLOS Biology. I have now obtained advice from the original reviewers and have discussed their comments with the Academic Editor.

Based on the reviews, we will probably accept this manuscript for publication, assuming that you will modify the manuscript to address the remaining points raised by the reviewers. Please also make sure to address the data and other policy-related requests noted at the end of this email.

IMPORTANT:

a) You'll see that while reviewers #3 and #4 are now satisfied, reviewer #5 still has some residual requests, mostly pertaining to the relationship between your approach and that of Vakirlis et al. Regarding these points, the Academic Editor says, "I do think the textual revisions need to make clear the deficiencies in the method be dispassionately discussed relative to the previously published method. I do not agree that major point two requires a large scale analysis but I would be in favor of a few cherry picked examples where lowering the number of homologs allows us to see how the method does. In this respect I am contradicting the reviewers but again I think the point can be made illustratively."

b) Please attend to my Data Policy requests further down.

We expect to receive your revised manuscript within two weeks. Your revisions should address the specific points made by each reviewer. In addition to the remaining revisions and before we will be able to formally accept your manuscript and consider it "in press", we also need to ensure that your article conforms to our guidelines. A member of our team will be in touch shortly with a set of requests. As we can't proceed until these requirements are met, your swift response will help prevent delays to publication.

*Copyediting*

Upon acceptance of your article, your final files will be copyedited and typeset into the final PDF. While you will have an opportunity to review these files as proofs, PLOS will only permit corrections to spelling or significant scientific errors. Therefore, please take this final revision time to assess and make any remaining major changes to your manuscript.

NOTE: If Supporting Information files are included with your article, note that these are not copyedited and will be published as they are submitted. Please ensure that these files are legible and of high quality (at least 300 dpi) in an easily accessible file format. For this reason, please be aware that any references listed in an SI file will not be indexed. For more information, see our Supporting Information guidelines:

https://journals.plos.org/plosbiology/s/supporting-information

*Published Peer Review History*

Please note that you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:

https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

*Early Version*

Please note that an uncorrected proof of your manuscript will be published online ahead of the final version, unless you opted out when submitting your manuscript. If, for any reason, you do not want an earlier version of your manuscript published online, uncheck the box. Should you, your institution's press office or the journal office choose to press release your paper, you will automatically be opted out of early publication. We ask that you notify us as soon as possible if you or your institution is planning to press release the article.

*Protocols deposition*

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosbiology/s/submission-guidelines#loc-materials-and-methods

*Submitting Your Revision*

To submit your revision, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' to find your submission record. Your revised submission must include a cover letter, a Response to Reviewers file that provides a detailed response to the reviewers' comments (if applicable), and a track-changes file indicating any changes that you have made to the manuscript.

Please do not hesitate to contact me should you have any questions.

Sincerely,

Roli

Roland G Roberts, PhD,

Senior Editor

PLOS Biology

------------------------------------------------------------------------

DATA POLICY:

You may be aware of the PLOS Data Policy, which requires that all data be made available without restriction: http://journals.plos.org/plosbiology/s/data-availability. For more information, please also see this editorial: http://dx.doi.org/10.1371/journal.pbio.1001797

Note that we do not require all raw data. Rather, we ask that all individual quantitative observations that underlie the data summarized in the figures and results of your paper be made available in one of the following forms:

1) Supplementary files (e.g., excel). Please ensure that all data files are uploaded as 'Supporting Information' and are invariably referred to (in the manuscript, figure legends, and the Description field when uploading your files) using the following format verbatim: S1 Data, S2 Data, etc. Multiple panels of a single or even several figures can be included as multiple sheets in one excel file that is saved using exactly the following convention: S1_Data.xlsx (using an underscore).

2) Deposition in a publicly available repository. Please also provide the accession code or a reviewer link so that we may view your data before publication.

Regardless of the method selected, please ensure that you provide the individual numerical values that underlie the summary data displayed in the following figure panels as they are essential for readers to assess your analysis and to reproduce it: Figs 1, 3, 4, 5 and all Supplementary Figs. NOTE: the numerical data provided should include all replicates AND the way in which the plotted mean and errors were derived (it should not present only the mean/average values).

Please also ensure that figure legends in your manuscript include information on where the underlying data can be found, and ensure your supplemental data file/s has a legend.

Please ensure that your Data Statement in the submission system accurately describes where your data can be found.

------------------------------------------------------------------------

REVIEWERS' COMMENTS:

Reviewer #3:

The authors have addressed all of my original comments, and clarified a number of important points in the manuscript in response to the reviewers. The authors have also included additional analyses which further highlight the utility of their method. I have no further comments, and recommend that the manuscript be accepted for publication.

Reviewer #4:

[identifies himself as Arne Elofsson]

My questions are fully satisfied and I think this paper is ready for publication.

Reviewer #5:

In this revised manuscript, the authors have clarified where the novelty of their work lies relative to the literature: providing a method to test whether a particular lineage-specific gene could be evolving under an evolutionary null model, independently from the syntenic conservation of that gene. The other points discussed in the revised manuscript (that BLAST-based methods are not sufficient to identify true evolutionary novelty, and that lineage-specific genes encompass some true novelty and some homology detection failure) are not novel at this date. Now that this is clarified, we agree that this novel method is a promising addition to the field, and would warrant publication if two major changes are made to the manuscript:

1 - In the writing of the manuscript (the whole manuscript, from title to discussion), be clear about the limitations of the method, specifically that it can only be applied to a subset of genes: those with at least two known homologues within a lineage, and no known homologues outside of the lineage. In the two lineages studied, this is 155 annotated yeast genes (of 375 annotated lineage-specific genes), 167 dubious yeast genes (of 784 dubious yeast genes), 1278 Drosophila genes (of 1611 lineage-specific Drosophila genes). These numbers do not justify the generality of claims made by the authors of having applied their method to "all" lineage-specific genes, nor their quantitative and qualitative conclusions reflected in their title and discussion (including that "nearly all" dubious ORFs fail to be detected).

2 - Given that the method, and its application to individual genes, are the key novel points of the manuscript, the authors need to demonstrate that the method indeed works on lineage specific genes. That is, they need to show that the parameter estimation step of their methods yields accurate results when only two homologues in closely related species are used. This can be done by using conserved genes and estimating the parameters when considering all versus only two closely related homologues, and comparing the results. This needs to be done at large scale, rather than on hand picked examples, in order to quantify the success rate of the method. Without such a quantitative evaluation of the method in the context it is meant to be applied to, we do not know if it really is informative.
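The evaluation requested in point 2 can be sketched as follows. This is only an illustration under stated assumptions: it assumes the null model takes the exponential-decay form S(t) = a·exp(−b·t) for expected bitscore S at evolutionary distance t, and it fits a and b by ordinary least squares on ln S, which is a simplification chosen here and not necessarily the authors' fitting procedure. The function names and all numerical values are hypothetical.

```python
import math


def fit_decay(distances, bitscores):
    """Fit ln(S) = ln(a) - b*t by ordinary least squares; return (a, b)."""
    n = len(distances)
    logs = [math.log(s) for s in bitscores]
    mean_t = sum(distances) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in distances)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(distances, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    return math.exp(intercept), -slope


def predict(a, b, t):
    """Expected bitscore at distance t under the assumed decay model."""
    return a * math.exp(-b * t)


# Hypothetical conserved gene: distances (substitutions/site) and noisy bitscores.
distances = [0.1, 0.2, 0.5, 1.0, 1.5]
noise = [1.03, 0.97, 1.05, 0.94, 1.02]  # made-up multiplicative noise
scores = [500.0 * math.exp(-1.2 * t) * k for t, k in zip(distances, noise)]

# Fit with all homologues versus only the two closest ones.
a_all, b_all = fit_decay(distances, scores)
a_two, b_two = fit_decay(distances[:2], scores[:2])

# Compare the predicted score at a held-out, more distant species.
held_out = 2.0
print("all homologues:", predict(a_all, b_all, held_out))
print("two homologues:", predict(a_two, b_two, held_out))
```

Repeating this comparison over many conserved genes, and recording how often the two-homologue prediction agrees with the score actually observed in the distant species, would give the kind of large-scale success rate the reviewer asks for.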

Additional points:

(new point) Several of the modifications and new analyses presented by the authors seem to point to indels/gaps as a potential major limitation of the approach. It may be difficult to address for this manuscript, but it would be nice to point this out clearly in the discussion and present it as an exciting opportunity for future research in the field.

34. In summarizing the recent work of Vakirlis et al., the authors mention that the method Vakirlis et al. applied does not allow one to determine which particular lineage-specific genes may be due to homology detection failure. This statement is false. What is true is that that method cannot determine all such genes across the genome (it only finds those with adequate synteny). But of course, neither does the authors' approach (it only applies to those with two or more homologues).

36. The issue is not adequately addressed in the discussion. The authors need to clarify that "it is possible for a gene to be consistent with the null hypothesis yet nonetheless be a novel lineage specific gene" if, as suggested by multiple reports, novel genes have high evolutionary rates. The discussion, as well as the title of the manuscript, read as if failure to reject the null hypothesis should be interpreted as the gene is not novel because that is the most conservative assumption. Instead, the percentage of genes that are truly novel but fail to reject the null hypothesis (i.e., the false negative rate of the authors' method) is unknown. This is a general limitation of the method and should be clearly stated as such.

43. This point was not sufficiently addressed in the authors' response, so it is reiterated as the first major comment here. However, the authors have added an additional analysis about Dubious ORFs, which is interesting but presents some important limitations in the statistical analysis of the results and in their interpretation.

- Do all genes analyzed have detectable homologues in S. kudriavzevii? If not, how can all the distances then be measured relative to S. kudriavzevii? Then the authors state that "We chose S. kudriavzevii because it is the most distant species from cerevisiae according to our analysis (Figure 2)." But in Figure 2 the most distant Saccharomyces species from cerevisiae is S. bayanus. Is this a typo? Perhaps the methodological explanations can be better described here.

- The conclusions and interpretations are not statistically supported. All comparisons between distributions must be quantified with effect size and p-value in order to make any conclusion.

- The authors make no mention of the evolutionary rate comparison within and outside of microsyntenic regions presented by Vakirlis et al. It is crucial, if the authors want to support their point, to do a fair comparison and discuss their results in the context of the Vakirlis results. The authors could show their distributions separately for genes also included in the Vakirlis et al. dataset and those ~400 not included, and do separate statistics for the two groups. This would immediately show if a bias is present in these genes specifically, or if it's a matter of the different measures being used for evolutionary rate.

46. Our question was referring to the difference in number of species, not number of genes. This question still requires an explanation.

Decision Letter 3

Roland G Roberts

21 Sep 2020

Dear Dr Eddy,

On behalf of my colleagues and the Academic Editor, Harmit S Malik, I am pleased to inform you that we will be delighted to publish your Research Article in PLOS Biology.

The files will now enter our production system. You will receive a copyedited version of the manuscript, along with your figures for a final review. You will be given two business days to review and approve the copyedit. Then, within a week, you will receive a PDF proof of your typeset article. You will have two days to review the PDF and make any final corrections. If there is a chance that you'll be unavailable during the copy editing/proof review period, please provide us with contact details of one of the other authors whom you nominate to handle these stages on your behalf. This will ensure that any requested corrections reach the production department in time for publication.

Early Version

The version of your manuscript submitted at the copyedit stage will be posted online ahead of the final proof version, unless you have already opted out of the process. The date of the early version will be your article's publication date. The final article will be published to the same URL, and all versions of the paper will be accessible to readers.

PRESS

We frequently collaborate with press offices. If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximise its impact. If the press office is planning to promote your findings, we would be grateful if they could coordinate with biologypress@plos.org. If you have not yet opted out of the early version process, we ask that you notify us immediately of any press plans so that we may do so on your behalf.

We also ask that you take this opportunity to read our Embargo Policy regarding the discussion, promotion and media coverage of work that is yet to be published by PLOS. As your manuscript is not yet published, it is bound by the conditions of our Embargo Policy. Please be aware that this policy is in place both to ensure that any press coverage of your article is fully substantiated and to provide a direct link between such coverage and the published work. For full details of our Embargo Policy, please visit http://www.plos.org/about/media-inquiries/embargo-policy/.

Thank you again for submitting your manuscript to PLOS Biology and for your support of Open Access publishing. Please do not hesitate to contact me if I can provide any assistance during the production process.

Kind regards,

Alice Musson

Publishing Editor,

PLOS Biology

on behalf of

Roland Roberts,

Senior Editor

PLOS Biology

Associated Data

    Supplementary Materials

    S1 Table. Inferred distances in substitutions/site from S. cerevisiae to each yeast species (top) and from D. melanogaster to each insect species (bottom).

    Distances were inferred from all BUSCOs with orthologs identifiable in each species group, as well as from 15 genes randomly selected from these BUSCOs. The “15 BUSCOs subset 1” distances were used for all main figures in the text. BUSCO, Benchmarking Universal Single Copy Ortholog.

    (XLSX)

    S2 Table. Sources of species protein annotations used in this study.

    (XLSX)

    S3 Table. All lineage-specific genes and their values of P(detected | null model, t_outgroup) for the 6 lineages (3 fungi, 3 insect) considered here.

    (XLSX)

    S4 Table. Correlation coefficients for gene detectability prediction results based on evolutionary distance estimates derived from 3 different sets of genes (the same as those shown in S1 Fig).

    (XLSX)

    S5 Table. List of 11 S. cerevisiae genes for which synteny-based searches in YGOB revealed candidate out-of-lineage orthologs, the YGOB IDs of those orthologs, and their synteny search E-values.

    YGOB, Yeast Gene Order Browser.

    (XLSX)

    S6 Table. List of sensu stricto–specific S. cerevisiae genes that are poorly explained by the hypothesis of detection failure and their features as described in summary in the text.

    (XLSX)

    S7 Table. List of RefSeq accession IDs for BUSCOs used in evolutionary distance calculations.

    BUSCO, Benchmarking Universal Single Copy Ortholog.

    (XLSX)

    S1 Fig. r² distributions for the fit to the model of S. cerevisiae and D. melanogaster genes using evolutionary distances derived from 3 sets of genes.

    a: S. cerevisiae genes with distances derived from 102 BUSCOs. b: S. cerevisiae genes with distances derived from a randomly selected subset of 15 of the BUSCOs used in a. c: S. cerevisiae genes with distances derived from a second randomly selected subset of 15 of the BUSCOs used in a. d: D. melanogaster genes with distances derived from 125 BUSCOs. e: D. melanogaster genes with distances derived from a randomly selected subset of 15 of the BUSCOs used in d. f: D. melanogaster genes with distances derived from a second randomly selected subset of 15 of the BUSCOs used in d. In d-f, the peak near r² = 0 consists of genes with orthologs identifiable only in a subset of the closely related Drosophilid flies, such that their sequences are identical or nearly identical in all species, except 1 or 2 in which a large chunk of the melanogaster protein is absent from the annotation. As a result, almost none of the variance in score (of which there is none, save this large event) is explained by divergence time. We consider this an artifact of the method, as it only appears in the limited cases where the sequences in question are almost totally identical. Data used to generate these figures are available at https://github.com/caraweisman/abSENSE/tree/master/Data_for_supplemental_figures. BUSCO, Benchmarking Universal Single Copy Ortholog.

    (EPS)

    S2 Fig. Distribution of position of BLASTP scores between S. cerevisiae and outgroup yeast (top) and D. melanogaster and outgroup insects (bottom) relative to the predicted confidence interval.

    0 indicates that the score has the same value as the best fit to the model; multiples of sigma indicate that the score is that many standard deviations above or below the best-fit value. a: S. cerevisiae genes with distances derived from 102 BUSCOs. b: S. cerevisiae genes with distances derived from a randomly selected subset of 15 of the BUSCOs used in a. c: S. cerevisiae genes with distances derived from a second randomly selected subset of 15 of the BUSCOs used in a. d: D. melanogaster genes with distances derived from 125 BUSCOs. e: D. melanogaster genes with distances derived from a randomly selected subset of 15 of the BUSCOs used in d. f: D. melanogaster genes with distances derived from a second randomly selected subset of 15 of the BUSCOs used in d. Data used to generate these figures are available at https://github.com/caraweisman/abSENSE/tree/master/Data_for_supplemental_figures. BUSCO, Benchmarking Universal Single Copy Ortholog.

    (EPS)

    S3 Fig. Correlation between best-fit parameters and gene properties in yeast.

    a: Correlation between each S. cerevisiae protein’s best-fit value of a and its length in amino acids. The a parameter is consistently larger than the length because most identical alignment positions contribute a score larger than 1 under the scoring scheme used here (BLOSUM62). b: Correlation between each S. cerevisiae protein’s best-fit value of b and its relative evolutionary rate in substitutions per site from sensu stricto protein alignments (Methods). Data used to generate these figures are available at https://github.com/caraweisman/abSENSE/tree/master/Data_for_supplemental_figures.

    (EPS)

    S4 Fig. Distribution of best-fit parameter values for all S. cerevisiae proteins.

    a: Distribution of the best-fit a values for all S. cerevisiae proteins. b: Distribution of the best-fit b values for all S. cerevisiae proteins. Data used to generate these figures are available at https://github.com/caraweisman/abSENSE/tree/master/Data_for_supplemental_figures.

    (EPS)

    S5 Fig. Results of “dubious” ORF analysis in S. cerevisiae.

    Top: Distributions of detectability prediction results for all S. cerevisiae lineage-specific genes annotated as of “dubious” coding status in the Saccharomyces Genome Database [41] in 3 yeast lineages (a, b, c). Bottom: Depiction of the lineage (yellow) and closest outgroup (blue) considered in the analyses in the corresponding column. In c, note that Y. lipolytica is the topological outgroup to the shaded lineage, but is not the closest species by evolutionary distance (branch lengths are not to scale). Data used to generate these figures are available at https://github.com/caraweisman/abSENSE/tree/master/Data_for_supplemental_figures. ORF, open reading frame.

    (EPS)

    S1 Supporting Information. Supplemental information.

    Justification for the functional form of the model; effect of site-specific selection pressure; analysis of data in Vakirlis and colleagues (2020).

    (DOCX)

    Attachment

    Submitted filename: review_uploaded.docx

    Attachment

    Submitted filename: reviewer_response.pdf

    Attachment

    Submitted filename: reviewer_response.pdf

    Data Availability Statement

    All data used in these analyses and the scripts necessary to reproduce them are available in the Supporting information and on our code repository at http://www.github.com/caraweisman/abSENSE.


    Articles from PLoS Biology are provided here courtesy of PLOS
