PLOS Biology. 2022 Oct 13;20(10):e3001827. doi: 10.1371/journal.pbio.3001827

OrthoSNAP: A tree splitting and pruning algorithm for retrieving single-copy orthologs from gene family trees

Jacob L Steenwyk 1,2,*, Dayna C Goltz 3, Thomas J Buida III 3, Yuanning Li 1,2,4, Xing-Xing Shen 5, Antonis Rokas 1,2,6,*
Editor: Andreas Hejnol 7
PMCID: PMC9595520  PMID: 36228036

Abstract

Molecular evolution studies, such as phylogenomic studies and genome-wide surveys of selection, often rely on gene families of single-copy orthologs (SC-OGs). Large gene families with multiple homologs in 1 or more species—a phenomenon observed among several important families of genes such as transporters and transcription factors—are often ignored because identifying and retrieving SC-OGs nested within them is challenging. To address this issue and increase the number of markers used in molecular evolution studies, we developed OrthoSNAP, a software that uses a phylogenetic framework to simultaneously split gene families into SC-OGs and prune species-specific inparalogs. We term SC-OGs identified by OrthoSNAP as SNAP-OGs because they are identified using a splitting and pruning procedure analogous to snapping branches on a tree. From 415,129 orthologous groups of genes inferred across 7 eukaryotic phylogenomic datasets, we identified 9,821 SC-OGs; using OrthoSNAP on the remaining 405,308 orthologous groups of genes, we identified an additional 10,704 SNAP-OGs. Comparison of SNAP-OGs and SC-OGs revealed that their phylogenetic information content was similar, even in complex datasets that contain a whole-genome duplication, complex patterns of duplication and loss, transcriptome data where each gene typically has multiple transcripts, and contentious branches in the tree of life. OrthoSNAP is useful for increasing the number of markers used in molecular evolution data matrices, a critical step for robustly inferring and exploring the tree of life.


Molecular evolution studies often rely on single-copy orthologs. This study presents OrthoSNAP, an algorithm that identifies and extracts additional single-copy orthologs nested within larger gene families; OrthoSNAP greatly increases the number of orthologs available for downstream molecular evolution analyses, while retaining phylogenetic information content.

Introduction

Molecular evolution studies, such as species tree inference, genome-wide surveys of selection, evolutionary rate estimation, measures of gene–gene coevolution, and others, typically rely on single-copy orthologs (SC-OGs), groups of homologous genes that originated via speciation and are present in single copy among species of interest [1–6]. In contrast, paralogs (homologous genes that originated via duplication and are often members of large gene families) are typically absent from these studies (Fig 1). Gene families of orthologs and paralogs often encode functionally significant proteins such as transcription factors, transporters, and olfactory receptors [7–10]. The exclusion of SC-OGs from gene families has not only hindered our understanding of their evolution and phylogenetic informativeness but has also artificially reduced the number of gene markers available for molecular evolution studies. Furthermore, as the number of species and/or their evolutionary divergence increases in a dataset, the number of SC-OGs decreases [11,12]; case in point, no SC-OGs were identified in a dataset of 42 plants [11]. As the number of available genomes across the tree of life continues to increase, identifying SC-OGs present in many taxa will become more challenging.

Fig 1. Cartoon depiction of 3 classes of paralogs: outparalogs, inparalogs, and coorthologs.


(A) Paralogs refer to related genes that have originated via gene duplication, such as genes M, N, and O. (B) Outparalogs and inparalogs refer to paralogs that are related to one another via a duplication event that took place prior to or after a speciation event, respectively. With respect to the speciation event that led to the split of taxa A, B, and C from D, genes M, N, and O are outparalogs because they arose prior to the speciation event; genes O1 and O2 in taxa A, B, and C are inparalogs because they arose after the speciation event. Species-specific inparalogs are paralogous genes observed only in 1 species, strain, or organism in a dataset, such as gene N1 and N2 in species A. Species-specific inparalogs N1 and N2 in species A are also coorthologs of gene N in taxa B, C, and D; the same is true for inparalogs O1 and O2 from species A, which are coorthologs of gene O from species D. (C) Cartoon depiction of SNAP-OGs identified by OrthoSNAP.

In light of these issues, several methods have been developed to account for paralogs in specific types of molecular evolution studies, for example, in species tree reconstruction [13]. Methods such as SpeciesRax, STAG, ASTRAL-PRO, and DISCO can be used to infer a species tree from a set of SC-OGs and gene families composed of orthologs and paralogs [11,14–16]. Other methods such as PHYLDOG [17] and guenomu [18] jointly infer the species and gene trees but require abundant computational resources, which has hindered their use for large datasets. Other software, such as PhyloTreePruner, can conduct species-specific inparalog trimming [19]. Agalma, as part of a larger automated phylogenomic workflow, can prune gene trees into maximally inclusive subtrees wherein each species, strain, or organism is represented by 1 sequence [20]. Similarly, OMA identifies subgroups of SC-OGs using graph-based clustering of sequence similarity scores [21]. Although these methods have expanded the numbers of gene markers used in species tree reconstruction, they were not designed to facilitate the retrieval of as broad a set of SC-OGs as possible for downstream molecular evolution studies such as surveys of selection. Furthermore, the phylogenetic information content of these gene families remains unknown, calling into question their usefulness.

To address this need and measure the information content of subgroups of single-copy orthologous genes, we developed OrthoSNAP, a novel algorithm that identifies SC-OGs nested within larger gene families via tree decomposition and species-specific inparalog pruning. We term SC-OGs identified by OrthoSNAP as SNAP-OGs because they were retrieved using a splitting and pruning procedure. The efficacy of OrthoSNAP and the information content of SNAP-OGs was examined across 7 eukaryotic datasets, which include species with complex evolutionary histories (e.g., whole-genome duplication) or complex gene sequence data (e.g., transcriptomes, which typically have multiple transcripts per protein-coding gene). These analyses revealed OrthoSNAP can substantially increase the number of orthologs for downstream analyses such as phylogenomics and surveys of selection. Furthermore, we found that the information content of SNAP-OGs was statistically indistinguishable from that of SC-OGs suggesting the inclusion of SNAP-OGs in downstream analyses is likely to be as informative. These analyses indicate that SNAP-OGs identified by OrthoSNAP hold promise for increasing the number of markers used in molecular evolution studies, which can, in turn, be used for constructing and interpreting the tree of life.

Results

OrthoSNAP is a novel tree traversal algorithm that conducts tree splitting and species-specific inparalog pruning to identify SC-OGs nested within larger gene families (Fig 1C). OrthoSNAP takes as input a gene family phylogeny and associated FASTA file and can output individual FASTA files populated with sequences from SNAP-OGs as well as the associated Newick tree files (Fig 2). During tree traversal, tree uncertainty can be accounted for by OrthoSNAP by collapsing poorly supported branches. In a set of 7 eukaryotic datasets that contained 9,821 SC-OGs, we used OrthoSNAP to identify an additional 10,704 SNAP-OGs. Using a combination of multivariate statistics and phylogenetic measures, we demonstrate that SNAP-OGs and SC-OGs have similar phylogenetic information content in all 7 datasets. This observation was consistent across datasets where the identification of large numbers of SC-OGs is challenging: flowering plants that have complex patterns of gene duplication and loss (15 SC-OGs and 653 SNAP-OGs), a lineage of budding yeasts wherein half of the species have undergone an ancient whole-genome duplication event (2,782 SC-OGs and 1,334 SNAP-OGs), and a dataset of transcriptomes where many genes are represented by multiple transcripts (390 SC-OGs and 2,087 SNAP-OGs). Lastly, similar patterns of support were observed among the 252 SC-OGs and the 1,428 SNAP-OGs in a contentious branch in the tree of life. Taken together, these results suggest that OrthoSNAP is helpful for expanding the set of gene markers available for molecular evolutionary studies, even in datasets where inference of orthology has historically been difficult due to complex evolutionary history or complex data characteristics.
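The splitting idea described above can be illustrated with a minimal sketch. It is not the OrthoSNAP implementation: it assumes a toy tree encoded as nested tuples, hypothetical helper names (`leaves`, `is_single_copy`, `single_copy_subtrees`), and it omits rooting, branch-support collapsing, and inparalog pruning. It only shows how a top-down traversal can return the largest subtrees in which every species appears once and a minimum number of taxa is present.

```python
from typing import List, Union

# Toy encoding (an assumption for illustration): a leaf is a "species|gene"
# label and an internal node is a tuple of child subtrees.
Tree = Union[str, tuple]


def leaves(tree: Tree) -> List[str]:
    """Return every leaf label found under a node."""
    if isinstance(tree, str):
        return [tree]
    found: List[str] = []
    for child in tree:
        found.extend(leaves(child))
    return found


def is_single_copy(tree: Tree) -> bool:
    """True when no species identifier occurs more than once under the node."""
    species = [label.split("|", 1)[0] for label in leaves(tree)]
    return len(species) == len(set(species))


def single_copy_subtrees(tree: Tree, min_taxa: int) -> List[List[str]]:
    """Collect the largest single-copy subtrees that hold at least min_taxa tips."""
    if is_single_copy(tree):
        tips = sorted(leaves(tree))
        return [tips] if len(tips) >= min_taxa else []
    groups: List[List[str]] = []
    for child in tree:  # multicopy node: descend into each child
        groups.extend(single_copy_subtrees(child, min_taxa))
    return groups


# Hypothetical gene family: an ancient duplication produced copies m and n.
toy = (
    (("A|m", "B|m"), ("C|m", "D|m")),
    (("A|n", "B|n"), ("C|n", "D|n")),
)
groups = single_copy_subtrees(toy, min_taxa=2)
```

In this toy case the root is multicopy, so the traversal descends and returns the two 4-taxon subtrees as separate candidate groups.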

Fig 2. Cartoon depiction of OrthoSNAP workflow.


(A) OrthoSNAP takes as input 2 files: a FASTA file of a gene family with multiple homologs observed in 1 or more species and the associated gene family tree. The outputted file(s) will be individual FASTA files of SNAP-OGs. Depending on user arguments, individual Newick tree files can also be outputted. (B) A cartoon phylogenetic tree that depicts the evolutionary history of a gene family and 5 SNAP-OGs therein. While identifying SNAP-OGs, OrthoSNAP also identifies and prunes species-specific inparalogs (e.g., species2|gene2-copy_0 and species2|gene2-copy_1), retaining only the inparalog with the longest sequence, a practice common in transcriptomics. Note that OrthoSNAP requires sequence names to be identical in the FASTA file and the tree file and to follow the convention in which a species, strain, or organism identifier and a gene identifier are separated by a pipe (vertical bar; “|”) character.
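The naming convention and the longest-sequence rule in the caption can be sketched as follows. This is a simplified illustration, not OrthoSNAP's code: the function name `prune_inparalogs` and the sequences are hypothetical, and the sketch assumes the caller has already established that the duplicated copies within a species are species-specific inparalogs.

```python
from typing import Dict


def prune_inparalogs(seqs: Dict[str, str]) -> Dict[str, str]:
    """Keep one sequence per species, preferring the longest copy.

    Headers are expected to look like "species|gene". Ties keep the copy
    that was seen first.
    """
    best: Dict[str, str] = {}  # species -> header of the longest copy so far
    for header, seq in seqs.items():
        species, _gene = header.split("|", 1)
        if species not in best or len(seq) > len(seqs[best[species]]):
            best[species] = header
    return {header: seqs[header] for header in best.values()}


# Hypothetical sequences for illustration only.
clade = {
    "species1|gene2": "MKTA",
    "species2|gene2-copy_0": "MKT",
    "species2|gene2-copy_1": "MKTAYL",
}
pruned = prune_inparalogs(clade)
```

Here `species2|gene2-copy_1` is retained because it is the longer of the two species2 copies.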

SC-OGs and SNAP-OGs have similar information content

To compare SC-OGs and SNAP-OGs, we first independently inferred orthologous groups of genes in 3 eukaryotic datasets of 24 budding yeasts (none of which have undergone whole-genome duplication), 36 filamentous fungi (Aspergillus and Penicillium species), and 26 mammals including humans, dogs, pigs, elephants, sloths, and others (S1 Table). There was variation in the number of SC-OGs and SNAP-OGs in each lineage (S1 Fig and S2 Table). Interestingly, the ratio of SNAP-OGs: SC-OGs among budding yeasts, filamentous fungi, and mammals was 0.83 (1,392: 1,668), 0.46 (2,035: 4,393), and 5.53 (1,775: 321), respectively, indicating SNAP-OGs can substantially increase the number of gene markers in certain lineages. The number of SNAP-OGs identified in a gene family with multiple homologs in 1 or more species also varied (S2 Fig).
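As a quick check of the ratios quoted above, the arithmetic can be reproduced directly from the reported counts. The dictionary layout is only an illustrative choice.

```python
# (SNAP-OGs, SC-OGs) per dataset, as reported in the text.
counts = {
    "budding yeasts": (1392, 1668),
    "filamentous fungi": (2035, 4393),
    "mammals": (1775, 321),
}

# Ratio of SNAP-OGs to SC-OGs, rounded to 2 decimal places.
ratios = {name: round(snap / sc, 2) for name, (snap, sc) in counts.items()}
```

A ratio above 1, as in mammals, means OrthoSNAP recovered more markers than were available as SC-OGs alone.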

Similar orthogroup occupancy and best-fitting models of substitutions were observed among SC-OGs and SNAP-OGs (S3 Fig and S3 Table), raising the question of whether SC-OGs and SNAP-OGs have similar information content. To answer this, the information content among multiple sequence alignments and phylogenetic trees from SC-OGs and SNAP-OGs (S4 Fig and S4 Table) was compared across 9 properties—Robinson–Foulds distance [22], relative composition variability [23], and average bootstrap support, for example—using multivariate analysis and statistics as well as information theory-based phylogenetic measures. Principal component analysis enabled qualitative comparisons between SC-OGs and SNAP-OGs in reduced dimensional space and revealed a high degree of similarity (Figs 3 and S5). Multivariate statistics—namely, multifactor analysis of variance—facilitated a quantitative comparison of SC-OGs and SNAP-OGs and revealed no difference between SC-OGs and SNAP-OGs (p = 0.63, F = 0.23, df = 1; S5 Table) and no interaction between the 9 properties and SC-OGs and SNAP-OGs (p = 0.16, F = 1.46, df = 8). Similarly, multifactor analysis of variance using an additive model, which assumes each factor is independent and there are no interactions (as observed here), also revealed no differences between SC-OGs and SNAP-OGs (p = 0.65, F = 0.21, df = 1). Next, we calculated tree certainty, an information theory-based measure of tree congruence from a set of gene trees, and found similar levels of congruence among phylogenetic trees inferred from SC-OGs and SNAP-OGs (S6 Table). Taken together, these analyses demonstrate that SC-OGs and SNAP-OGs have similar phylogenetic information content.

Fig 3. SC-OGs and SNAP-OGs have similar phylogenetic information content.


To evaluate similarities and differences between SC-OGs (orange dots) and SNAP-OGs (blue dots), we examined each gene’s phylogenetic information content by measuring 9 properties of multiple-sequence alignments and phylogenetic trees. We performed these analyses on 12,764 gene families from 3 datasets: 24 budding yeasts (1,668 SC-OGs and 1,392 SNAP-OGs) (A), 36 filamentous fungi (4,393 SC-OGs and 2,035 SNAP-OGs) (B), and 26 mammals (321 SC-OGs and 1,775 SNAP-OGs) (C). Principal component analysis revealed striking similarities between SC-OGs and SNAP-OGs in all 3 datasets. For example, the centroids (i.e., the mean across all metrics and genes) for SC-OGs and SNAP-OGs, which are depicted as opaque and larger dots, are very close to one another in reduced dimensional space. Supporting this observation, multifactor analysis of variance with interaction effects of the 6,630 SNAP-OGs and 6,634 SC-OGs revealed no difference between SC-OGs and SNAP-OGs (p = 0.63, F = 0.23, df = 1) and no interaction between the 9 properties and SC-OGs and SNAP-OGs (p = 0.16, F = 1.46, df = 8). Multifactor analysis of variance using an additive model yielded similar results wherein SC-OGs and SNAP-OGs do not differ (p = 0.65, F = 0.21, df = 1). There are also very few outliers of individual SC-OGs and SNAP-OGs, which are represented as translucent dots, in all 3 panels. For example, SNAP-OG outliers at the top of panel C are driven by high treeness/RCV values, which are associated with a high signal-to-noise ratio and/or low composition bias [23]; SNAP-OG outliers at the right of panel C are driven by high average bootstrap support values, which are associated with greater tree certainty [74]; and the single SC-OG outlier observed in the bottom right of panel C is driven by a SC-OG with a high degree of violation of a molecular clock [78], which is associated with lower tree certainty [79]. 
Multiple-sequence alignment and phylogenetic tree properties used in principal component analysis and abbreviations thereof are as follows: average bootstrap support (ABS), degree of violation of the molecular clock (DVMC), relative composition variability (RCV), Robinson–Foulds distance (RF distance), alignment length (Aln. len.), the number of parsimony informative sites (PI sites), saturation, treeness (tness), and treeness/RCV (tness/RCV). The data underlying this figure can be found in figshare (doi: 10.6084/m9.figshare.16875904).

We next aimed to determine if SC-OGs and SNAP-OGs have greater phylogenetic information content than a random null expectation. Groups of genes reflecting a random null expectation were constructed by randomly selecting a single sequence from representative species in multicopy orthologous genes (hereafter referred to as Random-GGs for random combinations of orthologous and paralogous groups of genes) in the budding yeast (N = 647), filamentous fungi (N = 999), and mammalian (N = 954) datasets. Random-GGs were aligned and trimmed, and phylogenetic trees were inferred from the resulting multiple sequence alignments. Random-GG phylogenetic information was also calculated. Across each dataset, significant differences were observed among SC-OGs, SNAP-OGs, and Random-GGs (p < 0.001, F = 189.92, df = 4; Multifactor ANOVA). Further examination of differences revealed Random-GGs are significantly different compared to SC-OGs and SNAP-OGs (p < 0.001 for both comparisons; Tukey honest significant differences (THSD) test) in the budding yeast dataset. In contrast, SC-OGs and SNAP-OGs are not significantly different (p = 0.42; THSD). The same was also true for the datasets of filamentous fungi and mammals: specifically, Random-GGs were significantly different from SC-OGs and SNAP-OGs (p < 0.001 for each comparison in each dataset; THSD), whereas SC-OGs and SNAP-OGs were not significantly different (p = 1.00 for filamentous fungi dataset; p = 0.42 for dataset of mammals; THSD). Principal component analysis revealed that Robinson–Foulds distances (a measure of tree accuracy wherein lower values represent greater tree accuracy) and relative composition variability (a measure of alignment composition bias wherein lower values represent less compositional bias) often drove differences among Random-GGs, SC-OGs, and SNAP-OGs across the datasets. In all datasets, SC-OGs and SNAP-OGs outperformed the null expectation in tree accuracy and were less compositionally biased (Table 1). 
These findings suggest SNAP-OGs and SC-OGs are similar in phylogenetic information content and outperform the null expectation.
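The construction of a random null group can be sketched as below. This is an assumption-laden illustration rather than the authors' script: it assumes "species|gene" headers, draws exactly one copy per species, and uses a hypothetical function name (`random_gene_group`) with a seeded generator for reproducibility.

```python
import random
from collections import defaultdict
from typing import Dict, List


def random_gene_group(headers: List[str], rng: random.Random) -> List[str]:
    """Pick one sequence header at random for each species in a multicopy family."""
    by_species: Dict[str, List[str]] = defaultdict(list)
    for header in headers:
        by_species[header.split("|", 1)[0]].append(header)
    # Iterate species in sorted order so results depend only on the seed.
    return [rng.choice(copies) for _species, copies in sorted(by_species.items())]


# Hypothetical multicopy family for illustration.
headers = ["A|g1", "A|g2", "B|g1", "C|g1", "C|g2", "C|g3"]
group = random_gene_group(headers, random.Random(0))
```

Because copies are drawn without regard to orthology, such a group can mix orthologs and paralogs, which is the property the null expectation is meant to capture.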

Table 1. SC-OGs and SNAP-OGs are more accurate and have less compositional biases than Random-GGs.

Dataset | OG type | RF distance | RCV
Budding yeasts | SC-OGs | 0.19 ± 0.12 | 0.19 ± 0.05
Budding yeasts | SNAP-OGs | 0.18 ± 0.11 | 0.18 ± 0.06
Budding yeasts | Random-GGs | 0.65 ± 0.27 | 0.27 ± 0.13
Filamentous fungi | SC-OGs | 0.27 ± 0.13 | 0.12 ± 0.05
Filamentous fungi | SNAP-OGs | 0.27 ± 0.12 | 0.12 ± 0.06
Filamentous fungi | Random-GGs | 0.87 ± 0.11 | 0.21 ± 0.13
Mammals | SC-OGs | 0.56 ± 0.22 | 0.13 ± 0.06
Mammals | SNAP-OGs | 0.51 ± 0.23 | 0.11 ± 0.07
Mammals | Random-GGs | 0.61 ± 0.30 | 0.15 ± 0.10

The first column is the dataset being examined. The second column describes the type of group of genes. The third column is the Robinson–Foulds distances, a measure of tree distance wherein higher values reflect greater inaccuracies. The fourth column is the relative composition variability, a measure of alignment composition bias wherein higher values indicate greater biases. In all datasets, SC-OGs and SNAP-OGs had better scores compared to a null expectation.

RCV, relative composition variability; RF, Robinson–Foulds distance; SC-OG, single-copy ortholog.

Values represent mean and standard deviations.
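For readers unfamiliar with RCV, the sketch below computes it under the commonly used definition, which we assume matches the cited measure [23]: the sum over taxa and characters of the absolute deviation of each character count from its across-taxon mean, divided by the number of taxa times the alignment length. The function name `rcv` is illustrative, and gap characters are treated like any other symbol, which is a simplification.

```python
from collections import Counter
from typing import List


def rcv(alignment: List[str]) -> float:
    """Relative composition variability of equal-length aligned sequences."""
    n_taxa = len(alignment)
    aln_len = len(alignment[0])
    counts = [Counter(seq) for seq in alignment]  # per-taxon character counts
    symbols = set().union(*counts)                # every character observed
    total = 0.0
    for sym in symbols:
        mean = sum(c[sym] for c in counts) / n_taxa
        total += sum(abs(c[sym] - mean) for c in counts)
    return total / (n_taxa * aln_len)


# Toy alignments: identical composition versus completely different composition.
uniform = rcv(["ACGT", "TGCA"])
skewed = rcv(["AAAA", "CCCC"])
```

The first toy alignment has identical composition in both taxa and scores 0, whereas the second shares no characters between taxa and scores 1, consistent with higher values indicating greater compositional bias.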

SC-OGs and SNAP-OGs have similar performances in complex datasets

Complex biological processes and datasets pose a serious challenge for identifying markers for molecular evolution studies. To test the efficacy of OrthoSNAP in scenarios of complex evolutionary histories and datasets, we executed the same workflow described above (ortholog calling, sequence alignment, trimming, tree inference, and SNAP-OG detection) on 3 new datasets: (1) 30 plants known to have complex histories of gene duplication and loss [24–26]; (2) 30 budding yeast species wherein half of the species originated from a hybridization event that gave rise to a whole-genome duplication followed by complex patterns of loss and duplication [27–30]; and (3) 20 choanoflagellate transcriptomes, which contain thousands more transcripts than genes [31,32]; for orthology inference software, multiple transcripts per gene appear similar to artificial gene duplicates.

Corroborating previous results, OrthoSNAP successfully identified SNAP-OGs that can be used downstream for molecular evolution analyses. Specifically, using a species-occupancy threshold of 50% in the plant, budding yeast, and choanoflagellate datasets, 653, 1,334, and 2,087 SNAP-OGs were identified, respectively (Table 2). In comparison, 15 SC-OGs were identified in the plant dataset; 2,782 in the budding yeast dataset; and 390 in the choanoflagellate dataset. (Note that there are likely more SC-OGs than SNAP-OGs in budding yeasts because their genomes are relatively small and therefore do not have as many duplicate gene copies compared to other lineages, such as plants. Nonetheless, OrthoSNAP still substantially increases the number of markers in a phylogenomic data matrix.) To explore the impact of orthogroup occupancy, SNAP-OGs were also identified using a minimum occupancy threshold of 4 taxa. This resulted in the identification of substantially more SNAP-OGs: 15,854 in plants; 4,199 in budding yeasts; and 11,556 in choanoflagellates. Furthermore, these were substantially higher than the number of SC-OGs identified using a minimum orthogroup occupancy of 4 taxa: 200 in plants; 3,566 in budding yeasts; and 2,438 in choanoflagellates. These findings support previous observations that incorporating OrthoSNAP into ortholog identification workflows can substantially increase the number of available loci.
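The effect of the occupancy threshold can be illustrated with a small filter. This is a sketch with hypothetical group names and taxon sets; treating the threshold as inclusive (at least `min_taxa` taxa) is an assumption.

```python
from typing import Dict, List, Set


def filter_by_occupancy(groups: Dict[str, Set[str]], min_taxa: int) -> List[str]:
    """Return names of groups represented by at least min_taxa distinct taxa."""
    return [name for name, taxa in groups.items() if len(taxa) >= min_taxa]


# Hypothetical candidate groups from a 30-taxon dataset.
groups = {
    "OG1": {f"sp{i}" for i in range(20)},  # 20 taxa
    "OG2": {f"sp{i}" for i in range(6)},   # 6 taxa
    "OG3": {f"sp{i}" for i in range(3)},   # 3 taxa
}

half = filter_by_occupancy(groups, min_taxa=15)   # 50% of 30 taxa
relaxed = filter_by_occupancy(groups, min_taxa=4)  # minimum of 4 taxa
```

Relaxing the threshold admits smaller groups, which is why the 4-taxon setting yields many more SNAP-OGs than the 50% setting.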

Table 2. OrthoSNAP identifies SNAP-OGs in complex datasets.

Dataset | Challenge | Total OGs | SC-OGs (50% min. occupancy threshold) | SNAP-OGs (50% min. occupancy threshold) | SC-OGs (4 species min. threshold) | SNAP-OGs (4 species min. threshold)
Plants (N = 30) | Evolutionary histories with extensive gene duplication and loss events | 83,034 | 15 | 653 | 200 | 15,854
Budding yeasts (N = 30) | Half of the species used experienced hybridization and whole-genome duplication followed by extensive loss of paralogs | 11,422 | 2,782 | 1,334 | 3,566 | 4,199
Choanoflagellates (N = 20) | Transcriptomes, where often multiple transcripts correspond to a single protein-coding gene | 274,028 | 390 | 2,087 | 2,438 | 11,556

SC-OG identification can be difficult due to complex evolutionary histories (e.g., hybridization, whole-genome duplication, and complex patterns of gene duplication and loss such as in the datasets of budding yeasts and plants) and analytical artifacts (e.g., transcriptomes with more transcripts than genes such as the choanoflagellate dataset). OrthoSNAP successfully identified SNAP-OGs in each dataset. Lowering the occupancy threshold of a SNAP-OG to a minimum of 4 enabled the identification of substantially more SNAP-OGs.

SC-OGs and SNAP-OGs have similar patterns of support in a contentious branch in the tree of life

To further evaluate the information content of SNAP-OGs, we compared patterns of support among SC-OGs and SNAP-OGs in a difficult-to-resolve branch in the tree of life. Specifically, we evaluated the support between 3 hypotheses concerning deep evolutionary relationships among eutherian mammals: (1) Xenarthra (eutherian mammals from the Americas) and Afrotheria (eutherian mammals from Africa) are sister to all other Eutheria [33,34]; (2) Afrotheria are sister to all other Eutheria [35,36]; and (3) Xenarthra are sister to a clade of both Afrotheria and Eutheria (Fig 4A). Resolution of this conflict has important implications for understanding the historical biogeography of these organisms. To do so, we first obtained protein-coding gene sequences from 6 Afrotheria, 2 Xenarthra, 12 other Eutheria, and 8 outgroup taxa from NCBI (S7 Table), which represent all annotated and publicly available genome assemblies at the time of this study (S8 Table). Using the protein translations of these gene sequences as input to OrthoFinder, we identified 252 SC-OGs shared across taxa; application of OrthoSNAP identified an additional 1,428 SNAP-OGs, which represents a greater than 5-fold increase in the number of gene markers for this dataset (S8 Table). There was variation in the number of SNAP-OGs identified per orthologous group of genes (S6 Fig). The highest number of SNAP-OGs identified in an orthologous group of genes was 10, which was a gene family of olfactory receptors; olfactory receptors are known to have expanded in the evolutionary history of eutherian mammals [8]. The best-fitting substitution models were similar between SC-OGs and SNAP-OGs (S7 Fig).

Fig 4. SC-OGs and SNAP-OGs display similar patterns of support in a contentious branch concerning deep evolutionary relationships among eutherian mammals.


(A) Two leading hypotheses for the evolutionary relationships among Eutheria, which have implications for the evolution and biogeography of the clade, are that Afrotheria and Xenarthra are sister to all other Eutheria (hypothesis 1; blue) and that Afrotheria are sister to all other Eutheria (hypothesis 2; pink). The third possible, but less well-supported topology, is that Xenarthra are sister to Eutheria and Afrotheria. (B) Comparison of gene support frequency (GSF) values for the 3 hypotheses among 252 SC-OGs and 1,428 SNAP-OGs using an α level of 0.01 revealed no differences in support (p = 0.26, Fisher’s exact test with Benjamini–Hochberg multitest correction). Comparison after accounting for gene tree uncertainty by collapsing bipartitions with ultrafast bootstrap approximation support lower than 75 (SC-OGs collapsed vs. SNAP-OGs collapsed) also revealed no differences (p = 0.05; Fisher’s exact test with Benjamini–Hochberg multitest correction). (C) Examination of the distribution of frequency of topology support using gene-wise log-likelihood scores revealed no difference between SNAP-OGs and SC-OGs support for the 3 topologies (p = 0.52; Fisher’s exact test). The data underlying this figure can be found in figshare (doi: 10.6084/m9.figshare.16875904).

Two independent tests examining support between alternative hypotheses of deep evolutionary relationships among eutherian mammals revealed similar patterns of support between SC-OGs and SNAP-OGs. More specifically, no differences were observed in gene support frequencies, defined as the number of genes that support 1 of 3 possible hypotheses at a given branch in a phylogeny, without or with accounting for single-gene tree uncertainty by collapsing branches with low support values (p = 0.26 and p = 0.05, respectively; Fisher’s exact test with Benjamini–Hochberg multitest correction; Fig 4B and S9 Table). A second test of single-gene support was conducted wherein individual gene log likelihoods were calculated for each of the 3 possible topologies. The frequency of gene-wise support for each topology was determined. No differences were observed in gene support frequency using the log likelihood approach (p = 0.52; Fisher’s exact test). Examination of patterns of support in a contentious branch in the tree of life using 2 independent tests revealed SC-OGs and SNAP-OGs are similar and further supports the observation that they contain similar phylogenetic information.
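The gene-wise log-likelihood tally described above can be sketched as follows. The log-likelihood values and the function name `topology_support` are hypothetical, and the sketch simply assigns each gene to the topology with the highest log likelihood (ties go to the lowest topology index); it does not perform the subsequent Fisher's exact test.

```python
from collections import Counter
from typing import Dict, List


def topology_support(gene_lnl: Dict[str, List[float]]) -> Counter:
    """Count genes whose highest log likelihood falls on each topology index."""
    tally: Counter = Counter()
    for lnls in gene_lnl.values():
        best = max(range(len(lnls)), key=lnls.__getitem__)
        tally[best] += 1
    return tally


# Hypothetical per-gene log likelihoods for topologies 0, 1, and 2.
example = {
    "gene1": [-1000.2, -1003.5, -1004.1],
    "gene2": [-512.0, -510.3, -511.9],
    "gene3": [-88.1, -88.0, -90.0],
}
support = topology_support(example)
```

Tallies computed separately for SC-OGs and SNAP-OGs could then be arranged in a contingency table and compared with a test of independence.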

In summary, 415,129 orthologous groups of genes across 7 eukaryotic datasets contained 9,821 SC-OGs; application of OrthoSNAP identified an additional 10,704 SNAP-OGs, thereby more than doubling the number of gene markers. Comprehensive comparison of the phylogenetic information content among SC-OGs and SNAP-OGs revealed no differences in phylogenetic information content. Strikingly, this observation held true across datasets with complex evolutionary histories and when conducting hypothesis testing in a difficult-to-resolve branch in the tree of life. These findings suggest that SNAP-OGs may be useful for diverse studies of molecular evolution, including genome-wide surveys of selection, phylogenomic investigations, and gene–gene coevolution analyses.
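The totals in this summary follow from the per-dataset counts reported in the preceding sections, as the short check below shows. The dataset labels are shorthand introduced here for illustration.

```python
# Per-dataset counts as reported in the Results: (SC-OGs, SNAP-OGs).
per_dataset = {
    "budding yeasts (24 taxa)": (1668, 1392),
    "filamentous fungi": (4393, 2035),
    "mammals (26 taxa)": (321, 1775),
    "plants": (15, 653),
    "budding yeasts (30 taxa)": (2782, 1334),
    "choanoflagellates": (390, 2087),
    "eutherian mammals": (252, 1428),
}

total_sc = sum(sc for sc, _snap in per_dataset.values())
total_snap = sum(snap for _sc, snap in per_dataset.values())

# Fold change in marker count after adding SNAP-OGs to SC-OGs.
fold_increase = round((total_sc + total_snap) / total_sc, 2)

# Orthologous groups that were not already SC-OGs.
remaining_groups = 415129 - total_sc
```

The fold increase of roughly 2.09 corresponds to the statement that the number of gene markers more than doubled.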

Discussion

Molecular evolution studies typically rely on SC-OGs. Recently developed methods can integrate gene families of orthologs and paralogs into species tree inference but are not designed to broadly facilitate the retrieval of gene markers for molecular evolution analyses. Furthermore, the phylogenetic information content of gene families of orthologs and paralogs remains unknown. This observation underscores the need for algorithms that can identify SC-OGs nested within larger gene families, which can, in turn, be incorporated into diverse molecular evolution analyses, and a comprehensive assessment of their phylogenetic properties.

To address this need, we developed OrthoSNAP, a tree splitting and pruning algorithm that identifies SNAP-OGs, which are SC-OGs nested within larger gene families wherein species-specific inparalogs have also been pruned. Comprehensive examination of the phylogenetic information content of SNAP-OGs and SC-OGs from 7 empirical datasets of diverse eukaryotic species revealed that their content is similar. Inclusion of SNAP-OGs increased the size of all 7 datasets, sometimes substantially. We note that our results are qualitatively similar to those reported recently by Smith and colleagues [37], who retrieved SC-OGs nested within larger families from 26 primates and examined their performance in gene tree and species tree inference. Three noteworthy differences are that we also conduct species-specific inparalog trimming, provide user-friendly command-line software for SNAP-OG identification, and evaluate the phylogenetic information content of SNAP-OGs and SC-OGs across 7 diverse phylogenomic datasets. We also note that our algorithm can account for diverse types of paralogy (outparalogs, inparalogs, and species-specific inparalogs), whereas other software, such as PhyloTreePruner, which only conducts species-specific inparalog trimming [19], and Agalma, which identifies single-copy outparalogs and inparalogs [20], can account for some, but not all, types of paralogs (S10 Table). Another difference between OrthoSNAP and other approaches is that Agalma and PhyloTreePruner both require rooted phylogenies. In contrast, OrthoSNAP will automatically midpoint root phylogenies or accept prerooted phylogenies as input. Furthermore, these algorithms are not designed to handle transcriptomic data wherein multiple transcripts per gene will be interpreted as multicopy orthologs. Thus, OrthoSNAP allows for greater user flexibility and accounts for more diverse scenarios, leading to, at least in some instances, the identification of more loci for downstream analyses (S8 Fig). 
Notably, these programs also differ from sequence similarity graph-based inferences of subgroups of single-copy orthologous genes, such as the algorithm implemented in OMA [21]. In other words, OrthoSNAP identifies subgroups of single-copy orthologous genes by examining evolutionary histories, rather than sequence similarity values. Moreover, examination of evolutionary histories facilitates the identification of species-specific inparalogs. Finally, our results, together with other studies, demonstrate the utility of SC-OGs that are nested within larger families [15,20,37,38].

Despite the ability of OrthoSNAP to identify additional loci for molecular evolution analyses, there were instances wherein SNAP-OGs were not identified in multicopy orthologous groups of genes. We discuss 3 reasons that contribute to why SNAP-OGs could not be identified among some genes: gene families with sequence data from <50% of the taxa; gene families with complex evolutionary histories (for example, HGT and duplication/loss patterns); and gene families with evolutionary histories that differ from the species tree (for example, due to analytical factors, such as sampling and systematic error, or biological factors, such as lineage sorting or introgression/hybridization [39–41]). Notably, the first reason can, but does not always, result in inability to infer SNAP-OGs and can be, to a certain extent, addressed (e.g., by lowering the orthogroup occupancy threshold in OrthoSNAP), whereas the other 2 reasons are more challenging because they often result in a genuine absence of SC-OGs. Furthermore, the actual number of SC-OGs (either those nested within multicopy orthologs or not) for any given group of organisms is not known, making it difficult to determine how many SNAP-OGs and SC-OGs one should expect to recover. Notably, this issue has long challenged researchers, even when ortholog identification is performed by also taking genome synteny into account [27].

Next, we discuss some practical considerations when using OrthoSNAP. In the present study, we inferred orthology information using OrthoFinder [42], but several other approaches can be used upstream of OrthoSNAP. For example, other graph-based algorithms such as OrthoMCL and OMA [21,43] or sequence similarity-based algorithms such as orthofisher [44] can be used to infer gene families. Similarly, sequence similarity search algorithms like BLAST+ [45], USEARCH [46], and HMMER [47] can be used to retrieve homologous sets of sequences that are used as input for OrthoSNAP. Other considerations should also be taken during the multicopy tree inference step. For example, inferring phylogenies for all orthologous groups of genes may be a computationally expensive task. Rapid tree inference software—such as FastTree or IQTREE with the “-fast” parameter [48,49]—may expedite these steps (but users should be aware that this may result in a loss of accuracy in inference; [50]).

We suggest employing “best practices” when inferring groups of putatively orthologous genes, including SNAP-OGs. Specifically, orthology information can be further scrutinized using phylogenetic methods. Because OrthoSNAP operates on gene families inferred by other software, SNAP-OGs may inherit orthology inference errors made during upstream clustering of putatively orthologous genes. One method to identify putatively spurious orthology inference is by identifying long terminal branches [51]. Terminal branches of outlier length can be identified using the “spurious_sequence” function in PhyKIT [52]. Other tools, such as PhyloFisher, UPhO, and other orthology inference pipelines, employ similar strategies to refine orthology inference [53–55]. Lastly, we acknowledge that future iterations of OrthoSNAP may benefit from incorporating additional layers of information, such as sequence similarity scores or synteny. Even though OrthoSNAP did identify SNAP-OGs in some complex datasets where synteny has previously been very helpful, such as the budding yeast dataset, other ancient and rapidly evolving lineages may benefit from synteny analysis to dissect complex relationships of orthology [51,56–58].
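The long-terminal-branch screen mentioned above can be illustrated with a minimal Python sketch. It is not PhyKIT's implementation: the function name, the input format (a dictionary of terminal branch lengths keyed by sequence label), and the cutoff of 20 times the median terminal branch length are assumptions made here for illustration only.

```python
from statistics import median


def flag_long_terminal_branches(tip_lengths, factor=20.0):
    """Return labels whose terminal branch is >= factor x the median length.

    tip_lengths: dict mapping a sequence label to its terminal branch length.
    factor: hypothetical cutoff multiplier (an assumption, not a PhyKIT default
    that this sketch relies on).
    """
    if not tip_lengths:
        return []
    threshold = factor * median(tip_lengths.values())
    if threshold == 0:
        # All-zero branch lengths give no basis for calling an outlier.
        return []
    return sorted(
        label for label, length in tip_lengths.items() if length >= threshold
    )


# Toy gene tree: three short terminal branches and one suspiciously long one.
example_tips = {
    "spA|g1": 0.10,
    "spB|g1": 0.12,
    "spC|g1": 0.09,
    "spD|g1": 4.80,
}
flagged = flag_long_terminal_branches(example_tips)
```

Flagged sequences are candidates for manual inspection rather than automatic removal, because a long branch can also reflect genuinely rapid evolution.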

Taken together, we suggest that OrthoSNAP is useful for retrieving single-copy orthologous groups of genes from gene family data and that the identified SNAP-OGs have similar phylogenetic information content compared to SC-OGs. In combination with other phylogenomic toolkits, OrthoSNAP may be helpful for reconstructing the tree of life and expanding our understanding of the tempo and mode of evolution therein.

Methods

OrthoSNAP availability and documentation

OrthoSNAP is available under the MIT license from GitHub (https://github.com/JLSteenwyk/orthosnap), PyPI (https://pypi.org/project/orthosnap), and the Anaconda cloud (https://anaconda.org/JLSteenwyk/orthosnap). OrthoSNAP is also freely available to use via the LatchBio (https://latch.bio/) cloud-based console (dedicated interface link: https://console.latch.bio/explore/65606/info). The documentation describes the OrthoSNAP algorithm and its parameters and provides user tutorials (https://jlsteenwyk.com/orthosnap).

OrthoSNAP algorithm description and usage

We next describe how OrthoSNAP identifies SNAP-OGs. OrthoSNAP requires 2 files as input: one is a FASTA file that contains 2 or more homologous sequences in 1 or more species and the other is the corresponding gene family phylogeny in Newick format. In both the FASTA and Newick files, users must follow a naming scheme, wherein species, strain, or organism identifiers and gene sequence identifiers are separated by a vertical bar (also known as a pipe character or “|”), which allows OrthoSNAP to determine which sequences were encoded in the genome of each species, strain, or organism. After initiating OrthoSNAP, the gene family phylogeny is first midpoint rooted (unless the user specifies the inputted phylogeny is already rooted) and then SNAP-OGs are identified using a tree-traversal algorithm. To do so, OrthoSNAP will loop through the internal branches in the gene family phylogeny and evaluate the number of distinct taxon identifiers among children terminal branches. If the number of unique taxon identifiers is greater than or equal to the orthogroup occupancy threshold (default: 50% of total taxa in the inputted phylogeny; users can specify an integer threshold), then all children branches and termini are examined further; otherwise, OrthoSNAP will examine the next internal branch. Next, OrthoSNAP will collapse branches with low support (default: 80, which is motivated by using ultrafast bootstrap approximations [59] to evaluate bipartition support; users can specify an integer threshold) and conduct species-specific inparalog trimming wherein the longest sequence is maintained, a practice common in transcriptomics. However, users can specify whether the shortest sequence or the median sequence (in the case of 3 or more sequences) should be kept instead. Users can also pick which species-specific inparalog to keep based on branch lengths (the longest, shortest, or median branch length in the case of having 3 or more sequences).
Species-specific inparalogs are defined as sequences encoded in the same genome that are sister to one another or belong to the same polytomy [19]. The resulting set of sequences is examined to determine if each species, strain, or organism is represented by 1 sequence and to ensure that these sequences have not yet been assigned to a SNAP-OG. If so, they are considered a SNAP-OG; if not, OrthoSNAP will examine the next internal branch. When SNAP-OGs are identified, FASTA files of SNAP-OG sequences are outputted. Users can also output the subtree of the SNAP-OG using an additional argument.
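To make the naming scheme and two of the checks described above concrete, the following Python sketch parses “taxon|gene” labels, tests the orthogroup occupancy criterion, and keeps the longest sequence among species-specific inparalogs. The helper names (taxon_of, meets_occupancy, keep_longest) and the toy labels are hypothetical; this is not OrthoSNAP's source code, and tie-breaking by label order is an assumption.

```python
def taxon_of(label):
    """Extract the taxon identifier from a 'taxon|gene' label."""
    taxon, separator, gene = label.partition("|")
    if not separator or not taxon or not gene:
        raise ValueError(f"label does not follow 'taxon|gene': {label!r}")
    return taxon


def meets_occupancy(subtree_labels, all_labels, fraction=0.5):
    """True if the subtree holds at least `fraction` of all taxa in the tree."""
    total_taxa = {taxon_of(label) for label in all_labels}
    subtree_taxa = {taxon_of(label) for label in subtree_labels}
    return len(subtree_taxa) >= fraction * len(total_taxa)


def keep_longest(inparalogs):
    """Pick the label of the longest sequence among species-specific inparalogs.

    inparalogs: dict mapping label -> sequence string. Ties are broken by
    label order, which is an arbitrary choice made for this sketch.
    """
    return max(sorted(inparalogs), key=lambda label: len(inparalogs[label]))


# Toy gene family with 4 taxa; Scer has two copies.
family = ["Scer|g1", "Scer|g2", "Spar|g1", "Smik|g1", "Sbay|g1"]
two_taxa_ok = meets_occupancy(["Scer|g1", "Spar|g1"], family)   # 2 of 4 taxa
one_taxon_ok = meets_occupancy(["Scer|g1", "Scer|g2"], family)  # 1 of 4 taxa
kept = keep_longest({"Scer|g1": "MKT", "Scer|g2": "MKTAYI"})
```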

The principles of the OrthoSNAP algorithm are also described using the following pseudocode:

FOR internal branch in midpoint rooted gene family phylogeny:
    IF orthogroup occupancy among children termini is greater than or equal to orthogroup occupancy threshold:
        Collapse poorly supported bipartitions and trim species-specific inparalogs
        IF each species, strain, or organism among the trimmed set of species, strains, or organisms is represented by only one sequence and no sequences being examined have been assigned to a SNAP-OG yet:
            Sequences represent a SNAP-OG and are outputted to a FASTA file
        ELSE:
            examine next internal branch
    ELSE:
        examine next internal branch
ENDFOR
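The pseudocode can also be followed on a toy example. The Python sketch below is a simplified illustration, not OrthoSNAP's implementation: it assumes the tree is already rooted and represented as nested tuples of “taxon|gene” labels, visits internal nodes root-first (the traversal order is an assumption), omits the collapsing of poorly supported bipartitions, and keeps the longest sequence among species-specific inparalogs. All function names are hypothetical.

```python
def taxon_of(label):
    return label.split("|", 1)[0]


def leaves(node):
    """List the terminal labels beneath a node (a label is its own leaf)."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def internal_nodes(node):
    """Yield internal nodes root-first (preorder)."""
    if isinstance(node, str):
        return
    yield node
    for child in node:
        yield from internal_nodes(child)


def trim_inparalogs(node, seqs):
    """Bottom-up: among sister leaves from one taxon, keep the longest."""
    if isinstance(node, str):
        return node
    children = [trim_inparalogs(child, seqs) for child in node]
    best, kept = {}, []
    for child in children:
        if isinstance(child, str):
            taxon = taxon_of(child)
            if taxon in best:
                if len(seqs[child]) > len(seqs[best[taxon]]):
                    kept[kept.index(best[taxon])] = child
                    best[taxon] = child
                continue
            best[taxon] = child
        kept.append(child)
    return kept[0] if len(kept) == 1 else tuple(kept)


def find_snap_ogs(tree, seqs, occupancy=0.5):
    """Return lists of labels, one list per single-copy subgroup found."""
    min_taxa = occupancy * len({taxon_of(l) for l in leaves(tree)})
    assigned, groups = set(), []
    for node in internal_nodes(tree):
        if len({taxon_of(l) for l in leaves(node)}) < min_taxa:
            continue  # occupancy too low: examine next internal branch
        trimmed = leaves(trim_inparalogs(node, seqs))
        taxa = [taxon_of(l) for l in trimmed]
        if len(taxa) == len(set(taxa)) and not assigned & set(trimmed):
            groups.append(sorted(trimmed))
            assigned.update(trimmed)
    return groups


# Toy family: an ancient duplication yields two clades of taxa A, B, and C;
# taxon A additionally has a species-specific inparalog pair (g1, g2).
clade1 = ((("A|g1", "A|g2"), "B|g1"), "C|g1")
clade2 = (("A|g3", "B|g2"), "C|g2")
toy_tree = (clade1, clade2)
toy_seqs = {"A|g1": "MKTAYIAK", "A|g2": "MKTA", "B|g1": "MKTAYIA",
            "C|g1": "MKTAYI", "A|g3": "MRTAYL", "B|g2": "MRTAYLA",
            "C|g2": "MRTAY"}
snap_ogs = find_snap_ogs(toy_tree, toy_seqs)
```

In this toy case the root is rejected because each taxon remains in two copies after trimming, whereas each of the two child clades yields one single-copy subgroup.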

To enhance the user experience, arguments or default values are printed to the standard output, a progress bar informs the user of how much of the analysis has been completed, and the number of SNAP-OGs identified as well as the names of the outputted FASTA files are printed to the standard output.

Development practices and design principles to ensure long-term software stability

Archival instabilities among software threaten the reproducibility of bioinformatics research [60]. To ensure long-term stability of OrthoSNAP, we implemented previously established rigorous development practices and design principles [44,52,61,62]. For example, OrthoSNAP features a refactored codebase, which facilitates debugging, testing, and future development. We also implemented a continuous integration pipeline to automatically build, package, and install OrthoSNAP across Python versions 3.7, 3.8, and 3.9. The continuous integration pipeline also conducts 57 unit and integration tests, which span 95.90% of the codebase and ensure faithful function of OrthoSNAP.

Dataset generation

To generate a dataset for identifying SNAP-OGs and comparing them to SC-OGs, we first identified putative groups of orthologous genes across 4 empirical datasets. To do so, we first downloaded proteomes for each dataset, which were obtained from publicly available repositories on NCBI (S1 and S7 Tables) or figshare [51]. Each dataset varied in its sampling of sequence diversity and in the evolutionary divergence of the sampled taxa. The dataset of 24 budding yeasts spans approximately 275 million years of evolution [51]; the dataset of 36 filamentous fungi spans approximately 94 million years of evolution [63]; the dataset of 26 mammals spans approximately 160 million years of evolution [64]; and the dataset of 28 eutherian mammals—which was used to study the contentious deep evolutionary relationships among eutherian mammals—concerns an ancient divergence that occurred approximately 160 million years ago [65]. Putatively orthologous groups of genes were identified using OrthoFinder, v2.3.8 [42], with default parameters, which resulted in 46,645 orthologous groups of genes with at least 50% orthogroup occupancy (S8 Table).

To infer the evolutionary history of each orthologous group of genes, we first individually aligned and trimmed each group of sequences using MAFFT, v7.402 [66], with the “auto” parameter and ClipKIT, v1.1.3 [61], with the “smart-gap” parameter, respectively. Thereafter, we inferred the best-fitting substitution model using the Bayesian information criterion and the evolutionary history of each orthologous group of genes using IQ-TREE2, v2.0.6 [49]. Bipartition support was examined using 1,000 ultrafast bootstrap approximations [59].

To identify SNAP-OGs, the FASTA file and associated phylogenetic tree for each gene family with multiple homologs in 1 or more species were used as input for OrthoSNAP, v0.0.1 (this study). Across 40,011 gene families with multiple homologs in 1 or more species in all datasets, we identified 6,630 SNAP-OGs with at least 50% orthogroup occupancy (S1 Fig and S8 Table). Unaligned sequences of SNAP-OGs were then individually aligned and trimmed using the same strategy as described above. To determine which gene families were SC-OGs, we identified orthologous groups of genes with at least 50% orthogroup occupancy in which each species, strain, or organism was represented by only 1 sequence; 6,634 orthologous groups of genes were SC-OGs.
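The distinction between SC-OGs and gene families passed to OrthoSNAP can be expressed as a short Python sketch. The function name, the return labels, and the assumption that sequence labels follow the “taxon|gene” scheme are illustrative choices and do not reproduce the scripts used in this study.

```python
from collections import Counter


def classify_orthogroup(labels, n_total_taxa, occupancy=0.5):
    """Classify one orthologous group of genes.

    labels: sequence labels in 'taxon|gene' form (assumed naming scheme).
    n_total_taxa: number of taxa in the whole dataset.
    Returns 'below occupancy', 'SC-OG', or 'multicopy'.
    """
    copies_per_taxon = Counter(label.split("|", 1)[0] for label in labels)
    if len(copies_per_taxon) < occupancy * n_total_taxa:
        return "below occupancy"
    if all(count == 1 for count in copies_per_taxon.values()):
        return "SC-OG"
    return "multicopy"  # candidate input for SNAP-OG identification


# Toy dataset of 4 taxa.
single_copy = classify_orthogroup(["A|g1", "B|g1", "C|g1"], n_total_taxa=4)
multicopy = classify_orthogroup(["A|g1", "A|g2", "B|g1"], n_total_taxa=4)
sparse = classify_orthogroup(["A|g1"], n_total_taxa=4)
```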

Measuring and comparing information content among SC-OGs and SNAP-OGs

To compare the information content of SC-OGs and SNAP-OGs, we calculated 9 properties of multiple sequence alignments and phylogenetic trees associated with robust phylogenetic signal in the budding yeasts, filamentous fungi, and mammalian datasets (S4 Table). More specifically, we calculated information content from phylogenetic trees such as measures of tree certainty (average bootstrap support), accuracy (Robinson–Foulds distance; [67]), signal-to-noise ratios (treeness; [68]), and violation of clock-like evolution (degree of violation of a molecular clock or DVMC; [69]). Information content was also measured among multiple sequence alignments by examining alignment length and the number of parsimony-informative sites, which are associated with robust and accurate inferences of evolutionary histories [70] as well as biases in sequence composition (RCV; [68]). Lastly, information content was also evaluated using metrics that consider characteristics of phylogenetic trees and multiple sequence alignments such as the degree of saturation, which refers to multiple substitutions in multiple sequence alignments that underestimate the distance between 2 taxa [71], and treeness/RCV, a measure of signal-to-noise ratios in phylogenetic trees and sequence composition biases [68]. For tree accuracy, phylogenetic trees were compared to species trees reported in previous studies [51,63,64]. All properties were calculated using functions in PhyKIT, v1.1.2 [52]. The function used to calculate each metric and additional information are described in S4 Table.
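As an illustration of one of these properties, the Python sketch below counts parsimony-informative sites, using the standard definition of a site with at least 2 character states that each occur in at least 2 sequences. It is a minimal sketch rather than the PhyKIT function; the set of characters treated as missing data (“-”, “?”, and “X”) is an assumption suited to protein alignments.

```python
from collections import Counter

# Characters ignored when tallying states (assumption for protein data).
MISSING = {"-", "?", "X"}


def count_parsimony_informative_sites(alignment):
    """Count columns with >= 2 states that each appear in >= 2 sequences.

    alignment: list of equal-length aligned sequence strings.
    """
    if not alignment:
        return 0
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("sequences in an alignment must have equal length")
    informative = 0
    for column in zip(*alignment):
        states = Counter(char.upper() for char in column)
        shared = [
            state for state, count in states.items()
            if state not in MISSING and count >= 2
        ]
        if len(shared) >= 2:
            informative += 1
    return informative


# Toy alignment: columns 2 and 3 (1-based) are parsimony informative.
toy_alignment = ["MKTAY", "MKSAY", "MRSA-", "MRTAF"]
pi_sites = count_parsimony_informative_sites(toy_alignment)
alignment_length = len(toy_alignment[0])
```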

Principal component analysis across the 9 properties that summarize phylogenetic information content was used to qualitatively compare SC-OGs and SNAP-OGs in reduced dimensional space. Principal component analysis, visualization, and determination of property contribution to each principal component were conducted using factoextra, v1.0.7 [72], and FactoMineR, v2.4 [73], in the R, v4.0.2 (https://cran.r-project.org/), programming environment. A multifactor ANOVA, conducted using the aov() function in R, was used to quantitatively compare SC-OGs and SNAP-OGs.

Information theory-based approaches were used to evaluate incongruence among SC-OGs and SNAP-OGs phylogenetic trees. More specifically, we calculated tree certainty and tree certainty-all [74–76], which are conceptually similar to entropy values and are derived from examining support among a set of gene trees and the 2 most supported topologies or all topologies that occur with a frequency of ≥5%, respectively. More simply, tree certainty values range from 0 to 1 in which low values are indicative of low congruence among gene trees and high values are indicative of high congruence among gene trees. Tree certainty and tree certainty-all values were calculated using RAxML, v8.2.10 [77].
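The entropy-like nature of these measures can be shown with a small Python sketch of internode certainty, computed as 1 plus the sum of p × log_n(p) over the n conflicting bipartitions considered, where p is each bipartition's relative frequency. This is a simplified illustration and not RAxML's implementation: it assumes the reference bipartition is the most frequent one (so values are not negative), and it takes the relative tree certainty to be the mean across internodes. The function names and toy frequencies are hypothetical.

```python
import math


def internode_certainty(frequencies):
    """Certainty of one internode from counts of conflicting bipartitions.

    frequencies: gene-tree counts for the bipartitions considered. Passing the
    2 most frequent gives an IC-like value; passing all that pass a frequency
    cutoff gives an ICA-like value.
    """
    counts = [count for count in frequencies if count > 0]
    if len(counts) < 2:
        return 1.0  # no conflict observed
    total = sum(counts)
    n_states = len(counts)
    return 1.0 + sum(
        (count / total) * math.log(count / total, n_states) for count in counts
    )


def relative_tree_certainty(internodes):
    """Mean internode certainty across all internodes (assumed normalization)."""
    values = [internode_certainty(freqs) for freqs in internodes]
    return sum(values) / len(values)


no_conflict = internode_certainty([100])
even_conflict = internode_certainty([50, 50])
some_conflict = internode_certainty([75, 25])
toy_tree_certainty = relative_tree_certainty([[100], [50, 50]])
```

Equal support for 2 conflicting bipartitions gives a value of 0, and the absence of conflict gives 1, which matches the interpretation of low and high congruence described above.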

To examine patterns of support in a contentious branch concerning deep evolutionary relationships among eutherian mammals, we calculated gene support frequencies and ΔGLS. Gene support frequencies were calculated using the “polytomy_test” function in PhyKIT, v1.1.2 [52]. To account for uncertainty in gene tree topology, we also examined patterns of gene support frequencies after collapsing bipartitions with ultrafast bootstrap approximation support lower than 75 using the “collapse” function in PhyKIT. To calculate gene-wise log likelihood values, partition log-likelihoods were calculated using the “wpl” parameter in IQ-TREE2 [49], which required as input a phylogeny in Newick format that represented either hypothesis 1, 2, or 3 (Fig 4A) and a concatenated alignment of SC-OGs and SNAP-OGs with partition information. Thereafter, the log likelihood values were used to assign genes to the topology they best supported. Inconclusive genes, defined as having a gene-wise log likelihood difference of less than 0.01, were removed.
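The assignment of genes to the topology they best support can be sketched in Python as follows. The input format, function name, and toy values are hypothetical, and the sketch assumes that the “gene-wise log likelihood difference” refers to the difference between the 2 best-scoring hypotheses for a gene.

```python
from collections import Counter


def assign_genes(gene_lnl, min_difference=0.01):
    """Map each conclusive gene to the hypothesis with the highest lnL.

    gene_lnl: dict of gene -> {hypothesis: gene-wise log likelihood}.
    Genes whose 2 best hypotheses differ by less than min_difference are
    treated as inconclusive and removed.
    """
    support = {}
    for gene, scores in gene_lnl.items():
        ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
        (best, best_lnl), (_, second_lnl) = ranked[0], ranked[1]
        if best_lnl - second_lnl < min_difference:
            continue  # inconclusive gene
        support[gene] = best
    return support


# Toy gene-wise log likelihoods under 3 hypotheses (invented values).
toy_lnl = {
    "gene1": {"H1": -1000.0, "H2": -1002.5, "H3": -1003.0},
    "gene2": {"H1": -2000.005, "H2": -2000.0, "H3": -2004.0},
    "gene3": {"H1": -1500.0, "H2": -1499.0, "H3": -1498.0},
}
assignments = assign_genes(toy_lnl)
genes_per_hypothesis = Counter(assignments.values())
```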

The same methodologies (orthology inference, multiple-sequence alignment, trimming, tree inference, SNAP-OG identification, and phylogenetic information content calculations) were also applied to 3 additional complex datasets: 30 plants (with a history of extensive gene duplication and loss events), 30 budding yeast species (15 of which experienced whole-genome duplication), and 20 choanoflagellate transcriptomes (where typically multiple transcripts correspond to a single protein-coding gene) [31,32].

Supporting information

S1 Fig. Numbers of orthogroups, single-copy orthogroups, orthogroups with 1 or more homologs in 1 species, and the number of SNAP-OGs identified for each dataset.

(A) The total number of orthogroups with at least 50% ortholog occupancy for each dataset. (B) The number of single-copy orthologs (SC-OGs) for each dataset (with at least 50% taxon occupancy). (C) The number of multicopy orthologs (or orthologous groups of genes wherein 1 or more species is represented by 2 or more sequences; MC-OGs) for each dataset (with at least 50% taxon occupancy). (D) The number of SNAP-OGs identified in each dataset (with at least 50% taxon occupancy). Note that the numbers depicted in panel A reflect the sum of the numbers of SC-OGs and MC-OGs in panels B and C. The data underlying this figure can be found in figshare (doi: 10.6084/m9.figshare.16875904).

(TIF)

S2 Fig. The number of SNAP-OGs identified in orthologous groups of genes with 2 or more homologs in 1 or more species.

The number of SNAP-OGs per orthologous group of genes is depicted on the x-axis. For example, in the budding yeasts dataset, 977 gene families had 1 SNAP-OG each. The highest number of SNAP-OGs identified in a single orthologous group of genes in each dataset were as follows: in budding yeasts, 5 SNAP-OGs were identified in 1 orthologous group of genes that encode transcriptional activators; in filamentous fungi, 5 SNAP-OGs were identified in each of 2 orthologous groups of genes that encode multifacilitator superfamily transporters and amino acid permeases; and in mammals, 4 SNAP-OGs were identified in each of 3 orthologous groups of genes that encode voltage-gated potassium channels, casein kinases, and a tropomyosin family of actin-binding proteins. The data underlying this figure can be found in figshare (doi: 10.6084/m9.figshare.16875904).

(TIF)

S3 Fig. The 10 most frequent best-fitting substitution models are similar between SC-OGs and SNAP-OGs.

The top 10 most frequently observed best-fitting substitution models were similar between SC-OGs and SNAP-OGs among (A) 1,668 SC-OGs and 1,392 SNAP-OGs in budding yeasts, (B) 4,393 SC-OGs and 2,035 SNAP-OGs in filamentous fungi, and (C) 321 SC-OGs and 1,775 SNAP-OGs in mammals. For example, the LG+F+I+G4 model was the most frequently observed best-fitting substitution model in SC-OGs and SNAP-OGs from budding yeasts. The data underlying this figure can be found in figshare (doi: 10.6084/m9.figshare.16875904).

(TIF)

S4 Fig. Distributions of information content among SNAP-OGs and SC-OGs.

Boxplot and violin plot distributions of 9 properties representative of phylogenetic information are depicted for SNAP-OGs (blue) and SC-OGs (orange) in the (A) 1,668 SC-OGs and 1,392 SNAP-OGs in budding yeasts, (B) 4,393 SC-OGs and 2,035 SNAP-OGs in filamentous fungi, and (C) 321 SC-OGs and 1,775 SNAP-OGs in mammals. Abbreviations are as follows: average bootstrap support (ABS), degree of violation of the molecular clock (DVMC), relative composition variability (RCV), Robinson-Foulds distance (RF distance), alignment length (Aln. len.), the number of parsimony informative sites (PI sites), saturation, treeness (tness), and treeness/RCV (tness/RCV). The data underlying this figure can be found in figshare (doi: 10.6084/m9.figshare.16875904).

(TIF)

S5 Fig. Quality of representation and contributions of properties of phylogenetic information content during principal component analysis.

Principal component analysis was used to qualitatively compare the similarities and differences between SNAP-OGs and SC-OGs (Fig 3). The leftmost figure in each panel of budding yeasts (A), filamentous fungi (B), and mammals (C) represents the quality of representation for each property across all principal components. The next 2 figures depict the contribution of each property (or variable) to the first and second dimension in reduced dimensional space. The red dashed line represents equal contributions from each variable. The data underlying this figure can be found in figshare (doi: 10.6084/m9.figshare.16875904).

(TIF)

S6 Fig. The number of SNAP-OGs identified in an orthologous group of genes with 2 or more homologs in 1 or more species for the dataset used to examine a contentious branch in the tree of life.

The number of SNAP-OGs per orthologous group of genes is depicted on the x-axis. For example, a single SNAP-OG was identified in 1,330 gene families with 2 or more homologs in 1 or more species, whereas 4 SNAP-OGs were identified in 2 gene families with 2 or more homologs in 1 or more species. The data underlying this figure can be found in figshare (doi: 10.6084/m9.figshare.16875904).

(TIF)

S7 Fig. The 10 most frequently observed best-fitting substitution models are similar between SC-OGs and SNAP-OGs in the dataset used to examine a contentious branch in the tree of life.

Similar best-fitting substitution models were observed between 252 SC-OGs and 1,428 SNAP-OGs in a dataset of mammals, which was used to investigate patterns of support in a contentious branch in the tree of life concerning deep evolutionary relationships among placental mammals. The data underlying this figure can be found in figshare (doi: 10.6084/m9.figshare.16875904).

(TIF)

S8 Fig. Cartoon comparison of different tree decomposition algorithms.

Using the phylogeny presented in Fig 1B (panel A) and Fig 2B (panel B), different tree decomposition algorithms are compared. (A) OrthoSNAP will identify 4 SNAP-OGs, whereas DISCO and the maximally inclusive strategies will each identify 3 subgroups of orthologous genes. PhyloTreePruner will not identify any subgroups of single-copy orthologous genes. (B) OrthoSNAP will identify 5 subgroups of single-copy orthologous genes (light blue) by identifying maximally inclusive subgroups—subtrees where each taxon is represented by a single sequence—and maximally inclusive subgroups after species-specific inparalog trimming (species-specific inparalogs are shown in orange). In contrast, DISCO and maximally inclusive strategies will identify 3 SC-OGs, in part, because they do not account for species-specific inparalogs. PhyloTreePruner, which only prunes species-specific inparalogs, will not identify any subgroups of single-copy orthologous genes due to the presence of more ancient duplication events.

(TIF)

S1 Table. Species and accession numbers for proteomes used in each dataset.

This table details the species used for the budding yeasts, filamentous fungi, and mammalian datasets. All proteomes from budding yeasts were downloaded from Shen and colleagues [51]. Proteomes from filamentous fungi and mammals were downloaded from NCBI, and their accessions and assembly names are provided.

(XLSX)

S2 Table. Number of orthogroups examined.

A table of the number of orthogroups, the number of SC-OGs, the number of gene families with orthologs and paralogs (MC-OGs), and the number of SNAP-OGs examined in the present study.

(XLSX)

S3 Table. Ortholog occupancy for each dataset.

A table summarizing the average and standard deviation of taxon completeness in SC-OGs and SNAP-OGs.

(XLSX)

S4 Table. Nine properties of phylogenetic information content.

Phylogenetic information content of SC-OGs and SNAP-OGs were examined using the 9 properties described here. The abbreviation, description, additional notes, and function in PhyKIT used to calculate each property are listed here.

(XLSX)

S5 Table. Multifactor analysis of variance results reveals no substantial differences between SC-OGs and SNAP-OGs.

Degrees of freedom, sum of squares, mean square, F-value, and p-value for multifactorial analysis of variance are shown here. Multifactorial analysis of variance was conducted both while accounting for potential interaction effects and using an additive model, which does not account for interaction effects.

(XLSX)

S6 Table. Tree certainty and tree certainty-all results.

Examining tree certainty and tree certainty-all revealed similar levels of incongruence among gene trees inferred using SC-OGs and SNAP-OGs.

(XLSX)

S7 Table. Dataset for examining deep evolutionary relationships among eutherian mammals.

The NCBI accession, assembly name, name in files, and ingroup/outgroup designations are detailed here for each proteome used.

(XLSX)

S8 Table. Number of orthogroups examined among eutherian mammals.

A table of the number of orthogroups, the number of SC-OGs, the number of gene families with orthologs and paralogs (MC-OGs), and the number of SNAP-OGs examined among eutherian mammals.

(XLSX)

S9 Table. Gene support frequency results among ancient eutherian mammalian relationships.

Gene support frequency results reveal similar levels of support between the 3 hypotheses concerning deep evolutionary divergences among mammals. Multitest corrected p-values are also shown here.

(XLSX)

S10 Table. Comparison between different algorithms that identify subgroups of orthologous genes or conduct species-specific inparalog trimming.

Notably, OrthoSNAP provides the most user flexibility and handles the most use cases.

(XLSX)

Acknowledgments

We thank the Rokas lab for helpful discussion and feedback.

Data Availability

All results and data presented in this study are available from figshare (doi: 10.6084/m9.figshare.16875904).

Funding Statement

J.L.S. and A.R. were funded by the Howard Hughes Medical Institute through the James H. Gilliam Fellowships for Advanced Study program. Research in A.R.’s lab is supported by grants from the National Science Foundation (DEB-2110404), the National Institutes of Health/National Institute of Allergy and Infectious Diseases (R56 AI146096 and R01 AI153356), and the Burroughs Wellcome Fund. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. Rokas A, Williams BL, King N, Carroll SB. Genome-scale approaches to resolving incongruence in molecular phylogenies. Nature. 2003;425:798–804. doi: 10.1038/nature02053
  • 2. Jeffares DC, Tomiczek B, Sojo V, dos Reis M. A Beginners Guide to Estimating the Non-synonymous to Synonymous Rate Ratio of all Protein-Coding Genes in a Genome. 2015. p. 65–90. doi: 10.1007/978-1-4939-1438-8_4
  • 3. Steenwyk JL, Phillips MA, Yang F, Date SS, Graham TR, Berman J, et al. A gene coevolution network provides insight into eukaryotic cellular and genomic structure and function. bioRxiv. 2021; 2021.07.09.451830. doi: 10.1101/2021.07.09.451830
  • 4. Li Z, De La Torre AR, Sterck L, Cánovas FM, Avila C, Merino I, et al. Single-Copy Genes as Molecular Markers for Phylogenomic Studies in Seed Plants. Genome Biol Evol. 2017;9:1130–1147. doi: 10.1093/gbe/evx070
  • 5. Dong Y, Chen S, Cheng S, Zhou W, Ma Q, Chen Z, et al. Natural selection and repeated patterns of molecular evolution following allopatric divergence. Elife. 2019;8. doi: 10.7554/eLife.45199
  • 6. Wu J, Yonezawa T, Kishino H. Rates of Molecular Evolution Suggest Natural History of Life History Traits and a Post-K-Pg Nocturnal Bottleneck of Placentals. Curr Biol. 2017;27:3025–3033.e5. doi: 10.1016/j.cub.2017.08.043
  • 7. Malnic B, Godfrey PA, Buck LB. The human olfactory receptor gene family. Proc Natl Acad Sci. 2004;101:2584–2589. doi: 10.1073/pnas.0307882100
  • 8. Niimura Y, Matsui A, Touhara K. Extreme expansion of the olfactory receptor gene repertoire in African elephants and evolutionary dynamics of orthologous gene groups in 13 placental mammals. Genome Res. 2014;24:1485–1496. doi: 10.1101/gr.169532.113
  • 9. Ozcan S, Johnston M. Function and regulation of yeast hexose transporters. Microbiol Mol Biol Rev. 1999;63:554–569. doi: 10.1128/MMBR.63.3.554-569.1999
  • 10. Wingender E, Schoeps T, Dönitz J. TFClass: an expandable hierarchical classification of human transcription factors. Nucleic Acids Res. 2013;41:D165–D170. doi: 10.1093/nar/gks1123
  • 11. Emms DM, Kelly S. STAG: Species Tree Inference from All Genes. bioRxiv. 2018;267914. doi: 10.1101/267914
  • 12. Thomas GWC, Dohmen E, Hughes DST, Murali SC, Poelchau M, Glastad K, et al. Gene content evolution in the arthropods. Genome Biol. 2020;21:15. doi: 10.1186/s13059-019-1925-7
  • 13. Smith ML, Hahn MW. New Approaches for Inferring Phylogenies in the Presence of Paralogs. Trends Genet. 2021;37:174–187. doi: 10.1016/j.tig.2020.08.012
  • 14. Zhang C, Scornavacca C, Molloy EK, Mirarab S. ASTRAL-Pro: Quartet-Based Species-Tree Inference despite Paralogy. Thorne J, editor. Mol Biol Evol. 2020;37:3292–3307. doi: 10.1093/molbev/msaa139
  • 15. Willson J, Roddur MS, Liu B, Zaharias P, Warnow T. DISCO: Species Tree Inference using Multicopy Gene Family Tree Decomposition. Hahn M, editor. Syst Biol. 2021. doi: 10.1093/sysbio/syab070
  • 16. Morel B, Schade P, Lutteropp S, Williams TA, Szöllősi GJ, Stamatakis A. SpeciesRax: A tool for maximum likelihood species tree inference from gene family trees under duplication, transfer, and loss. bioRxiv. 2021; 2021.03.29.437460. doi: 10.1101/2021.03.29.437460
  • 17. Boussau B, Szollosi GJ, Duret L, Gouy M, Tannier E, Daubin V. Genome-scale coestimation of species and gene trees. Genome Res. 2013;23:323–330. doi: 10.1101/gr.141978.112
  • 18. de Oliveira Martins L, Posada D. Species Tree Estimation from Genome-Wide Data with guenomu. 2017. p. 461–478. doi: 10.1007/978-1-4939-6622-6_18
  • 19. Kocot KM, Citarella MR, Moroz LL, Halanych KM. PhyloTreePruner: A phylogenetic tree-based approach for selection of orthologous sequences for phylogenomics. Evol Bioinform Online. 2013;2013:429–435. doi: 10.4137/EBO.S12813
  • 20. Dunn CW, Howison M, Zapata F. Agalma: an automated phylogenomics workflow. BMC Bioinformatics. 2013;14:330. doi: 10.1186/1471-2105-14-330
  • 21. Train C-M, Glover NM, Gonnet GH, Altenhoff AM, Dessimoz C. Orthologous Matrix (OMA) algorithm 2.0: more robust to asymmetric evolutionary rates and more scalable hierarchical orthologous group inference. Bioinformatics. 2017;33:i75–i82. doi: 10.1093/bioinformatics/btx229
  • 22. Schuh RT, Polhemus JT. Analysis of Taxonomic Congruence among Morphological, Ecological, and Biogeographic Data Sets for the Leptopodomorpha (Hemiptera). Syst Biol. 1980;29:1–26. doi: 10.1093/sysbio/29.1.1
  • 23. Phillips MJ, Penny D. The root of the mammalian tree inferred from whole mitochondrial genomes. Mol Phylogenet Evol. 2003;28:171–185. doi: 10.1016/s1055-7903(03)00057-5
  • 24. Defoort J, Van de Peer Y, Carretero-Paulet L. The evolution of gene duplicates in angiosperms and the impact of protein-protein interactions and the mechanism of duplication. Golding B, editor. Genome Biol Evol. 2019. doi: 10.1093/gbe/evz156
  • 25. De Smet R, Adams KL, Vandepoele K, Van Montagu MCE, Maere S, Van de Peer Y. Convergent gene loss following gene and genome duplications creates single-copy families in flowering plants. Proc Natl Acad Sci. 2013;110:2898–2903. doi: 10.1073/pnas.1300127110
  • 26. Panchy N, Lehti-Shiu M, Shiu S-H. Evolution of Gene Duplication in Plants. Plant Physiol. 2016;171:2294–2316. doi: 10.1104/pp.16.00523
  • 27. Scannell DR, Byrne KP, Gordon JL, Wong S, Wolfe KH. Multiple rounds of speciation associated with reciprocal gene loss in polyploid yeasts. Nature. 2006;440:341–345. doi: 10.1038/nature04562
  • 28. Wolfe KH. Origin of the Yeast Whole-Genome Duplication. PLoS Biol. 2015;13:e1002221. doi: 10.1371/journal.pbio.1002221
  • 29. Wolfe KH, Shields DC. Molecular evidence for an ancient duplication of the entire yeast genome. Nature. 1997;387:708–713. doi: 10.1038/42711
  • 30. Marcet-Houben M, Gabaldón T. Beyond the Whole-Genome Duplication: Phylogenetic Evidence for an Ancient Interspecies Hybridization in the Baker’s Yeast Lineage. Hurst LD, editor. PLoS Biol. 2015;13:e1002220. doi: 10.1371/journal.pbio.1002220
  • 31. Richter DJ, Fozouni P, Eisen MB, King N. Gene family innovation, conservation and loss on the animal stem lineage. Elife. 2018;7. doi: 10.7554/eLife.34226
  • 32. Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotechnol. 2011;29:644–652. doi: 10.1038/nbt.1883
  • 33. Hallström BM, Kullberg M, Nilsson MA, Janke A. Phylogenomic Data Analyses Provide Evidence that Xenarthra and Afrotheria Are Sister Groups. Mol Biol Evol. 2007;24:2059–2068. doi: 10.1093/molbev/msm136
  • 34. Wildman DE, Uddin M, Opazo JC, Liu G, Lefort V, Guindon S, et al. Genomics, biogeography, and the diversification of placental mammals. Proc Natl Acad Sci. 2007;104:14395–14400. doi: 10.1073/pnas.0704342104
  • 35. Murphy WJ. Resolution of the Early Placental Mammal Radiation Using Bayesian Phylogenetics. Science. 2001;294:2348–2351. doi: 10.1126/science.1067179
  • 36. Murphy WJ, Eizirik E, Johnson WE, Zhang YP, Ryder OA, O’Brien SJ. Molecular phylogenetics and the origins of placental mammals. Nature. 2001;409:614–618. doi: 10.1038/35054550
  • 37. Smith ML, Vanderpool D, Hahn MW. Using all gene families vastly expands data available for phylogenomic inference in primates. bioRxiv. 2021; 2021.09.22.461252. doi: 10.1101/2021.09.22.461252
  • 38. van der Heijden RT, Snel B, van Noort V, Huynen MA. Orthology prediction at scalable resolution by phylogenetic tree analysis. BMC Bioinformatics. 2007;8:83. doi: 10.1186/1471-2105-8-83
  • 39. Jarvis ED, Mirarab S, Aberer AJ, Li B, Houde P, Li C, et al. Whole-genome analyses resolve early branches in the tree of life of modern birds. Science. 2014;346:1320–1331. doi: 10.1126/science.1253451
  • 40. Steenwyk JL, Lind AL, Ries LNA, dos Reis TF, Silva LP, Almeida F, et al. Pathogenic Allodiploid Hybrids of Aspergillus Fungi. Curr Biol. 2020;30:2495–2507.e7. doi: 10.1016/j.cub.2020.04.071
  • 41. Meleshko O, Martin MD, Korneliussen TS, Schröck C, Lamkowski P, Schmutz J, et al. Extensive Genome-Wide Phylogenetic Discordance Is Due to Incomplete Lineage Sorting and Not Ongoing Introgression in a Rapidly Radiated Bryophyte Genus. Mol Biol Evol. 2021;38:2750–2766. doi: 10.1093/molbev/msab063
  • 42. Emms DM, Kelly S. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol. 2019;20:238. doi: 10.1186/s13059-019-1832-y
  • 43. Li L, Stoeckert CJ, Roos DS. OrthoMCL: Identification of ortholog groups for eukaryotic genomes. Genome Res. 2003;13:2178–2189. doi: 10.1101/gr.1224503
  • 44. Steenwyk JL, Rokas A. orthofisher: a broadly applicable tool for automated gene identification and retrieval. Comeron JM, editor. G3 (Bethesda). 2021;11. doi: 10.1093/g3journal/jkab250
  • 45. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, et al. BLAST+: architecture and applications. BMC Bioinformatics. 2009;10:421. doi: 10.1186/1471-2105-10-421
  • 46. Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010;26:2460–2461. doi: 10.1093/bioinformatics/btq461
  • 47.Eddy SR. Accelerated Profile HMM Searches. Pearson WR, editor. PLoS Comput Biol. 2011;7:e1002195. doi: 10.1371/journal.pcbi.1002195 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Price MN, Dehal PS, Arkin AP. FastTree 2—Approximately maximum-likelihood trees for large alignments. PLoS ONE. 2010;5. doi: 10.1371/journal.pone.0009490 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49.Minh BQ, Schmidt HA, Chernomor O, Schrempf D, Woodhams MD, von Haeseler A, et al. IQ-TREE 2: New Models and Efficient Methods for Phylogenetic Inference in the Genomic Era. Teeling E, editor. Mol Biol Evol. 2020;37:1530–1534. doi: 10.1093/molbev/msaa015 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Zhou X, Shen X-X, Hittinger CT, Rokas A. Evaluating Fast Maximum Likelihood-Based Phylogenetic Programs Using Empirical Phylogenomic Data Sets. Mol Biol Evol. 2018;35:486–503. doi: 10.1093/molbev/msx302 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Shen X-X, Opulente DA, Kominek J, Zhou X, Steenwyk JL, Buh KV, et al. Tempo and Mode of Genome Evolution in the Budding Yeast Subphylum. Cell. 2018;175:1533–1545.e20. doi: 10.1016/j.cell.2018.10.023 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Steenwyk JL, Buida TJ, Labella AL, Li Y, Shen X-X, Rokas A. PhyKIT: a broadly applicable UNIX shell toolkit for processing and analyzing phylogenomic data. Schwartz R, editor. Bioinformatics (Oxford, England). 2021. doi: 10.1093/bioinformatics/btab096 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Tice AK, Žihala D, Pánek T, Jones RE, Salomaki ED, Nenarokov S, et al. PhyloFisher: A phylogenomic package for resolving eukaryotic relationships. Hejnol A, editor. PLoS Biol. 2021;19:e3001365. doi: 10.1371/journal.pbio.3001365 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.Ballesteros JA, Hormiga G. A New Orthology Assessment Method for Phylogenomic Data: Unrooted Phylogenetic Orthology. Mol Biol Evol. 2016;33:2117–2134. doi: 10.1093/molbev/msw069 [DOI] [PubMed] [Google Scholar]
  • 55.Yang Y, Smith SA. Orthology Inference in Nonmodel Organisms Using Transcriptomes and Low-Coverage Genomes: Improving Accuracy and Matrix Occupancy for Phylogenomics. Mol Biol Evol. 2014;31:3081–3092. doi: 10.1093/molbev/msu245 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.Shen X-X, Steenwyk JL, LaBella AL, Opulente DA, Zhou X, Kominek J, et al. Genome-scale phylogeny and contrasting modes of genome evolution in the fungal phylum Ascomycota. Sci Adv. 2020;6:eabd0079. doi: 10.1126/sciadv.abd0079 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57.Steenwyk JL, Opulente DA, Kominek J, Shen X-X, Zhou X, Labella AL, et al. Extensive loss of cell-cycle and DNA repair genes in an ancient lineage of bipolar budding yeasts. Kamoun S, editor. PLoS Biol. 2019;17:e3000255. doi: 10.1371/journal.pbio.3000255 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Vakirlis N, Sarilar V, Drillon G, Fleiss A, Agier N, Meyniel J-P, et al. Reconstruction of ancestral chromosome architecture and gene repertoire reveals principles of genome evolution in a model yeast genus. Genome Res. 2016;26:918–932. doi: 10.1101/gr.204420.116 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59.Hoang DT, Chernomor O, von Haeseler A, Minh BQ, Vinh LS. UFBoot2: Improving the Ultrafast Bootstrap Approximation. Mol Biol Evol. 2018;35:518–522. doi: 10.1093/molbev/msx281 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 60.Mangul S, Martin LS, Eskin E, Blekhman R. Improving the usability and archival stability of bioinformatics software. Genome Biol. 2019;20:47. doi: 10.1186/s13059-019-1649-8 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61.Steenwyk JL, Buida TJ, Li Y, Shen X-X, Rokas A. ClipKIT: A multiple sequence alignment trimming software for accurate phylogenomic inference. Hejnol A, editor. PLoS Biol. 2020;18: e3001007. doi: 10.1371/journal.pbio.3001007 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62.Steenwyk JL, Buida TJ, Gonçalves C, Goltz DC, Morales G, Mead ME, et al. BioKIT: a versatile toolkit for processing and analyzing diverse types of sequence data. Stajich J, editor. Genetics. 2022. doi: 10.1093/genetics/iyac079 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 63.Steenwyk JL, Shen X-X, Lind AL, Goldman GH, Rokas A. A Robust Phylogenomic Time Tree for Biotechnologically and Medically Important Fungi in the Genera Aspergillus and Penicillium. Boyle JP, editor. MBio. 2019;10. doi: 10.1128/mBio.00925-19 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 64.Tarver JE, dos Reis M, Mirarab S, Moran RJ, Parker S, O’Reilly JE, et al. The Interrelationships of Placental Mammals and the Limits of Phylogenetic Inference. Genome Biol Evol. 2016;8:330–344. doi: 10.1093/gbe/evv261 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 65.Luo Z-X, Yuan C-X, Meng Q-J, Ji Q. A Jurassic eutherian mammal and divergence of marsupials and placentals. Nature. 2011;476:442–445. doi: 10.1038/nature10291 [DOI] [PubMed] [Google Scholar]
  • 66.Katoh K, Standley DM. MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Mol Biol Evol. 2013;30:772–780. doi: 10.1093/molbev/mst010 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 67.Robinson DF, Foulds LR. Comparison of phylogenetic trees. Math Biosci. 1981;53:131–147. doi: 10.1016/0025-5564(81)90043-2 [DOI] [Google Scholar]
  • 68.Phillips MJ, Penny D. The root of the mammalian tree inferred from whole mitochondrial genomes. Mol Phylogenet Evol. 2003;28:171–185. doi: 10.1016/s1055-7903(03)00057-5 [DOI] [PubMed] [Google Scholar]
  • 69.Liu L, Zhang J, Rheindt FE, Lei F, Qu Y, Wang Y, et al. Genomic evidence reveals a radiation of placental mammals uninterrupted by the KPg boundary. Proc Natl Acad Sci. 2017;114:E7282–E7290. doi: 10.1073/pnas.1616744114 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 70.Shen X-X, Salichos L, Rokas A. A Genome-Scale Investigation of How Sequence, Function, and Tree-Based Gene Properties Influence Phylogenetic Inference. Genome Biol Evol. 2016;8:2565–2580. doi: 10.1093/gbe/evw179 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 71.Philippe H, Brinkmann H, Lavrov DV, Littlewood DTJ, Manuel M, Wörheide G, et al. Resolving Difficult Phylogenetic Questions: Why More Sequences Are Not Enough. Penny D, editor. PLoS Biol. 2011;9:e1000602. doi: 10.1371/journal.pbio.1000602 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72.Kassambara A, Mundt F. factoextra. R package, v. 1.0.5. 2017. [Google Scholar]
  • 73.Lê S, Josse J, Husson F. FactoMineR: An R Package for Multivariate Analysis. J Stat Softw. 2008;25:1–18. doi: 10.18637/jss.v025.i01 [DOI] [Google Scholar]
  • 74.Salichos L, Rokas A. Inferring ancient divergences requires genes with strong phylogenetic signals. Nature. 2013;497:327–331. doi: 10.1038/nature12130 [DOI] [PubMed] [Google Scholar]
  • 75.Salichos L, Stamatakis A, Rokas A. Novel Information Theory-Based Measures for Quantifying Incongruence among Phylogenetic Trees. Mol Biol Evol. 2014;31:1261–1271. doi: 10.1093/molbev/msu061 [DOI] [PubMed] [Google Scholar]
  • 76.Kobert K, Salichos L, Rokas A, Stamatakis A. Computing the Internode Certainty and Related Measures from Partial Gene Trees. Mol Biol Evol. 2016;33:1606–1617. doi: 10.1093/molbev/msw040 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 77.Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014;30:1312–1313. doi: 10.1093/bioinformatics/btu033 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 78.Song S, Liu L, Edwards SV, Wu S. Resolving conflict in eutherian mammal phylogeny using phylogenomics and the multispecies coalescent model. Proc Natl Acad Sci. 2012;109:14942–14947. doi: 10.1073/pnas.1211733109 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 79.Doyle VP, Young RE, Naylor GJP, Brown JM. Can We Identify Genes with Increased Phylogenetic Reliability? Syst Biol. 2015;64:824–837. doi: 10.1093/sysbio/syv041 [DOI] [PubMed] [Google Scholar]

Decision Letter 0

Roland G Roberts

9 Nov 2021

Dear Antonis,

Thank you for submitting your manuscript entitled "orthoSNAP: a tree splitting and pruning algorithm for retrieving single-copy orthologs from gene family trees" for consideration as a Methods and Resources paper by PLOS Biology.

Your manuscript has now been evaluated by the PLOS Biology editorial staff, as well as by an academic editor with relevant expertise, and I'm writing to let you know that we would like to send your submission out for external peer review.

However, before we can send your manuscript to reviewers, we need you to complete your submission by providing the metadata that is required for full assessment. To this end, please login to Editorial Manager where you will find the paper in the 'Submissions Needing Revisions' folder on your homepage. Please click 'Revise Submission' from the Action Links and complete all additional questions in the submission questionnaire.

Once your full submission is complete, your paper will undergo a series of checks in preparation for peer review. Once your manuscript has passed the checks it will be sent out for review.

If your manuscript has been previously reviewed at another journal, PLOS Biology is willing to work with those reviews in order to avoid re-starting the process. Submission of the previous reviews is entirely optional and our ability to use them effectively will depend on the willingness of the previous journal to confirm the content of the reports and share the reviewer identities. Please note that we reserve the right to invite additional reviewers if we consider that additional/independent reviewers are needed, although we aim to avoid this as far as possible. In our experience, working with previous reviews does save time.

If you would like to send your previous reviewer reports to us, please specify this in the cover letter, mentioning the name of the previous journal and the manuscript ID the study was given, and include a point-by-point response to reviewers that details how you have or plan to address the reviewers' concerns. Please contact me at the email that can be found below my signature if you have questions.

Please re-submit your manuscript within two working days, i.e. by Nov 11 2021 11:59PM.

Login to Editorial Manager here: https://www.editorialmanager.com/pbiology

During resubmission, you will be invited to opt-in to posting your pre-review manuscript as a bioRxiv preprint. Visit http://journals.plos.org/plosbiology/s/preprints for full details. If you consent to posting your current manuscript as a preprint, please upload a single Preprint PDF when you re-submit.

Given the disruptions resulting from the ongoing COVID-19 pandemic, please expect delays in the editorial process. We apologise in advance for any inconvenience caused and will do our best to minimize impact as far as possible.

Feel free to email us at plosbiology@plos.org if you have any queries relating to your submission.

Kind regards,

Roli

Roland Roberts

Senior Editor

PLOS Biology

rroberts@plos.org

Decision Letter 1

Roland G Roberts

17 Jan 2022

Dear Dr Rokas,

Thank you for submitting your manuscript "orthoSNAP: a tree splitting and pruning algorithm for retrieving single-copy orthologs from gene family trees" for consideration as a Methods and Resources paper at PLOS Biology. Your manuscript has been evaluated by the PLOS Biology editors, an Academic Editor with relevant expertise, and by three independent reviewers.

You'll see that each of the reviewers is broadly positive about your study, but each raises a number of concerns that need to be addressed before further consideration. For example, you’ll see that Reviewer #1 wants you to clarify how the method differs from others, Reviewer #2 worries that your method is simplistic and might break down when faced with more complex scenarios (and wants to be convinced otherwise), and Reviewer #3 found orthoSNAP easy to use, but queries the decision to always retain the longest orthologue, questions the power of your stats, and wonders whether you could incorporate synteny.

In light of the reviews (below), we will not be able to accept the current version of the manuscript, but we would welcome re-submission of a much-revised version that takes into account the reviewers' comments. We cannot make any decision about publication until we have seen the revised manuscript and your response to the reviewers' comments. Your revised manuscript is also likely to be sent for further evaluation by the reviewers.

We expect to receive your revised manuscript within 3 months.

Please email us (plosbiology@plos.org) if you have any questions or concerns, or would like to request an extension. At this stage, your manuscript remains formally under active consideration at our journal; please notify us by email if you do not intend to submit a revision so that we may end consideration of the manuscript at PLOS Biology.

**IMPORTANT - SUBMITTING YOUR REVISION**

Your revisions should address the specific points made by each reviewer. Please submit the following files along with your revised manuscript:

1. A 'Response to Reviewers' file - this should detail your responses to the editorial requests, present a point-by-point response to all of the reviewers' comments, and indicate the changes made to the manuscript.

*NOTE: In your point by point response to the reviewers, please provide the full context of each review. Do not selectively quote paragraphs or sentences to reply to. The entire set of reviewer comments should be present in full and each specific point should be responded to individually, point by point.

You should also cite any additional relevant literature that has been published since the original submission and mention any additional citations in your response.

2. In addition to a clean copy of the manuscript, please also upload a 'track-changes' version of your manuscript that specifies the edits made. This should be uploaded as a "Related" file type.

*Re-submission Checklist*

When you are ready to resubmit your revised manuscript, please refer to this re-submission checklist: https://plos.io/Biology_Checklist

To submit a revised version of your manuscript, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' where you will find your submission record.

Please make sure to read the following important policies and guidelines while preparing your revision:

*Published Peer Review*

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:

https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

*PLOS Data Policy*

Please note that as a condition of publication PLOS' data policy (http://journals.plos.org/plosbiology/s/data-availability) requires that you make available all data used to draw the conclusions arrived at in your manuscript. If you have not already done so, you must include any data used in your manuscript either in appropriate repositories, within the body of the manuscript, or as supporting information (N.B. this includes any numerical values that were used to generate graphs, histograms etc.). For an example see here: http://www.plosbiology.org/article/info%3Adoi%2F10.1371%2Fjournal.pbio.1001908#s5

*Blot and Gel Data Policy*

We require the original, uncropped and minimally adjusted images supporting all blot and gel results reported in an article's figures or Supporting Information files. We will require these files before a manuscript can be accepted so please prepare them now, if you have not already uploaded them. Please carefully read our guidelines for how to prepare and upload this data: https://journals.plos.org/plosbiology/s/figures#loc-blot-and-gel-reporting-requirements

*Protocols deposition*

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Thank you again for your submission to our journal. We hope that our editorial process has been constructive thus far, and we welcome your feedback at any time. Please don't hesitate to contact us if you have any questions or comments.

Sincerely,

Roli Roberts

Roland Roberts

Senior Editor

PLOS Biology

rroberts@plos.org

*****************************************************

REVIEWERS' COMMENTS:

Reviewer #1:

[identifies himself as Yannis Nevers]

The authors present OrthoSNAP, a method that automatically extracts marker genes (a set of orthologous sequences with representatives from most species in a given dataset) from a set of gene families and their corresponding gene trees.

In contrast to more traditional methods, it does not focus only on single-copy orthologous groups (SC-OGs) but also extracts orthologous groups from gene families in which duplications are present (termed SNAP-OGs). Using SNAP-OGs in addition to SC-OGs increases the number of gene markers one can extract from a set of species for later use in phylogenomic analyses.

The authors show, by comparing several descriptors, that the phylogenetic signal in SNAP-OGs is not significantly different from that in SC-OGs. They also show, in a specific case, that using SNAP-OGs to infer the species tree at a difficult-to-resolve taxonomic node yields results similar to those obtained with SC-OGs, suggesting that SNAP-OGs could be used in the same way as SC-OGs. The analyses are convincing and, to my knowledge, among the first confirmations that SNAP-OGs can be used in the same capacity as SC-OGs in phylogenomic analysis.

The manuscript is well organized and clearly written. My review below is divided into sections relative to different aspects of the work, and concludes with my recommendations.

NOVELTY:

OrthoSNAP is based on a tree splitting and pruning strategy, used to isolate OGs nested within gene families that underwent ancient duplications and to keep only one copy from recently duplicated, species-specific inparalogs. In general, the aim is to extract a set of orthologous sequences with only one representative from most of the species of interest, even in the presence of duplication events.

As the authors discuss in the introduction, this is not per se new in concept. In particular, methods like PhyloTreepruner and Agalma have similar aims and can at least be used to handle species-specific inparalogs.

The authors discuss this and claim their method handles more types of paralogy than previously existing methods, and to my understanding, can extract more OGs from a set of gene families with duplication events than these methods could (presumably because of the combination of tree splitting and pruning that is implemented in OrthoSNAP). Still, it is unclear from the current discussion exactly how they differ, and in what concrete case OrthoSNAP yields more orthologous groups than these methods.

Additionally, two other published works appear to implement similar strategies to extract OGs from gene families with duplications: DISCO (Willson et al, 2021) [doi: 10.1093/sysbio/syab070] and the different ortholog sampling strategies proposed in Yang and Smith, 2014 [doi: 10.1093/molbev/msu245] (in particular the Maximum inclusion strategy). These works are cited in the manuscript, but there is no discussion of the strategies they use for what looks to be a similar task.

Finally, I note that SNAP-OGs provided by OrthoSNAP seem to be conceptually similar to OMA Groups provided by the OMA orthology inference method (see Identifying orthologs with OMA: A primer, Zahn-Zabal et al, doi: 10.12688/f1000research.21508.1). Do the authors believe this is the case? If so, it may be worth mentioning.

While I believe these aspects could be discussed a bit more (see recommendations), OrthoSNAP does seem to implement a strategy different from existing methods, and the fact that it is implemented as a user-friendly command line tool with few input constraints is valuable.

The assessment of the difference between SC-OGs and SNAP-OGs is new and has interesting implications for the use of SNAP-OGs in phylogenomic studies. It reaches a conclusion similar to that of another independent but somewhat similar preprinted study. This is also addressed by the authors in the discussion section.

SOFTWARE ACCESSIBILITY, EASE OF USE AND REPRODUCIBILITY

The code is available on GitHub, and is well documented. The documentation provided an easy and quick way to install the software and was sufficient for me to successfully run it on a test dataset.

The datasets used in the course of the work are freely available on figshare, and the protocol is sufficiently described in the methods section.

RECOMMENDATION:

I believe the work would be of interest for the community, in good part because it provides easy to use software for the extraction of marker genes from phylogenomic datasets, even in presence of duplication. I thus recommend its publication without additional analyses needed.

However, given the similarity of the work with previously available software, I'd like to see a more detailed discussion on how OrthoSNAP differs from previous works. In particular:

-The authors indicate that OrthoSNAP is different from two other tree-pruning protocols. It would be useful to give a few example cases in which it is expected to provide different results.

-I would also be interested in a comparison with other published strategies for what seems to be a similar task: in particular the Maximum inclusion strategy from Yang and Smith, 2014, and the methods used by DISCO. (See the Novelty section above.)

Minor remark:

-The authors use the word "taxon" to refer only to the individual species in their datasets. It causes a bit of confusion because usually any grouping of multiple species sharing a common ancestor may be considered a taxon. I suggest the authors disambiguate the term.

Reviewer #2:

The authors present here orthoSNAP, a program designed to obtain orthogroups from gene trees that include duplications, with the objective of obtaining more marker genes and therefore increasing the number of genes available for numerous evolutionary studies. As stated by the authors, this has been done before in multiple ways, and the improvement given by orthoSNAP is more focused on usability by non-expert users than on creating an innovative algorithm. They then test orthoSNAP on different sets of species: budding yeasts, filamentous fungi and a set of mammals. They show that, using orthoSNAP, they can increase the number of orthogroups significantly, depending on the dataset, and that those orthogroups do not differ in their general phylogenetic characteristics, making them suitable for downstream analyses.

I have concerns that the orthoSNAP algorithm is a bit too simple and will not hold up in more complex scenarios (note also my comments on the datasets below). A very obvious scenario in which I think orthoSNAP will fail is when duplications are concentrated at a given node. For instance, imagine that more than 50% of the species in the budding yeast dataset had undergone the WGD that happened in the Saccharomyces cerevisiae lineage and the rest were pre-WGD species. As I understand it, the algorithm would look for clades that include more than 50% of the species set and contain no duplications after removing species-specific duplicates. In the dataset I propose, you would have numerous orthogroups that include only post-WGD species and do not include the pre-WGD orthologs, despite these existing, being present in the tree and being easily detectable. I wonder if the taxon distribution between SC-OGs and SNAP-OGs would still be the same and how the other metrics would be affected. The same problem extends to any tree analysed: orthologs that diverged before a duplication large enough to meet the taxon occupancy threshold will not be included in orthogroups.
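The scenario can be made concrete with a toy sketch. The code below is not OrthoSNAP's implementation; it encodes only the reviewer's stated understanding of the clade search, and the nested-tuple tree format, the `species|gene` leaf labels, the function names and the use of `>=` for the occupancy threshold are all assumptions made for illustration.

```python
# Toy sketch of the clade search as Reviewer #2 describes it (not OrthoSNAP's code).
# Trees are nested tuples; leaves are hypothetical "species|gene" strings.

def leaves(node):
    """Return all leaf labels under a node."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def copy_counts(node):
    """Count copies per species; a clade made of a single species counts as one copy."""
    species = {leaf.split("|")[0] for leaf in leaves(node)}
    if len(species) == 1:
        return {species.pop(): 1}
    counts = {}
    for child in node:
        for sp, n in copy_counts(child).items():
            counts[sp] = counts.get(sp, 0) + n
    return counts


def candidate_clades(node, n_species, occupancy=0.5):
    """Return species sets of maximal single-copy clades that meet the occupancy threshold."""
    counts = copy_counts(node)
    if max(counts.values()) == 1 and len(counts) >= occupancy * n_species:
        return [sorted(counts)]
    if isinstance(node, str):
        return []
    return [c for child in node for c in candidate_clades(child, n_species, occupancy)]


# A and B are pre-WGD species; C, D and E (60% of the taxa) share a duplication.
tree = (
    ("A|g", "B|g"),
    (("C|g1", ("D|g1", "E|g1")), ("C|g2", ("D|g2", "E|g2"))),
)
groups = candidate_clades(tree, n_species=5)
print(groups)  # [['C', 'D', 'E'], ['C', 'D', 'E']]
```

Under these assumptions, both post-duplication clades pass the threshold, while the pre-WGD genes from A and B end up in neither group, which is the outcome the reviewer anticipates.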

Note that in the introduction the authors used plants as an example of a complicated dataset where tools such as orthoSNAP would be of great advantage, yet the authors used two fungal datasets for which they already have a significant amount of single copy families. This is great for showing whether there are differences between SC-OGs and SNAP-OGs, but they do not offer a true challenge for the program. The mammal dataset is more meaningful in terms of what orthoSNAP can offer and I think there should be more emphasis put on this.

Regarding the mammal dataset, the improvement is notable and certainly useful, but I would like more context on why, of the 17,407 multi-copy families, orthoSNAP retrieves only 1,775.

Regarding the comparison between SC-OGs and SNAP-OGs, I understand that the underlying point of the analysis is to show that orthogroups built from single-copy genes and from SNAP give similar results, and that this is so because orthoSNAP successfully captures only orthologs and does not include unwanted paralogs. For that point to be meaningful, you should also show that randomly including paralogs would be detrimental to the metrics you are using and would therefore give a different signal in these specific datasets. Without this, one could argue that the results will always be the same, no matter how well or poorly SNAP performs. It would also be worth checking whether mixing different kinds of paralogs has different effects.
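One way such a negative control could be assembled is sketched below. This is a hypothetical illustration, not an analysis from the manuscript: the family contents, gene identifiers and the `random_copy_group` helper are invented, and the sketch covers only the sampling step, not the downstream metrics.

```python
import random


def random_copy_group(family, rng):
    """Pick one gene copy per species uniformly at random, ignoring the gene tree."""
    return {sp: rng.choice(copies) for sp, copies in sorted(family.items())}


# Hypothetical multi-copy family: species -> gene identifiers.
family = {
    "sp1": ["sp1_a", "sp1_b"],
    "sp2": ["sp2_a"],
    "sp3": ["sp3_a", "sp3_b", "sp3_c"],
}

rng = random.Random(0)  # fixed seed so the control set is reproducible
controls = [random_copy_group(family, rng) for _ in range(3)]
for group in controls:
    print(group)
```

Each control group has one sequence per species, as an SC-OG does, but may mix paralogs. Comparing the same metrics on such groups would show whether those metrics are sensitive to paralog contamination at all.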

Minor comments:

1.- When there is a species-specific duplication, orthoSNAP keeps only the gene with the longest sequence, a choice justified by practice with transcriptomic data. I would be interested in reading why the authors did not choose to keep the paralog with the shortest branch, as that would be more conservative for some of the downstream analyses.

2.- In the results section, I miss plain numbers showing the real improvement given by orthoSNAP. It is given as a percentage with a pointer to a supplementary table, but I think the number of orthogroups retrieved for each dataset should be stated there.

3.- It would also be interesting to have more of an idea from the start which species are used in the analysis. Filamentous fungi is not the same as saying a group of Aspergillus and Penicillium genomes that are very closely related.

4.- I wonder how orthoSNAP would perform without the pre-filter of OrthoFinder. OrthoFinder already puts genes into orthologous families, but one could build a tree based on BLAST results and just run orthoSNAP on that. Would the results be comparable?

Reviewer #3:

[identify themselves as Giulio Formenti and Erich Jarvis]

General comments:

Steenwyk et al. present orthoSNAP, a tree splitting and pruning algorithm for retrieving single-copy orthologs from gene family trees. The tool is aimed at addressing the important issue in comparative genomics of identifying unique orthologous genes for evolutionary inference.

The logic behind the tool is clear, well-explained, and well-detailed. The tool is well-documented, and we were easily able to install it (on Mac) and run it using the test data set and the tutorial provided. The authors also use orthoSNAP to explore two contentious alternative hypotheses of deep evolutionary relationships among placental mammals. The good news is that from 46,645 orthologous groups of genes, their method identified 6,400 more single-copy orthologs in addition to the 6,600 identified by another method. The worrisome news is: what about the many other putative orthologs among the ~46,000? Perhaps we are missing something here?

Besides this, we have three main concerns, one regarding the simplicity of a key step of the pipeline, another regarding statistics, and a third regarding complementary evidence, such as synteny.

With respect to simplicity:

At the pruning step, the paralog to be retained is always the longest. The justification given for this choice is that it is common practice in transcriptomics. I can see why this is the case in transcriptomics, e.g. to increase mappability to the genome. However, in the alignment of orthologous genes, choosing the longest transcript may not be the best choice and could instead bias the results, for example through a particular spliced exon missing from some of the longest transcripts or through additional erroneously annotated codons. I wonder if this very simple assumption could undermine some of the results. Can a complementary or alternative method that uses all transcripts be envisaged and introduced?
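The difference between the longest-sequence rule and the shortest-branch alternative raised by Reviewer #2 can be illustrated with a minimal sketch. The `Copy` record, its field names and all values below are hypothetical and do not come from OrthoSNAP or from the datasets in the manuscript.

```python
from typing import NamedTuple


class Copy(NamedTuple):
    gene_id: str
    seq_length: int       # sequence length in residues
    branch_length: float  # terminal branch length in the gene tree


def keep_longest(copies):
    """Rule described in the manuscript: retain the longest sequence."""
    return max(copies, key=lambda c: c.seq_length)


def keep_shortest_branch(copies):
    """Alternative rule: retain the copy on the shortest terminal branch."""
    return min(copies, key=lambda c: c.branch_length)


# Hypothetical species-specific inparalogs (or transcripts) of one species.
inparalogs = [
    Copy("sp1_t1", 412, 0.031),
    Copy("sp1_t2", 455, 0.120),  # longest sequence, but on a long branch
    Copy("sp1_t3", 398, 0.018),
]

print(keep_longest(inparalogs).gene_id)          # sp1_t2
print(keep_shortest_branch(inparalogs).gene_id)  # sp1_t3
```

When the two rules disagree, as in this invented example, the choice determines which sequence enters the alignment, so reporting how often they disagree in real data would indicate how much the rule matters.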

With respect to statistics:

Throughout the manuscript, the absence of statistical significance is generally given as evidence to conclude that identified SNAP-OGs behave similarly to SC-OGs. Since the absence of significance can be due to no real difference or not enough power, these results and their interpretation should be phrased more carefully. Additionally, to some degree, one would expect genes with high numbers of paralogs to behave somewhat differently from truly single-copy orthologs; there seems to be some evidence that this is happening, which would be interesting to characterize since, as the authors put it, "the phylogenetic information content of these gene families remains unknown". In particular:

- It seems that the non-significant results in the multivariate analyses may be partially explained by the high dispersion of the data, including in at least some cases the presence of many outliers (e.g. alignment length, parsimony informative sites, RCV and treeness show higher SD in mammalian SNAP-OGs; Fig S9). Are there at least some genes that behave differently? Can the authors comment on that? For instance, in budding yeast it seems that tree certainty is consistently lower in SNAP-OGs vs SC-OGs (ST6). Can the authors further describe ST6 and comment on this in the main text?

- In Fig 4B, can the authors explain why they added a third hypothesis when testing the evolutionary relationships in placental mammals ("Xenarthra as sister to all other Eutheria represented in yellow")? There is a discrepancy in the main text, as initially only two hypotheses are mentioned, but then a third is introduced and used for only one of the analyses. A third hypothesis will obviously reduce the statistical significance of the results after correction for multiple testing, potentially making results non-significant, which seems to be the case here. Also, why was the alpha set at 0.01 instead of the usual 0.05? Finally, do you think that statistical tests more sophisticated than Fisher's, using GSF values, would be able to reveal differences?

- In Fig 4C, the outliers collapse most of the central points. However, the two distributions do look different. Aside from the absence of significant differences in GLS, which could be the result of the great dispersion of the data, could a test that compares the shape of the distribution reveal a significant difference? Also, could Fig 4C be made bigger to better highlight the distribution? Maybe a transformation could also be applied to the y axis to better highlight the data.
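One candidate for a test that compares the shape of two distributions is the two-sample Kolmogorov-Smirnov test, and one candidate y-axis transformation is a signed logarithm. The sketch below computes only the KS statistic (the largest gap between the two empirical cumulative distribution functions) and the transform. The function names and the toy values are assumptions for illustration and are not taken from the manuscript or its data.

```python
import math
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the maximum absolute difference between the empirical CDFs of
    the two samples. Both ECDFs are step functions that change only at
    observed values, so checking every observed value is sufficient.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    for value in a + b:
        cdf_a = bisect_right(a, value) / len(a)
        cdf_b = bisect_right(b, value) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


def signed_log10(value):
    """Compress large magnitudes while keeping the sign and mapping 0 to 0."""
    return math.copysign(math.log10(1.0 + abs(value)), value)


# Toy values only: identical samples give 0, fully separated samples give 1.
print(ks_statistic([1, 2, 3], [1, 2, 3]))  # 0.0
print(ks_statistic([1, 2, 3], [4, 5, 6]))  # 1.0
```

A p-value would still have to be derived from the statistic, for example by permutation, before any claim about significance could be made.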

With respect to synteny:

Because of different rates of difference, sometimes it is difficult to identify orthologs versus paralogs based on sequence identity and phylogeny alone. Synteny is a third piece of evidence that helps identify orthology. A good recent example is the oxytocin receptor and ligand family of genes (Theofanopoulou et al 2021 Nature). Have the authors utilized synteny? If not, it would be good to incorporate it. At a minimum, the authors should discuss this issue.

Specific comments:

Abstract:

- consider dropping "selection", as negative selection also needs SC-OGs

- consider dropping "in contrast", since there is no contrast with the previous sentence

- consider spelling "SNAP" in the software name, also to provide clues on its mode of operation, rather than providing a definition of SNAP-OGs directly in the abstract.

Figure 1. Given that one of the claims the authors make in the discussion is that orthoSNAP, as opposed to other tools, can distinguish multiple paralog classes, a more complete explanation of such classes would greatly help the reader. It would then be probably more useful to provide a cartoon tailored to the classification made by orthoSNAP, rather than a general framework. This figure could then be part of the results; or combined with the current figure 1.

Last sentence of first paragraph of the introduction: The availability of more genomes actually makes phylogenetic inference, and thus orthology of genes, easier rather than more complex or difficult. Could this sentence be rephrased to account for this?

End of second paragraph of introduction: Rather than broad, I'd say that the past algorithms were not designed to retrieve homologs in large gene families.

The final paragraph of the introduction should be part of the results, and a brief statement of the results (1-3 sentences) should be what is in the introduction.

Results, first paragraph. Do the authors have a suggested reason as to why there is variation in the results depending on lineage? Is it a technical or a biological reason?

"Similar to taxon occupancy…." since a 50% cutoff was applied by design to both SC-OGs and SNAP-OGs, isn't this expected?

Second to last sentence of same paragraph: In terms of congruence, in contradiction to this, in budding yeast it seems that tree certainty is consistently lower in SNAP-OGs vs SC-OGs (ST6).

"Resolution of this conflict": It does not look like this analysis is aimed at resolving said conflict on phylogeny, but rather at demonstrating no difference between SNAP-OGs and SC-OGs results. Please rephrase.

"More specifically, no differences…" this value is close to a 0.05 significance, suggesting that collapsing branches with low support values starts to highlight differences between SC-OGs and SNAP-OGs. I think these differences could be valuable to understand the nature of SNAP-OGs and deserve further analysis. Why use Benjamini-Hochberg correction?
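For readers unfamiliar with the Benjamini-Hochberg procedure mentioned here, the sketch below shows how adjusted p-values are computed and how adding a third test changes them. The p-values are invented for illustration and are not taken from the manuscript; the function name is also an assumption.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Each p-value is scaled by m / rank (rank 1 = smallest p-value), and a
    running minimum from the largest rank downwards keeps the adjusted
    values monotone.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Invented p-values: two hypotheses, then the same two plus a third.
print(benjamini_hochberg([0.01, 0.04]))
print(benjamini_hochberg([0.01, 0.04, 0.03]))
```

In this toy case the smallest adjusted p-value rises from about 0.02 to about 0.03 once the third test is included, which is the kind of effect the reviewers raise regarding the third hypothesis in Fig 4B.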

Last paragraph of results: Why is this striking? One could expect that in difficult-to-resolve branches the noise would be even higher, reducing our power to detect significant differences

Discussion. First 1.5 paragraphs are repetitive of the introduction. They can be deleted.

In certain datasets, .. This only refers to mammals. I would therefore specify

Can the authors suggest a possible interpretation of why SNAP-OGs were five times more prevalent than SC-OGs?

Qualitatively similar: Can the authors clarify what qualitatively similar means?

We also note that our algorithm: This is interesting, however it is the first time outparalogs, inparalogs, etc are mentioned (besides figure 1 legend). It does not appear to be mentioned in Methods either. More details would be useful.

Three noteworthy differences: these differences are not in the results but in the approach, please clarify.

Methods, Dataset generation: This is the first time that it is made clear that the analysis is conducted on protein sequences rather than on DNA sequences (which is implicit when talking about genes). This is an important point and should be made immediately clear, first time in the abstract.

Supplementary figures:

- Fig S1 Information content is minimal and fully overlaps with the text, therefore it doesn't seem a necessary figure.

- Fig S2 Can scales be reported directly in axis ticks rather than in axis titles?

- Fig S7 Similar to Fig S1, the content is minimal and fully overlaps with the main text, therefore it doesn't seem a necessary figure.

Decision Letter 2

Roland G Roberts

9 Sep 2022

Dear Dr Rokas,

Thank you for your patience while we considered your revised manuscript "OrthoSNAP: a tree splitting and pruning algorithm for retrieving single-copy orthologs from gene family trees" for publication as a Methods and Resources at PLOS Biology. This revised version of your manuscript has been evaluated by the PLOS Biology editors, the Academic Editor, and two of the original reviewers.

Based on the reviews, we are likely to accept this manuscript for publication, provided you satisfactorily address the remaining points raised by the reviewers. Please also make sure to address the following data and other policy-related requests.

IMPORTANT: Please attend to the following:

a) Please address the remaining minor requests from the two reviewers.

b) Please address my Data Policy requests below; specifically, we need you to supply the numerical values underlying Figs 3ABC, 4BC, S1ABCD, S2, S3ABC, S4ABC, S5ABC, S6, S7, S8B, either as a supplementary data file or as a permanent DOI’d deposition, e.g. part of your Figshare depo (my understanding from the Figshare DOI is that this contains the raw input data, rather than the data presented in these Figure panels; please clarify).

c) Please cite the location of the data clearly in all relevant main and supplementary Figure legends, e.g. “The data underlying this Figure can be found in S1 Data” or “The data underlying this Figure can be found in https://figshareXXXXX”

As you address these items, please take this last chance to review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the cover letter that accompanies your revised manuscript.

We expect to receive your revised manuscript within two weeks.

To submit your revision, please go to https://www.editorialmanager.com/pbiology/ and log in as an Author. Click the link labelled 'Submissions Needing Revision' to find your submission record. Your revised submission must include the following:

- a cover letter that should detail your responses to any editorial requests, if applicable, and whether changes have been made to the reference list

- a Response to Reviewers file that provides a detailed response to the reviewers' comments (if applicable)

- a track-changes file indicating any changes that you have made to the manuscript.

NOTE: If Supporting Information files are included with your article, note that these are not copyedited and will be published as they are submitted. Please ensure that these files are legible and of high quality (at least 300 dpi) in an easily accessible file format. For this reason, please be aware that any references listed in an SI file will not be indexed. For more information, see our Supporting Information guidelines:

https://journals.plos.org/plosbiology/s/supporting-information

*Published Peer Review History*

Please note that you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out. Please see here for more details:

https://blogs.plos.org/plos/2019/05/plos-journals-now-open-for-published-peer-review/

*Press*

Should you, your institution's press office or the journal office choose to press release your paper, please ensure you have opted out of Early Article Posting on the submission form. We ask that you notify us as soon as possible if you or your institution is planning to press release the article.

*Protocols deposition*

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols

Please do not hesitate to contact me should you have any questions.

Sincerely,

Roli Roberts

Roland Roberts, PhD

Senior Editor,

rroberts@plos.org,

PLOS Biology

------------------------------------------------------------------------

DATA POLICY:

You may be aware of the PLOS Data Policy, which requires that all data be made available without restriction: http://journals.plos.org/plosbiology/s/data-availability. For more information, please also see this editorial: http://dx.doi.org/10.1371/journal.pbio.1001797

Note that we do not require all raw data. Rather, we ask that all individual quantitative observations that underlie the data summarized in the figures and results of your paper be made available in one of the following forms:

1) Supplementary files (e.g., excel). Please ensure that all data files are uploaded as 'Supporting Information' and are invariably referred to (in the manuscript, figure legends, and the Description field when uploading your files) using the following format verbatim: S1 Data, S2 Data, etc. Multiple panels of a single or even several figures can be included as multiple sheets in one excel file that is saved using exactly the following convention: S1_Data.xlsx (using an underscore).

2) Deposition in a publicly available repository. Please also provide the accession code or a reviewer link so that we may view your data before publication.

Regardless of the method selected, please ensure that you provide the individual numerical values that underlie the summary data displayed in the following figure panels as they are essential for readers to assess your analysis and to reproduce it: Figs 3ABC, 4BC, S1ABCD, S2, S3ABC, S4ABC, S5ABC, S6, S7, S8B. NOTE: the numerical data provided should include all replicates AND the way in which the plotted mean and errors were derived (it should not present only the mean/average values).

IMPORTANT: Please also ensure that figure legends in your manuscript include information on where the underlying data can be found, and ensure your supplemental data file/s has a legend.

Please ensure that your Data Statement in the submission system accurately describes where your data can be found.

------------------------------------------------------------------------

DATA NOT SHOWN?

- Please note that per journal policy, we do not allow the mention of "data not shown", "personal communication", "manuscript in preparation" or other references to data that is not publicly available or contained within this manuscript. Please either remove mention of these data or provide figures presenting the results and the data underlying the figure(s).

------------------------------------------------------------------------

REVIEWERS' COMMENTS:

Reviewer #2:

The authors have submitted a very thorough revision of their manuscript. I think it now covers what was required from my side. Just two minor comments:

In the sentence "30 plants, which are known to complex histories of gene duplication and loss", "known to" should be "known to have".

They mention: "In comparison, 15 SC-OGs were identified in the plant dataset; 2,782 in the budding yeast dataset; and 390 in the choanoflagellate dataset". There are more SC-OGs than SNAP-OGs in the yeast dataset; could that be a typo? If not, what is the explanation for this?

Reviewer #3:

[identifies himself as Erich Jarvis]

The authors were very responsive to the reviews, and have made substantial improvements to the manuscript. This includes performing many additional analyses. All of our main concerns have been satisfied. Just two clarifications needed.

The authors stated that OMA is sequence based, whereas OrthoSNAP is phylogeny based. But isn't the phylogeny of the genes based on sequence alignments?

The explanation of lineage sorting for not identifying all one-to-one orthologs makes sense. The authors can cite some papers where they show a high proportion of genes (5-30%) that can have incomplete lineage sorting between species (e.g. Jarvis et al 2014 Science).

Decision Letter 3

Roland G Roberts

13 Sep 2022

Dear Dr Rokas,

Thank you for the submission of your revised Methods and Resources "OrthoSNAP: a tree splitting and pruning algorithm for retrieving single-copy orthologs from gene family trees" for publication in PLOS Biology. On behalf of my colleagues and the Academic Editor, Andreas Hejnol, I'm pleased to say that we can in principle accept your manuscript for publication, provided you address any remaining formatting and reporting issues. These will be detailed in an email you should receive within 2-3 business days from our colleagues in the journal operations team; no action is required from you until then. Please note that we will not be able to formally accept your manuscript and schedule it for publication until you have completed any requested changes.

Please take a minute to log into Editorial Manager at http://www.editorialmanager.com/pbiology/, click the "Update My Information" link at the top of the page, and update your user information to ensure an efficient production process.

PRESS: We frequently collaborate with press offices. If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximise its impact. If the press office is planning to promote your findings, we would be grateful if they could coordinate with biologypress@plos.org. If you have previously opted in to the early version process, we ask that you notify us immediately of any press plans so that we may opt out on your behalf.

We also ask that you take this opportunity to read our Embargo Policy regarding the discussion, promotion and media coverage of work that is yet to be published by PLOS. As your manuscript is not yet published, it is bound by the conditions of our Embargo Policy. Please be aware that this policy is in place both to ensure that any press coverage of your article is fully substantiated and to provide a direct link between such coverage and the published work. For full details of our Embargo Policy, please visit http://www.plos.org/about/media-inquiries/embargo-policy/.

Thank you again for choosing PLOS Biology for publication and supporting Open Access publishing. We look forward to publishing your study. 

Sincerely, 

Roli Roberts

Roland G Roberts, PhD, PhD

Senior Editor

PLOS Biology

rroberts@plos.org

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Fig. Numbers of orthogroups, single-copy orthogroups, orthogroups with 1 or more homologs in 1 species, and the number of SNAP-OGs identified for each dataset.

    (A) The total number of orthogroups with at least 50% ortholog occupancy for each dataset. (B) The number of single-copy orthologs (SC-OGs) for each dataset (with at least 50% taxon occupancy). (C) The number of multicopy orthologs (or orthologous groups of genes wherein 1 or more species is represented by 2 or more sequences; MC-OGs) for each dataset (with at least 50% taxon occupancy). (D) The number of SNAP-OGs identified in each dataset (with at least 50% taxon occupancy). Note that the numbers depicted in panel A reflect the sum of the numbers of SC-OGs and MC-OGs in panels B and C. The data underlying this figure can be found in figshare (doi: 10.6084/m9.figshare.16875904).

    (TIF)

    S2 Fig. The number of SNAP-OGs identified in orthologous groups of genes with 2 or more homologs in 1 or more species.

    The number of SNAP-OGs per orthologous group of genes is depicted on the x-axis. For example, in the budding yeasts dataset, 977 gene families had 1 SNAP-OG each. The highest number of SNAP-OGs identified in a single orthologous group of genes in each dataset were as follows: in budding yeasts, 5 SNAP-OGs were identified in 1 orthologous group of genes that encode transcriptional activators; in filamentous fungi, 5 SNAP-OGs were identified in each of 2 orthologous groups of genes that encode multifacilitator superfamily transporters and amino acid permeases; and in mammals, 4 SNAP-OGs were identified in each of 3 orthologous groups of genes that encode voltage-gated potassium channels, casein kinases, and a tropomyosin family of actin-binding proteins. The data underlying this figure can be found in figshare (doi: 10.6084/m9.figshare.16875904).

    (TIF)

    S3 Fig. The 10 most frequent best-fitting substitutions models are similar between SC-OGs and SNAP-OGs.

    The top 10 most frequently observed best-fitting substitutions models were similar between SC-OGs and SNAP-OGs among (A) 1,668 SC-OGs and 1,392 SNAP-OGs in budding yeasts, (B) 4,393 SC-OGs and 2,035 SNAP-OGs in filamentous fungi, and (C) 321 SC-OGs and 1,775 SNAP-OGs in mammals. For example, the LG+F+I+G4 model was the most frequently observed best-fitting substitution model in SC-OGs and SNAP-OGs from budding yeasts. The data underlying this figure can be found in figshare (doi: 10.6084/m9.figshare.16875904).

    (TIF)

    S4 Fig. Distributions of information content among SNAP-OGs and SC-OGs.

    Boxplot and violin plot distributions of 9 properties representative of phylogenetic information are depicted for SNAP-OGs (blue) and SC-OGs (orange) in the (A) 1,668 SC-OGs and 1,392 SNAP-OGs in budding yeasts, (B) 4,393 SC-OGs and 2,035 SNAP-OGs in filamentous fungi, and (C) 321 SC-OGs and 1,775 SNAP-OGs in mammals. Abbreviations are as follows: average bootstrap support (ABS), degree of violation of the molecular clock (DVMC), relative composition variability (RCV), Robinson-Foulds distance (RF distance), alignment length (Aln. len.), the number of parsimony informative sites (PI sites), saturation, treeness (tness), and treeness/RCV (tness/RCV). The data underlying this figure can be found in figshare (doi: 10.6084/m9.figshare.16875904).

    (TIF)

    S5 Fig. Quality of representation and contributions of properties of phylogenetic information content during principal component analysis.

    Principal component analysis was used to qualitatively compare the similarities and differences between SNAP-OGs and SC-OGs (Fig 3). The leftmost figure in each panel of budding yeasts (A), filamentous fungi (B), and mammals (C) represents the quality of representation for each property across all principal components. The next 2 figures depict the contribution of each property (or variable) to the first and second dimension in reduced dimensional space. The red dashed line represents equal contributions from each variable. The data underlying this figure can be found in figshare (doi: 10.6084/m9.figshare.16875904).

    (TIF)

    S6 Fig. The number of SNAP-OGs identified in an orthologous group of genes with 2 or more homologs in 1 or more species for the dataset used to examine a contentious branch in the tree of life.

    The number of SNAP-OGs per orthologous group of genes is depicted on the x-axis. For example, a single SNAP-OG was identified in 1,330 gene families with 2 or more homologs in 1 or more species, whereas 4 SNAP-OGs were identified in 2 gene families with 2 or more homologs in 1 or more species. The data underlying this figure can be found in figshare (doi: 10.6084/m9.figshare.16875904).

    (TIF)

    S7 Fig. The 10 most frequently observed best-fitting substitutions models are similar between SC-OGs and SNAP-OGs in the dataset used to examine a contentious branch in the tree of life.

    Similar best-fitting substitutions models were observed between 252 SC-OGs and 1,428 SNAP-OGs in a dataset of mammals, which was used to investigate patterns of support in a contentious branch in the tree of life concerning deep evolutionary relationships among placental mammals. The data underlying this figure can be found in figshare (doi: 10.6084/m9.figshare.16875904).

    (TIF)

    S8 Fig. Cartoon comparison of different tree decomposition algorithms.

    Using the phylogeny presented in Fig 1B (panel A) and Fig 2B (panel B), different tree decomposition algorithms are compared. (A) OrthoSNAP will identify 4 SNAP-OGs, whereas DISCO and the maximally inclusive strategies will each identify 3 subgroups of orthologous genes. PhyloTreePruner will not identify any subgroups of single-copy orthologous genes. (B) OrthoSNAP will identify 5 subgroups of single-copy orthologous genes (light blue) by identifying maximally inclusive subgroups—subtrees where each taxon is represented by a single sequence—and maximally inclusive subgroups after species-specific inparalog trimming (species-specific inparalogs are shown in orange). In contrast, DISCO and maximally inclusive strategies will identify 3 SC-OGs, in part, because they do not account for species-specific inparalogs. PhyloTreePruner, which only prunes species-specific inparalogs, will not identify any subgroups of single-copy orthologous genes due to the presence of more ancient duplication events.

    (TIF)
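The "maximally inclusive subgroup" idea described in the S8 Fig legend (subtrees where each taxon is represented by a single sequence) can be sketched on a toy rooted tree. This is a simplified illustration under stated assumptions, not the OrthoSNAP algorithm: the nested-tuple tree encoding, the `species|gene` tip labels, the function names, and the `min_taxa` parameter are hypothetical, and species-specific inparalog trimming is deliberately left out.

```python
def leaf_labels(tree):
    """Collect tip labels from a nested-tuple tree; a tip is a plain string."""
    if isinstance(tree, str):
        return [tree]
    labels = []
    for child in tree:
        labels.extend(leaf_labels(child))
    return labels


def single_copy_subtrees(tree, min_taxa=2):
    """Return the largest subtrees in which no species appears twice.

    Walks down from the root and stops at the first subtree whose tips all
    come from different species, so each reported group is maximal.
    Groups with fewer than min_taxa species are discarded.
    """
    tips = leaf_labels(tree)
    taxa = [tip.split("|")[0] for tip in tips]
    if len(set(taxa)) == len(taxa):
        return [sorted(tips)] if len(taxa) >= min_taxa else []
    groups = []
    for child in tree:
        groups.extend(single_copy_subtrees(child, min_taxa))
    return groups


# Toy gene family with one ancient duplication shared by three species.
family = (
    (("human|g1", "mouse|g1"), "cow|g1"),
    (("human|g2", "mouse|g2"), "cow|g2"),
)
for group in single_copy_subtrees(family):
    print(group)
```

On this toy tree the root contains every species twice, so the search descends and reports the two three-species subtrees on either side of the duplication.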

    S1 Table. Species and accession numbers for proteomes used in each dataset.

    This table details the species used for the budding yeasts, filamentous fungi, and mammalian datasets. All proteomes from budding yeasts were downloaded from Shen and colleagues [51]. Proteomes from filamentous fungi and mammals were downloaded from NCBI, and their accessions and assembly names are provided.

    (XLSX)

    S2 Table. Number of orthogroups examined.

    A table of the number of orthogroups, the number of SC-OGs, the number of gene families with orthologs and paralogs (MC-OGs), and the number of SNAP-OGs examined in the present study.

    (XLSX)

    S3 Table. Ortholog occupancy for each dataset.

    A table summarizing the average and standard deviation of taxon completeness in SC-OGs and SNAP-OGs.

    (XLSX)

    S4 Table. Nine properties of phylogenetic information content.

    Phylogenetic information content of SC-OGs and SNAP-OGs was examined using the 9 properties described here. The abbreviation, description, additional notes, and function in PhyKIT used to calculate each property are listed here.

    (XLSX)

    S5 Table. Multifactor analysis of variance results reveals no substantial differences between SC-OGs and SNAP-OGs.

    Degrees of freedom, sum of squares, mean square, F-value, and p-value for multifactorial analysis of variance are shown here. Multifactorial analysis of variance was conducted both accounting for potential interaction effects and using an additive model, which does not account for interaction effects.

    (XLSX)

    S6 Table. Tree certainty and tree certainty-all results.

    Examining tree certainty and tree certainty-all revealed similar levels of incongruence among gene trees inferred using SC-OGs and SNAP-OGs.

    (XLSX)

    S7 Table. Dataset for examining deep evolutionary relationships among eutherian mammals.

    The NCBI accession, assembly name, name in files, and ingroup/outgroup designations are detailed here for each proteome used.

    (XLSX)

    S8 Table. Number of orthogroups examined among eutherian mammals.

    A table of the number of orthogroups, the number of SC-OGs, the number of gene families with orthologs and paralogs (MC-OGs), and the number of SNAP-OGs examined among eutherian mammals.

    (XLSX)

    S9 Table. Gene support frequency results among ancient eutherian mammalian relationships.

    Gene support frequency results reveal similar levels of support between the 3 hypotheses concerning deep evolutionary divergences among mammals. Multitest corrected p-values are also shown here.

    (XLSX)

    S10 Table. Comparison between different algorithms that identify subgroups of orthologous genes or conduct species-specific inparalog trimming.

    Notably, OrthoSNAP provides the most user flexibility and handles the most use cases.

    (XLSX)

    Attachment

    Submitted filename: response_to_reviewers_orthosnap.docx

    Data Availability Statement

    All results and data presented in this study are available from figshare (doi: 10.6084/m9.figshare.16875904).


    Articles from PLOS Biology are provided here courtesy of PLOS
