Abstract
Animals and fungi have radically distinct morphologies, yet both evolved within the same eukaryotic supergroup: Opisthokonta1,2. Here we reconstructed the trajectory of genetic changes that accompanied the origin of Metazoa and Fungi since the divergence of Opisthokonta with a dataset that includes four novel genomes from crucial positions in the Opisthokonta phylogeny. We show that animals arose only after the accumulation of genes functionally important for their multicellularity, a tendency that began in the pre-metazoan ancestors and later accelerated in the metazoan root. By contrast, the pre-fungal ancestors experienced net losses of most functional categories, including those gained in the path to Metazoa. On a broad-scale functional level, fungal genomes contain a higher proportion of metabolic genes and diverged less from the last common ancestor of Opisthokonta than did the gene repertoires of Metazoa. Metazoa and Fungi also show differences regarding gene gain mechanisms. Gene fusions are more prevalent in Metazoa, whereas a larger fraction of gene gains were detected as horizontal gene transfers in Fungi and protists, in agreement with the long-standing idea that transfers would be less relevant in Metazoa due to germline isolation3–5. Together, our results indicate that animals and fungi evolved under two contrasting trajectories of genetic change that predated the origin of both groups. The gradual establishment of two clearly differentiated genomic contexts thus set the stage for the emergence of Metazoa and Fungi.
Subject terms: Molecular evolution, Microbiology, Computational biology and bioinformatics
Detailed ancestral gene content reconstruction shows that the large phenotypic differences between Metazoa and Fungi are the outcome of sharply contrasting trajectories of genomic changes that predated the origin of both groups.
Main
One of the most surprising early insights of molecular phylogenetics was the close evolutionary relationship between animals and fungi6, which was unexpected because of the enormous differences in their morphology, ecology, life history and behaviour. This relationship has stood the test of time, and now animals and fungi are members of Holozoa and Holomycota, respectively, which are the two major divisions of the eukaryotic supergroup Opisthokonta1. Pinpointing how animals and fungi evolved to be so different requires a detailed reconstruction of the evolutionary changes leading up to the two lineages. This demands not only genomic data from diverse animals and fungi but also from the protist opisthokont groups that branch between them (Fig. 1d), which are underrepresented in genomic databases7.
Four new genomes of protist opisthokonts
The closest known groups to Metazoa within Holozoa are Choanoflagellatea, Filasterea and Teretosporea (Fig. 1d). Within Holomycota, the closest known groups to Fungi (here defined as the least inclusive clade including Chytridiomycota and Blastocladiomycota based on the absence of phagotrophy in all the members of this clade8) are Opisthosporidia (a paraphyletic group9,10, which in our genomic dataset is represented by Rozella allomycis and Mitosporidium daphniae, hereafter the RM clade) and Nucleariidae (Fig. 1d). To improve the limited genome sampling for the protist opisthokont groups7, we sequenced, assembled and annotated the genomes of three filastereans (Ministeria vibrans11, Pigoraptor vietnamica12 and Pigoraptor chileana12) and one nucleariid (Parvularia atlantis13) from metagenomic data produced from cultures of these species (Supplementary Information 1). Given that Filasterea and Nucleariidae were each previously represented by only a single whole-genome-sequenced species, the four newly sequenced species represent a substantial increase in the diversity of genomic data available for the protist opisthokont groups (Fig. 1d). This can be expected to minimize the negative impact of poor taxon sampling in ancestral reconstructions (see an example of this issue in Extended Data Fig. 1a).
The four sequenced genomes present high completeness and contiguity metrics, which are in the range of those from the previously sequenced protist opisthokont species (Fig. 23 in Supplementary Information 1). With regard to genome size and gene content metrics, the sequenced species are not different from most unicellular eukaryotes and fungi (Extended Data Figs. 2 and 3) with the exception of P. atlantis. Despite having a compact genome (19.24 Mb), this nucleariid presents 8.58 introns per gene (Extended Data Fig. 3a). This ratio is almost identical to Homo sapiens, despite the introns of P. atlantis being approximately 86 times shorter (60.67 mean bp size) (Extended Data Fig. 3b), giving it an intron density (approximately four introns per kilobase) more than twice that of any other genome explored (Extended Data Fig. 1b).
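The intron density quoted above can be checked with simple arithmetic. The sketch below is only a consistency check, not the procedure used in the study: it assumes that the 9,028 proteins predicted from the P. atlantis genome (reported in the Methods) approximate the number of genes, and the function name is hypothetical.

```python
# Values reported in the text for Parvularia atlantis.
GENOME_SIZE_MB = 19.24
INTRONS_PER_GENE = 8.58
# Assumption: the 9,028 genome-based predicted proteins approximate the gene count.
GENE_COUNT = 9_028


def intron_density_per_kb(introns_per_gene, gene_count, genome_size_mb):
    """Return the number of introns per kilobase of genome sequence."""
    total_introns = introns_per_gene * gene_count
    genome_size_kb = genome_size_mb * 1_000
    return total_introns / genome_size_kb


density = intron_density_per_kb(INTRONS_PER_GENE, GENE_COUNT, GENOME_SIZE_MB)
print(f"{density:.2f} introns per kb")
```

Under these assumptions the result is close to four introns per kilobase, in line with the figure given in the text.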
Large differences in gene content
We explored whether the gene contents of Metazoa and Fungi present broad-scale functional differences as this would be indicative that, at some point after the divergence of their last common ancestor, a substantial genetic turnover occurred (that is, the remodelling of the gene content as a result of gene gains and losses, with gains including the origination of novel gene families and the expansion of ancestral families). In a multivariate analysis of the relative genomic representation of each Cluster of Orthologous Groups functional categories14 (hereafter referred to as functional categories), Metazoa and Fungi cluster separately in the dimension accounting for the largest variance explained (68.1%) (Fig. 2a). Functional categories of signal transduction (T), transcription (K) and extracellular structures (W), which are particularly relevant for animal multicellularity15,16, are among the most differentially represented in animal genomes (particularly T and W; Extended Data Fig. 5a). Other categories that are more represented in Metazoa include cytoskeleton (Z) and cell motility (N) (Fig. 2a). By contrast, the vast majority of metabolic functional categories (C, E, F, G, H, I and Q; see Fig. 1c) are proportionally more represented in Fungi (Fig. 2a).
Greater divergence of metazoan gene sets
From an evolutionary perspective, the large genetic differences shown between Metazoa and Fungi might be explained because either both or just one of the two groups experienced substantial genetic changes after diverging from their last shared common ancestor. Furthermore, this divergence could either be due to an abrupt genetic turnover in which changes would have occurred specifically in the root of both groups, or by a gradual process in which the preceding ancestors of each group were already accumulating changes in the direction of the differences observed in extant Metazoa and Fungi (Fig. 2a). To distinguish between these alternative scenarios, we took two complementary approaches to reconstruct the tempo and modes of the genetic divergence that occurred. In the first approach, we split the functional categories into two groups based on the results from the multivariate analysis on extant species from Metazoa and from Fungi (Fig. 2a): Metazoa-related or Fungi-related. Then, we computed the relative representation of each group of functional categories in every ancestral node of Opisthokonta (Fig. 1a) based on the gene contents inferred with our ancestral reconstruction pipeline (see Methods). In the second approach, we trained a series of machine learning classifiers to find their own functional category-based definition based on the gene contents from extant Metazoa and Fungi (see Methods). Then, we scored the ancestral nodes—which were not used to train the classifiers—according to how metazoan-like and fungal-like the relative compositions of functional categories of their inferred gene contents were (Extended Data Fig. 4d).
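The first approach can be illustrated with a minimal sketch. The two category sets below are taken from the categories that the text names as more represented in each group, so the actual split used in the study may differ; the choice of denominator (all annotated genes at the node) and the function names are also assumptions made for illustration.

```python
# Illustrative grouping based on the categories named in the text;
# the split actually derived from the multivariate analysis may differ.
METAZOA_RELATED = frozenset("TKWZN")
FUNGI_RELATED = frozenset("CEFGHIQ")


def group_representation(category_counts):
    """Return the (Metazoa-related, Fungi-related) fractions of a gene content.

    category_counts maps a one-letter functional category to a gene count.
    Assumption: fractions are taken over all annotated genes at the node.
    """
    total = sum(category_counts.values())
    if total == 0:
        raise ValueError("gene content has no annotated genes")
    metazoa = sum(n for c, n in category_counts.items() if c in METAZOA_RELATED)
    fungi = sum(n for c, n in category_counts.items() if c in FUNGI_RELATED)
    return metazoa / total, fungi / total


def compositional_shift(parent_counts, child_counts):
    """Change in the Metazoa-related fraction from parent to child node,
    expressed in percentage points."""
    parent_fraction, _ = group_representation(parent_counts)
    child_fraction, _ = group_representation(child_counts)
    return (child_fraction - parent_fraction) * 100


# Hypothetical gene contents for two consecutive ancestral nodes.
parent_node = {"T": 10, "K": 10, "C": 30, "E": 30, "J": 20}
child_node = {"T": 20, "K": 15, "C": 25, "E": 25, "J": 15}
print(f"{compositional_shift(parent_node, child_node):.1f} percentage points")
```

Summing such shifts along the path from the root of Opisthokonta to a given node gives the cumulative change discussed in the following sections.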
Not surprisingly, Fungi-related functional categories are more represented in Fungi (particularly in Basidiomycota and Ascomycota groups), but for most of the non-metazoan and non-fungal opisthokonts, the relative genomic representation of functional categories is more Fungi-like than Metazoa-like (Fig. 1d). As a result, Fungi does not separate from the protist opisthokont groups as distinctly as Metazoa (Extended Data Fig. 6b). These results are consistent with the fact that the machine learning classifiers differentiate the functional category compositions of Metazoa more strongly than those of Fungi (Extended Data Fig. 4d), as shown by the lower probabilities retrieved for the inner nodes of Fungi (43.7% for F3, root of Fungi) than those retrieved for Metazoa (81.7% for M4, root of Metazoa). Together, these results indicate that Metazoa experienced a broader differentiation at the gene function level than Fungi, with fungal gene contents being more similar to those of the protist opisthokonts, including the root of Opisthokonta (Fig. 1d and Extended Data Fig. 6c).
Gradual process, punctuated acceleration
Our ancestral reconstruction shows the genetic differences between Metazoa and Fungi (Fig. 2a) stemming from a divergence that started early after the split of Opisthokonta and continued up to the origin of the two groups (Fig. 2b,c). In the path to Metazoa, the changes that occurred in the three pre-metazoan ancestors (M1–M3) together contributed about as much to shifting the composition of the lineage towards Metazoa-related functional categories as the changes that occurred in the metazoan root (3.7% versus 3.5%; Fig. 1d). Among the pre-metazoan ancestors, the changes in M2 and M3 contributed more than the changes in M1, despite both of these nodes showing fewer net gene gains (Fig. 1a). This is because gains in M1 were distributed across a wider set of functional categories, whereas gains in M2 occurred particularly in Metazoa-related functional categories, and the net losses in M3 were more prevalent in Fungi-related functional categories (Fig. 1a). Notwithstanding the contribution of the pre-metazoan ancestors, at the root of Metazoa (M4) there is also evidence for a substantial burst of net gains from a subset of functional categories (Fig. 1b), including transcription (K), signal transduction (T) and extracellular structures (W), which are particularly relevant for the animal multicellular genetic toolkit15. Although in the pre-genomic era the animal multicellular genetic toolkit was largely expected to be the outcome of metazoan-specific genetic innovations (that is, gene families that originated at the metazoan root), comparative genomics has revealed orthologues of many toolkit components in the unicellular relatives of animals15,17–19. This finding highlighted the importance for multicellularity of co-opting genes that originated in earlier ancestors, although those same studies, as well as more recent studies19–21, also reported remarkable gene originations at the metazoan root.
To quantify what contributed more to the pool of gene families involved in functions that are particularly important for multicellularity (K, T and W), whether pre-metazoan gene originations from Holozoa or those that occurred at the metazoan root, we traced the evolutionary trajectories of these three categories after the divergence of Opisthokonta.
Of the gene gains observed at the metazoan root for the K, T and W categories, 42.8% correspond to gene families that originated in this same ancestor (M4), whereas 21.2% of gains in M4 correspond to the expansion of gene families that originated in the pre-metazoan holozoan ancestors (Extended Data Fig. 6d). This difference (42.8% to 21.2%) is much greater than that observed for the other functional categories (19.2% to 15.9%), indicating that among the gene gains that occurred at M4, gene originations were particularly relevant for K, T and W. An inspection of the ancestral contribution to the gene content of H. sapiens (Extended Data Fig. 6e) illustrates the same trend: genes from families that originated in M4, a single ancestral node, contributed to a similar extent to the repertoire of genes involved in K, T and W in H. sapiens (mean of 13.9%) as genes from families that originated in the three pre-metazoan ancestral nodes (M1–M3) (mean of 12.5%). From this, we conclude that gene originations at M4 have been quantitatively more important (13.9% versus 12.5%) to functional categories related to animal multicellularity than the gene originations coming from any of the preceding holozoan ancestors. As a result, the metazoan root experienced a substantial increment in the relative genomic representation of K, T and W (+1.35%, +1.16% and +0.35%, respectively, from M3 to M4) (Extended Data Fig. 6f). Notwithstanding this, the tendency towards increasing the relative genomic representation of these functional categories was already ongoing in the pre-metazoan holozoan ancestors (+1.73%, +0.66% and +0.24%, respectively, from O to M3) and hence predated the origin of animals (Extended Data Fig. 6f).
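As a worked illustration of the last two figures, the sketch below computes which share of the total increase from O to M4 had already accumulated before the metazoan root. It uses only the values quoted above and assumes that the reported changes are percentage-point differences that add up along the lineage; the function name is hypothetical.

```python
# Percentage-point changes in relative genomic representation quoted in the text.
SHIFT_O_TO_M3 = {"K": 1.73, "T": 0.66, "W": 0.24}   # pre-metazoan ancestors
SHIFT_M3_TO_M4 = {"K": 1.35, "T": 1.16, "W": 0.35}  # metazoan root


def pre_metazoan_share(category):
    """Fraction of the O-to-M4 increase that predates the metazoan root.

    Assumption: the two reported changes are additive along the lineage.
    """
    before_root = SHIFT_O_TO_M3[category]
    at_root = SHIFT_M3_TO_M4[category]
    return before_root / (before_root + at_root)


for category in "KTW":
    print(category, f"{pre_metazoan_share(category):.1%}")
```

Under this assumption, the pre-metazoan ancestors account for roughly 56% of the increase for K, 36% for T and 41% for W, which is consistent with a trend that predated the metazoan root and accelerated there.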
Main genetic changes in Fungi
Similar to Metazoa, the genetic changes that occurred in the preceding ancestors of Fungi from Holomycota (F1 and F2) contributed more to shifting the gene content (1.8% together)—in this case, towards Fungi-related functional categories—than the root of the group (0.07%) (Figs. 1d and 2c). However, whereas the ancestral path to Metazoa from M1 to M3 accumulated net gains of Metazoa-related functional categories, F1 and F2 did not accumulate gains but rather losses of Metazoa-related functional categories, particularly signal transduction (Fig. 1a).
The two fungal nodes that present the largest compositional shift towards Fungi-related functional categories are, on the one hand, the stem node of Dikarya (Ascomycota + Basidiomycota) (+1.9%; Fig. 1d), which experienced genetic changes that could have predisposed the evolution of complex multicellularity in some members of this group (see Supplementary Information 4), and on the other hand, the last common ancestor of Zoopagomycota, Mucoromycotina and Dikarya (+1.5%), which experienced important morphological adaptations such as the ancestral loss of the flagellum that is characteristic of most fungal groups22. On average, and in contrast to animals, Fungi retained gene contents of a similar size to their ancestors and the protist opisthokonts (Extended Data Fig. 7). Still, some fungal nodes showed substantial net gains, particularly the fungal root (F3; Fig. 1b). Similar to the animal root in Holozoa, F3 was the node in Holomycota with the largest fraction of gene gains being explained by gene originations (Extended Data Fig. 8). Nevertheless, the changes seen at the fungal root made a low contribution to the compositional shift of Fungi (0.07%; Fig. 1d) because this node accumulated net gains of both Metazoa and Fungi-related functional categories (Fig. 1b).
The main characteristic of the genetic turnover that occurred in the path to extant Fungi was a specialization towards metabolism (Fig. 2d), whereas animal genomes specialized towards other functional categories (Fig. 2a). In agreement with this, the metazoan root experienced a net loss of metabolic genes (Extended Data Fig. 5d), despite this node presenting an overall net gain of gene content (Fig. 1b), whereas the fungal root experienced net metabolic gene gains (Extended Data Fig. 5c). (Note that an additional supplementary analysis with a dataset that includes transcriptomic data from the aphelid Paraphelidium tribonemae9, which is the closest known group to Fungi, suggests that half of the net gene gains originally detected at the fungal root, including the metabolic ones, could have also predated the origin of Fungi; see Supplementary Fig. 2).
The metabolic changes at the gene content level that we described for the root of Metazoa and Fungi did not become a tendency that continued during the diversification of both groups, as we detected a net accumulation of metabolic genes in Metazoa, but not in Fungi (Extended Data Fig. 5c,d). The larger representation of metabolism in fungal genomes is thus explained because the gene turnover that occurred during the diversification of Fungi benefitted the metabolic over the non-metabolic functions (Fig. 2d). By contrast, Metazoa accumulated more genes of every category, but gains were not particularly biased towards metabolic functions (Fig. 1c).
Differences in gene gain mechanisms
Metazoa and Fungi also differ in their preferences among the mechanisms that can be sources of gene gains. Although no significant differences between groups were found in the relative contribution of gene originations to gene gains, gene duplications were found to be significantly more prevalent specifically among metazoan gains (Fig. 3a,b), in accordance with previous studies that highlighted the importance of duplications in the origin and diversification of animals21. Besides originations and duplications, the gene tree–species tree reconciliation software23 used in our ancestral reconstruction framework also estimates putative horizontal gene transfer events as sources of gene gains. Despite being originally described in Bacteria, horizontal gene transfer has been documented across a wide range of eukaryotes and is known to have led to significant functional changes24–27. However, the relative contribution of transfers to gene gains in eukaryotes, and whether this contribution is homogeneous across the phylogeny, remain uncertain28–30. In this regard, the fact that the reconciliation software recovered a significantly lower fraction of gene gains as being explained by transfers in Metazoa than in Fungi and in the other opisthokonts (Fig. 3c) is compatible with the historical consideration that transfers should contribute less to gene gains in animals due to germline isolation3–5.
Our ancestral reconstruction pipeline also detects originations that occurred due to gene fusion events. Previous studies17,18 have described multiple instances of genes in the animal multicellular toolkit that originated through gene fusions (here defined as the merging of partial or complete sequences from older genes). Our results indicate that fusions contributed significantly more to gene gains in Metazoa than in Fungi (Fig. 3d). This is explained not only by Metazoa having experienced more gene gains than Fungi (Extended Data Fig. 7), but also by the greater fraction of originations detected as fusions in Metazoa (Extended Data Fig. 9). Fusions being less prevalent in Fungi agrees with a previous study that reported a particularly low rate of fusions compared with fissions31. Because fusions seem to be particularly relevant sources of transcription and signal transduction genes (Extended Data Fig. 5e,f), this gene gain mechanism could have been more prevalent in Metazoa due to the excess of gains of these two categories (Fig. 1a,b), which are particularly relevant for multicellularity15.
Two divergent genomic trajectories
Together, the emerging picture from our ancestral reconstruction indicates that animals and fungi have been evolving under sharply contrasting trajectories of genomic changes that predated the origin of both groups (Fig. 4). Fungal gene contents remained relatively constant in size (Extended Data Fig. 7) and specialized into metabolism (Fig. 2d). By contrast, animals accumulated net gains of most functional categories, although the unequal distribution of gene gains across categories led some categories to increase their relative genomic representation over the others, particularly those that are important for multicellularity (Extended Data Fig. 6f). Although both groups experienced substantial gains and losses during their divergence (Extended Data Fig. 10), the lineage leading to extant Metazoa experienced a larger compositional change in gene function (Fig. 2b,c). As a result, metazoan gene contents are more diverged than the fungal gene contents from those of the other opisthokonts at both the broad-scale functional level and the gene family content level (Extended Data Fig. 6c,g). Given that the latter result is independent of gene function annotation, Metazoa being more differentiated than Fungi from the rest of opisthokonts from a gene content perspective is robust to potential inequalities that may exist between groups at the level of biological knowledge or in the availability of functional information. This indeed agrees with the fact that there are more evident morphological discontinuities between protists and animals than between protists and some groups of Fungi8. Neither the hypha nor the cell wall characteristic of Fungi, which is also present in some of their protist relatives, are fungal synapomorphies8. Only the abandonment of phagotrophy for an osmotrophic lifestyle seems to be a common although not exclusive feature of Fungi32. 
Although animals are distinguished from protists by the fact that all of them are multicellular, in Fungi, complex multicellularity is probably the outcome of convergent evolution, as it is found only in some particular groups, which present important differences in the gene contents involved in it33 (see Supplementary Information 4 for further information on the evolution of multicellularity in Opisthokonta and particularly in Fungi).
From a genomic perspective, the origin of Metazoa and Fungi is better described as a gradual rather than an episodic process given the contribution of their preceding ancestors (M1–M3 and F1–F2) to the cumulative changes at the level of gene function that occurred in the lineages leading to the extant representatives of both groups (Fig. 2b,c). Notwithstanding this, substantial quantitative changes in gene content also occurred concomitantly with the origin of the two groups (Fig. 1b). In particular, the genetic changes at the metazoan root represent an acceleration of a trend that was already ongoing in the pre-metazoan ancestors to accumulate genes of functional categories that are important for animal multicellularity (Extended Data Fig. 6f). These same categories underwent losses in the pre-fungal ancestors (Fig. 1a), leaving the immediate ancestors of Fungi and Metazoa with substantially different latent potentials from a genomic perspective. This is especially relevant for the case of animals. Had animal ancestors not experienced a continuous and long-standing evolutionary trajectory that had a compounding effect on the genomic potential for multicellularity, metazoans could not have arisen. The origin of animals may be seen as a drastic evolutionary event, but our taxon-rich analysis shows how the potential for that to happen was generated gradually on a genomic level. Our results illustrate the importance of analysing evolutionary transitions in the light of their evolutionary prehistory.
Methods
Methodological pipeline for genomic data acquisition
We sequenced a series of culture lines, each including one of the four species of interest (M. vibrans, P. atlantis, P. vietnamica and P. chileana). The cultures of M. vibrans and P. atlantis (formerly Nuclearia sp.) were purchased from ATCC (M. vibrans Tong. ATCC 50519 and Nuclearia sp. ATCC 50694, respectively). The cultures of P. vietnamica (formerly Opistho-1) and P. chileana (formerly Opistho-2) descend from the environmental isolates (P. vietnamica from a freshwater lake, Vietnam; and P. chileana from a temporary freshwater body, Chile) used in ref. 12. As expected, the starting cultures included an uncertain fraction of contaminant species. In particular, the cultures of M. vibrans and P. atlantis included an uncertain diversity of bacterial contamination, whereas the cultures of each Pigoraptor species also included contamination from the eukaryote Parabodo caudatus. The sequenced metagenomic data were submitted to a bioinformatic decontamination pipeline that consisted of two to three rounds of detection and removal of contaminant fragments based on taxonomic and tetranucleotide composition information. All steps were thoroughly supervised to maximize the retention of bona fide genomic fragments from our species of interest and the removal of contaminant sequences. Decontaminated genomes were annotated by combining the RNA sequencing-based BRAKER1 v1.9 (ref. 34) and PASA v2.0.2 (ref. 35) automatic annotation pipelines, the results of which were processed to correct erroneous gene predictions that might lead to the inference of false gene fusions. See Supplementary Information 1 for a detailed explanation of the nature of the sequenced data and of the decontamination and genome annotation processes (see Fig. 1 in Supplementary Information 1 for a summary illustration).
Clustering sequences into orthogroups
A dataset of 1,463,920 protein sequences from 83 eukaryotic species, 59 from Opisthokonta (including the four genomes produced) and 24 from other eukaryotic groups, was constructed (draft_euk_db; see Supplementary Table 4). Protein sequences were aligned all-against-all using BLASTp36 v2.5 [-seg yes, -soft_masking true, -evalue 1e-3]. On the basis of the alignments, proteins were clustered into orthogroups (OGs) with OrthoFinder37 v2.7 [-I 2]. We treat OGs as proxies of gene families. The OGs produced by OrthoFinder were processed with the MAPBOS pipeline to fix protein domain heterogeneity problems that would compromise downstream analyses (see Supplementary Information 2 for a discussion of this issue, and for an explanation of the algorithm that we developed to correct it).
Species tree reconstruction
Ancestral gene contents were inferred by means of a gene tree–species tree reconciliation software. We thus needed to reconstruct a phylogenetic tree for every gene family and a species tree of the whole eukaryotic supergroup Opisthokonta. The results from the species tree reconstruction analyses are available in Supplementary Information 3. We first selected 342 OGs present in >77% of draft_euk_db taxa and with no more than an average of 1.16 copies per taxon. We measured alignment instability of the 342 OGs using COS.pl and msa_set_score v2.02, which are based on the Heads-or-Tails approach38,39, keeping only those OGs with >0.70 mean column score (MCs). We manually curated the 69 OGs that passed this filter by performing individual phylogenies for each one, using MAFFT40 v7.123b [-einsi] for sequence alignment, trimAl41 v1.4.rev15 [-gappyout] for alignment trimming and IQ-TREE42 v1.6.7 for maximum-likelihood (ML) phylogenetic inference, using ModelFinder43 for model selection. Three of these 69 OGs were discarded because their topology was strongly in disagreement with the expected species topology. For the remaining 66 OGs (hereafter referred to as the MCs70 dataset), we removed sequences whose branching pattern suggested that they were most likely misclassified as OG members. In addition, to keep only one sequence per taxon in every OG, for inparalogue cases, we kept the least divergent sequence according to branch length. We removed a total of 630 sequences from the MCs70 dataset, including likely misclassified OG members but also contaminant sequences. Most contamination cases found correspond to contamination from Stramenopiles in the proteome of Syssomonas multiformis, probably from Spumella sp.12. However, we also detected Pirum gemmata contamination in the proteome of Abeoforma whisleri, and a few cases of Ichthyophonus hoferi contamination in Sphaerothecum destruens, indicating cross-contamination problems between these ichthyosporean datasets.
Still, these cases of contamination neither affected the phylogenetic inference, as they were removed during the screening, nor the downstream analyses, as these species were only used for species tree reconstruction purposes.
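The initial marker selection criteria described above (presence in >77% of taxa and an average of no more than 1.16 copies per taxon) can be sketched as follows. This is only an illustration: the function name is hypothetical, and computing the average copy number over the taxa in which the OG is present is an assumption, as the text does not state the denominator.

```python
def is_candidate_marker(copies_per_taxon, n_taxa_total,
                        min_occupancy=0.77, max_mean_copies=1.16):
    """Decide whether an orthogroup passes the occupancy and copy-number filter.

    copies_per_taxon maps a taxon name to the number of OG members it has.
    Assumption: mean copy number is computed over taxa where the OG is present.
    """
    present = {taxon: n for taxon, n in copies_per_taxon.items() if n > 0}
    if not present:
        return False
    occupancy = len(present) / n_taxa_total
    mean_copies = sum(present.values()) / len(present)
    return occupancy > min_occupancy and mean_copies <= max_mean_copies


# Hypothetical orthogroups in a dataset of 83 taxa.
single_copy_og = {f"taxon_{i}": 1 for i in range(70)}
sparse_og = {f"taxon_{i}": 1 for i in range(60)}
multi_copy_og = {f"taxon_{i}": (2 if i < 20 else 1) for i in range(70)}

print(is_candidate_marker(single_copy_og, 83))  # high occupancy, single copy
print(is_candidate_marker(sparse_og, 83))       # occupancy below threshold
print(is_candidate_marker(multi_copy_og, 83))   # too many copies on average
```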
We created two distinct versions of the MCs70 dataset: the first dataset including all sequences from Holozoa (ingroup) and from three Holomycota taxa (outgroup) (Holozoa MCs70), and the second dataset including all sequences from Holomycota (ingroup) and from three Holozoa taxa (outgroup) (Holomycota MCs70). An alignment supermatrix was created for each dataset, first aligning and trimming each OG separately [MAFFT -einsi, trimAl -gappyout], and later concatenating the alignments into a supermatrix (Holozoa MCs70: 37 taxa, 17,475 sites and 9.27% of missing data; Holomycota MCs70: 28 taxa, 17,409 sites and 7.81% of missing data). We constructed a phylogenetic tree for both MCs70 datasets using ML and Bayesian inference. ML inferences were done with IQ-TREE, and the models chosen for Holozoa and Holomycota MCs70 datasets were LG+C50+F+R7 and LG+C30+F+R6, respectively. Despite ModelFinder suggesting the usage of C60 (ref. 44) for Holomycota MCs70, we used mixture models with fewer profiles to avoid potential model overfitting, as some optimized mixture weights were estimated close to zero. Nodal supports for the ML trees consisted of 1,000 IQ-TREE ultrafast bootstrap replicates (UFBoot) and 100 standard non-parametric bootstrap replicates. Non-parametric bootstraps were computed under the PMSF model45. We used the previously inferred ML trees as guide trees to infer mixture model parameters and site-specific frequency profiles, as implemented in IQ-TREE v1.6.7. Bayesian phylogenies were inferred under the CAT+GTR+Gamma(4) model in PhyloBayes-MPI46 v1.8. Two chains were run for each of the Holozoa MCs70 and Holomycota MCs70 supermatrices, and convergence was assessed using the bpcomp and tracecomp programs in the PhyloBayes-MPI package. Consensus trees were built when the maximum between-chain discrepancy in bipartition frequencies fell below 0.1 (burn-in 33%).
We also performed three additional analyses (increasing number of positions in the supermatrix, compositional recoding and fastest-evolving sites removal) to test the robustness of the topological relationships found (see Supplementary Information 3).
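The convergence criterion used for the Bayesian analyses (maximum between-chain discrepancy in bipartition frequencies below 0.1 after a 33% burn-in) can be illustrated with a minimal sketch. It does not reproduce bpcomp; it only shows the quantity being compared, and the data structures and function names are assumptions.

```python
def discard_burn_in(samples, burn_in_percent=33):
    """Drop the first burn_in_percent of the sampled trees from one chain."""
    n_discard = len(samples) * burn_in_percent // 100
    return samples[n_discard:]


def max_bipartition_discrepancy(chain_a, chain_b):
    """Largest absolute difference in bipartition frequency between two chains.

    Each chain is a dict mapping a bipartition (here a frozenset of taxon
    names on one side of the split) to its posterior frequency. A bipartition
    absent from one chain is treated as having frequency 0 in that chain.
    """
    bipartitions = set(chain_a) | set(chain_b)
    if not bipartitions:
        return 0.0
    return max(abs(chain_a.get(b, 0.0) - chain_b.get(b, 0.0))
               for b in bipartitions)


# Hypothetical post-burn-in bipartition frequencies from two chains.
chain_1 = {frozenset({"A", "B"}): 1.00, frozenset({"C", "D"}): 0.62}
chain_2 = {frozenset({"A", "B"}): 0.98, frozenset({"C", "D"}): 0.55,
           frozenset({"A", "C"}): 0.03}

maxdiff = max_bipartition_discrepancy(chain_1, chain_2)
print(f"maxdiff = {maxdiff:.2f}; below threshold: {maxdiff < 0.1}")
```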
Incorporation of prokaryotic homologues into the OGs
We incorporated prokaryotic homologues into the clusters before the MAPBOS processing step. For the incorporation of prokaryotic (and viral) homologues into the clusters, we first used DIAMOND47 v0.8.22.84 [--more-sensitive, -e 1e-05] to align all eukaryotic sequences from euk_db (a subset of draft_euk_db, which includes the species labelled in bold in Supplementary Table 4) to a database including 8,231,104 bacterial, 331,476 archaeal and 20,955 viral sequences from UniProt reference proteomes (release 2016_02; prok_db) (forward alignment approach). The aligned sequences from prok_db were aligned back against euk_db sequences (reverse alignment approach). Hits with query and target alignment coverages lower than 75% were discarded, as well as hits in which the best-scoring euk_db target of a given prok_db query was a member of a different cluster from the best-scoring euk_db query for that prok_db sequence in the forward alignment. After discarding the hits not satisfying these conditions, we incorporated into the clusters only the best-scoring prok_db query of each euk_db target sequence (that is, if a cluster has 300 sequences and the best-scoring query of all of them was the same prok_db sequence, only that sequence will be incorporated into the cluster, which will then have 300 euk_db sequences and 1 prok_db sequence). Prok_db sequences were incorporated into OrthoFinder -I 2 clusters before these were processed by the MAPBOS pipeline (Supplementary Information 3). After MAPBOS, clusters included 1,117,614 eukaryotic sequences and 58,017 non-eukaryotic sequences (53,168, 4,301 and 548 from Bacteria, Archaea and viruses, respectively). All these 1,175,631 sequences were distributed among 413,445 clusters, 370,686 of which are singletons. Among eukaryotic sequences, on a taxonomic level, clusters included sequences mostly from Opisthokonta (50 species), but also from 18 representatives of other major eukaryotic groups (euk_db dataset).
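The filtering logic described above can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline used in the study: the `Hit` record and function names are hypothetical, bit score is assumed as the ranking criterion, a hit is assumed to be discarded when either coverage falls below 75%, and best hits are assumed to be determined before the coverage filter.

```python
from collections import defaultdict, namedtuple

# Hypothetical record for one alignment between a euk_db and a prok_db sequence.
Hit = namedtuple("Hit", "euk_id prok_id bitscore query_cov target_cov")


def best_euk_per_prok(hits):
    """Return, for each prok_db sequence, its best-scoring hit."""
    best = {}
    for hit in hits:
        current = best.get(hit.prok_id)
        if current is None or hit.bitscore > current.bitscore:
            best[hit.prok_id] = hit
    return best


def assign_prokaryotic_homologues(forward_hits, reverse_hits, cluster_of,
                                  min_cov=0.75):
    """Map each cluster to the set of prok_db sequences to incorporate."""
    forward_best = best_euk_per_prok(forward_hits)
    reverse_best = best_euk_per_prok(reverse_hits)

    kept = []
    for hit in reverse_hits:
        # Coverage filter (assumption: both coverages must reach the threshold).
        if hit.query_cov < min_cov or hit.target_cov < min_cov:
            continue
        # Cluster consistency between forward and reverse best hits.
        fwd = forward_best.get(hit.prok_id)
        rev = reverse_best[hit.prok_id]
        if fwd is None or cluster_of[fwd.euk_id] != cluster_of[rev.euk_id]:
            continue
        kept.append(hit)

    # Keep only the best-scoring prok_db query of each euk_db target.
    best_prok = {}
    for hit in kept:
        current = best_prok.get(hit.euk_id)
        if current is None or hit.bitscore > current.bitscore:
            best_prok[hit.euk_id] = hit

    additions = defaultdict(set)
    for euk_id, hit in best_prok.items():
        additions[cluster_of[euk_id]].add(hit.prok_id)
    return dict(additions)


# Hypothetical toy data: two clusters, two prokaryotic sequences.
clusters = {"e1": "OG1", "e2": "OG1", "e3": "OG2"}
forward = [Hit("e1", "p1", 200, 0.9, 0.9), Hit("e2", "p1", 180, 0.9, 0.9),
           Hit("e3", "p2", 150, 0.9, 0.9)]
reverse = [Hit("e1", "p1", 200, 0.9, 0.9), Hit("e2", "p1", 190, 0.8, 0.8),
           Hit("e1", "p2", 160, 0.9, 0.9), Hit("e3", "p2", 140, 0.9, 0.5)]
result = assign_prokaryotic_homologues(forward, reverse, clusters)
print(result)
```

In the toy data, both members of OG1 share the same best prokaryotic query, so it is added to the cluster only once, mirroring the 300-sequence example in the text; the second prokaryotic sequence is rejected because its forward and reverse best hits fall in different clusters.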
Gene tree inference and gene tree–species tree reconciliation analyses
We submitted every post-MAPBOS OG (or cluster) to a gene tree inference pipeline consisting of MAFFT-linsi for the alignment step, trimAl [-gappyout] for alignment trimming and IQ-TREE for the phylogenetic inference. In particular, IQ-TREE was run with the LG+G4 model, sampling 1,000 optimized [-bnni] UFBoot replicates for every gene tree.
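The three steps above could be chained roughly as follows. This is a sketch rather than the scripts used in the study: the helper `gene_tree_commands`, the file naming and the output directory are hypothetical; the `-bb 1000` and `-wbtl` options assume an IQ-TREE 1.x command line (the text states only the model, the number of replicates and `-bnni`); and the commands are assembled as strings, not executed.

```python
from pathlib import Path


def gene_tree_commands(fasta, outdir="trees"):
    """Return the shell commands for one cluster, without running them."""
    stem = Path(outdir) / Path(fasta).stem
    aln = f"{stem}.aln"
    trimmed = f"{stem}.trim.aln"
    return [
        # Alignment with the L-INS-i strategy of MAFFT.
        f"mafft-linsi {fasta} > {aln}",
        # Alignment trimming with the gappyout heuristic.
        f"trimal -in {aln} -out {trimmed} -gappyout",
        # ML tree under LG+G4 with 1,000 NNI-optimized UFBoot replicates;
        # -wbtl (assumed here) writes the replicate trees to a file.
        f"iqtree -s {trimmed} -m LG+G4 -bb 1000 -bnni -wbtl",
    ]


if __name__ == "__main__":
    for cmd in gene_tree_commands("OG0001.fa"):
        print(cmd)
```

Keeping the UFBoot replicate trees matters because the reconciliation step takes a distribution of trees per gene family as input.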
For the gene tree–species tree reconciliation analyses, we used ALEml_undated from ALE v0.4 (https://github.com/ssolo/ALE). ALEml_undated requires a distribution of phylogenetic trees for every gene family (the UFBoot replicates in our case) and a species tree. The Opisthokonta fraction of the species tree consisted of the most favoured topology according to our analyses, which only included Opisthokonta taxa (Fig. 1 in Supplementary Information 3). The phylogenetic relationships between the non-Opisthokonta taxa were taken directly from a consensus of currently available bibliographical references48–56 (all euk_db species were included in the reconciliation analyses). Reconciliation analyses also incorporated non-eukaryotic sequences (see above), which, for practical reasons, were assigned to the same terminal node in the species tree (named ‘Prokaryotes’ in Fig. 7 in Supplementary Information 3). Eukaryotes with only transcriptomic or poor-quality genomic data were excluded from the reconciliation analyses (those labelled in grey in Fig. 1 in Supplementary Information 3). The inclusion of transcriptomic data would have been particularly problematic for our study for the following reasons. (1) Gene content predictions from transcriptomic data tend to present inflated gene counts. For example, the proteomes that were previously produced based solely on transcriptomic data for P. atlantis2 and for P. vietnamica and P. chileana12 include many more sequences (29,620, 46,018 and 37,783) than the proteomes that we predicted from the genome sequences of these species (9,028, 14,822 and 14,510), with the genome-based proteomes showing even better completeness metrics (Fig. 23 in Supplementary Information 1). Inflated gene counts are expected to produce an excess of duplication inferences in the reconciliations. (2) Unexpressed genes may be mistaken for gene losses. (3) Transcriptomes are harder to decontaminate owing to the lack of genomic context information regarding neighbouring genes, intron sequences or compositional features of the coding sequence. (4) Sequences predicted from partial isoforms are expected to cause inaccuracies in the software used to detect gene fusions (see below). (5) Accurate gene contents were also important given that the reconciliation software used (see above) infers the values of parameters such as gene duplication and loss rates from the data.
Inference of gene fusion events
We used CompositeSearch57 to identify composite gene families, that is, families of genes whose protein sequence is composed of parts (for example, protein domains) that are separately found in other, component, gene families. CompositeSearch requires all-against-all sequence alignments as input, for which we used the same BLASTp results used for OrthoFinder (see above), although alignment hits corresponding to draft_euk_db species not represented in euk_db were removed. Before being used as input for CompositeSearch, the BLASTp results were preprocessed with cleanBlastp (included in CompositeSearch) to retain only the highest-scoring hit among all hits involving the same query–target pair. CompositeSearch was run with the default parameters, forcing the software [-f] to work on the clusters resulting from the processing of the OrthoFinder OGs by the MAPBOS pipeline. Families with only one sequence were discarded as potential components [-y]. Prok_db sequences were not included in the composite inferences because, owing to computational time limitations, the alignments between prok_db and euk_db sequences were done with DIAMOND instead of BLASTp. Because we work at the gene family level (clusters), we only considered as composites those clusters in which >50% of members were detected as composite sequences. This includes 48,066 clusters, 3,229 of which are not singletons.
CompositeSearch detects as a composite any sequence that matches distinct subsets of sequences (components, from other OGs) in different regions of its sequence. Although fusion events may lead to composite sequences, not all sequences detected as composites necessarily originated from a gene fusion process. For example, a sequence found to be composite by the software could have originated de novo in a given ancestral lineage (gene X, with domains A and B), and then, in a descendant lineage, that gene could have been split into two separate genes (gene Y with domain A and gene Z with domain B). In such a case of gene fission, the software would detect gene X as a composite because one part of its sequence would align to gene Y (first component) and the other to gene Z (second component). To retain only bona fide fusion composite sequences, we only considered those composite sequences for which all components were inferred to have a more ancestral origin than the composite. This was done to minimize false-positive inferences of fusions, at the expense of losing potential fusion events in which, for example, both the composite and the components may have originated in the same node of the phylogeny.
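The two cluster-level criteria (more than 50% composite members, and all components older than the composite) can be illustrated with a short sketch. The function names, the child-to-parent representation of the species tree and the toy node labels are hypothetical, and ‘more ancestral’ is interpreted here as ‘a strict ancestor of the node in which the composite originated’, which is an assumption about how the comparison was implemented.

```python
def is_majority_composite(members, composite_seqs):
    """True if >50% of the cluster members were detected as composite."""
    if not members:
        return False
    n_comp = sum(1 for m in members if m in composite_seqs)
    return n_comp / len(members) > 0.5


def strict_ancestors(node, parent):
    """Yield the ancestors of a species-tree node, nearest first.

    parent maps each node to its parent; the root has no entry.
    """
    while node in parent:
        node = parent[node]
        yield node


def is_bona_fide_fusion(composite_origin, component_origins, parent):
    """True if every component originated in a strict ancestor of the
    node in which the composite originated (same-node origins fail)."""
    older = set(strict_ancestors(composite_origin, parent))
    return bool(component_origins) and all(
        origin in older for origin in component_origins
    )


if __name__ == "__main__":
    # Toy species tree, child -> parent.
    parent = {"Holozoa": "Opisthokonta", "Holomycota": "Opisthokonta",
              "Metazoa": "Holozoa"}
    print(is_majority_composite(["s1", "s2", "s3"], {"s1", "s2"}))
    print(is_bona_fide_fusion("Metazoa", ["Holozoa", "Opisthokonta"], parent))
    print(is_bona_fide_fusion("Metazoa", ["Metazoa", "Opisthokonta"], parent))
```

The last call returns `False` because one component shares its node of origin with the composite, which is exactly the conservative case the text accepts losing.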
Functional annotation of sequences and OGs
Protein domain architectures of euk_db sequences and of the captured prok_db sequences (see above) were determined with PfamScan58 using Pfam A v29. Clusters of Orthologous Groups functional categories (functional categories) and KEGG Orthology Groups (KOs)59 were assigned to euk_db sequences with eggNOG-mapper60 v1.0.3-3-g3e22728, using DIAMOND for the alignments of euk_db sequences against the eggNOG database (the functional category ‘S: unknown function’ was ignored as it does not include functional information). Once sequences were annotated, the functional categories and KO annotations of every cluster were determined by averaging the annotations of the corresponding cluster members. For example, if a cluster includes two sequences (SeqA and SeqB), and SeqA was annotated with the functional category K and SeqB with the functional categories B and K, that cluster would be annotated as 0.75K and 0.25B (0.5K from SeqA + 0.25K from SeqB, and 0.25B from SeqB).
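The averaging rule in the example above can be written down in a few lines. This is an illustrative sketch: the function name `cluster_profile` is hypothetical, and it assumes that sequences without any annotation do not contribute to the average, which the text does not specify.

```python
from collections import defaultdict


def cluster_profile(seq_annotations):
    """Average per-sequence annotations into a cluster-level profile.

    seq_annotations: {sequence id: list of functional categories}.
    Each annotated sequence contributes a total weight of 1/n_annotated,
    split equally among its categories.
    """
    annotated = [cats for cats in seq_annotations.values() if cats]
    profile = defaultdict(float)
    for cats in annotated:
        share = 1 / (len(annotated) * len(cats))
        for cat in cats:
            profile[cat] += share
    return dict(profile)


if __name__ == "__main__":
    # The worked example from the text: SeqA -> K, SeqB -> B and K.
    print(cluster_profile({"SeqA": ["K"], "SeqB": ["B", "K"]}))
    # {'K': 0.75, 'B': 0.25}
```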
Inference of gains, losses and counts of functional categories and metabolic gene contents
From the reconciliation analyses (see ‘Gene tree inference and gene tree–species tree reconciliation analyses’), we retrieved the number of gains, losses and gene contents of every OG in every node in the phylogeny. For every node, we determined the absolute representation of all functional categories by combining the number of copies of every OG in the node with the relative representation of every functional category among the functional annotations of that OG. The same was done to determine the KO contents of every node. The percentage of metabolic genes in every node was determined by dividing the number of KOs with metabolic annotations by the total number of genes in the node (besides KOs belonging to the ‘metabolic category’, those belonging to the category ‘membrane transport’ were also considered as metabolic genes). The relative representation of every functional category in every node was determined by dividing the absolute value of every category in the node by the sum of the absolute values of all functional categories in the node. Gains and losses of functional categories and KOs were determined by comparing the contents of every node with those of its immediately preceding node.
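The bookkeeping described above can be sketched as follows, assuming per-node OG copy numbers (which may be fractional in reconciliation output) and per-OG category profiles are available as dictionaries. All function names and toy values are hypothetical, and the net change per category is only one simple way of comparing a node with its parent.

```python
def node_category_counts(copies, profiles):
    """Absolute representation of each functional category in a node.

    copies:   {OG id: copy number of that OG in the node}
    profiles: {OG id: {category: relative representation in the OG}}
    """
    totals = {}
    for og, n in copies.items():
        for cat, frac in profiles.get(og, {}).items():
            totals[cat] = totals.get(cat, 0.0) + n * frac
    return totals


def relative_representation(totals):
    """Divide each category by the sum over all categories in the node."""
    total = sum(totals.values())
    return {cat: v / total for cat, v in totals.items()} if total else {}


def net_change(node_totals, parent_totals):
    """Per-category difference between a node and its preceding node;
    positive values indicate net gain, negative values net loss."""
    cats = set(node_totals) | set(parent_totals)
    return {c: node_totals.get(c, 0.0) - parent_totals.get(c, 0.0)
            for c in cats}


if __name__ == "__main__":
    profiles = {"OG1": {"K": 0.75, "B": 0.25}, "OG2": {"T": 1.0}}
    parent = node_category_counts({"OG1": 1.0, "OG2": 1.0}, profiles)
    child = node_category_counts({"OG1": 2.0, "OG2": 1.0}, profiles)
    print(child)                           # {'K': 1.5, 'B': 0.5, 'T': 1.0}
    print(relative_representation(child)["K"])  # 0.5
    print(net_change(child, parent))
```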
Statistical analyses
Statistical analyses were carried out either in Python, mainly with the libraries Pandas61 and NumPy62, or in R. All descriptive statistics plots (with the exception of those including phylogenetic trees, which were constructed with iTOL63) were done in R, particularly with the ggplot2 package64. Mann–Whitney U-tests (one-tailed) were done in Python with SciPy65 (scipy.stats.mannwhitneyu). More specific statistical analyses are detailed below.
Correspondence analyses of relative functional category compositions
The relative genomic representations of functional categories are an example of compositional data (CoDa)66, in which every column (a functional category) is expressed as a relative fraction and the sum of all values is the same for every row (genome). Because CoDa are characterized by non-orthogonality and collinearity, the most commonly used multivariate techniques, such as principal component analysis, are inappropriate for CoDa, and alternatives such as correspondence analysis are recommended instead66. Correspondence analyses were done in R67 with the FactoMineR package68 and the plots were constructed with the factoextra package69.
Machine learning classifiers
For the classifiers of metazoan and fungal functional category compositions, we benchmarked five widely used learning models: logistic regression, k-nearest neighbours classifier, support vector classifier, Random Forest and artificial neural network, fine-tuning the model hyperparameters in every case using fivefold cross-validation. In total, we generated two classifiers for every learning model: one trained to distinguish the functional category compositions of Metazoa from those of the other terminal nodes in Opisthokonta, and another doing the same for Fungi instead of Metazoa. Relative functional category compositions were not used directly as features to train the models because they are correlated with one another. Instead, the models were trained with the components retrieved from the correspondence analyses on the relative functional category compositions of opisthokont terminal nodes (relative compositions were computed excluding the S ‘unknown function’ category and applying first a column-wise and then a row-wise normalization before the correspondence analysis was performed). Once the models were trained, we computed the probability of belonging to the given class (Metazoa or Fungi, depending on the model) for every opisthokont node, including both terminal nodes (used for model training) and internal nodes (not used for model training) (see values in Supplementary Table 5). The probabilities represented in Extended Data Fig. 4d correspond to a weighted average of the probabilities retrieved from every classifier (excluding logistic regression, which disagreed with the other classifiers and showed worse predictions). The weights were determined in the following manner: for every node, the average probability was computed, and we then computed the variance of each of the four models with respect to those averages. The weight of every model corresponds to the inverse of the relative variance of that model divided by the sum of the variances of the four models. The code is available at 10.6084/m9.figshare.13140191.v1 (‘fungiMetazoa_predModels’ in Code.300322.zip). We expect the predictors to capture the genomic compositional features well because, for example, in the case of Metazoa, Trichoplax adhaerens, the animal with the lowest degree of phenotypic complexity among the sampled species, is the node with the lowest probability (Extended Data Fig. 4d). All of these analyses were carried out in Python using packages from the scikit-learn70, TensorFlow71 and Keras72 libraries.
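One possible reading of the weighting scheme is sketched below; the deposited code remains the authoritative implementation. The sketch assumes that the variance of a model is its mean squared deviation from the per-node average across models, that the relative variance is that variance divided by the sum of the variances, and that the inverse relative variances are rescaled to sum to 1. The function names, model labels and probabilities are hypothetical toy values.

```python
def model_weights(probs):
    """Weights from inverse relative variance (an assumed interpretation).

    probs: {model name: list of class probabilities, one per node}.
    Returns {model name: weight}, with weights summing to 1.
    """
    models = list(probs)
    n_nodes = len(probs[models[0]])
    # Average probability of each node across models.
    node_mean = [sum(probs[m][i] for m in models) / len(models)
                 for i in range(n_nodes)]
    # Mean squared deviation of each model from the per-node averages.
    var = {m: sum((probs[m][i] - node_mean[i]) ** 2
                  for i in range(n_nodes)) / n_nodes
           for m in models}
    if min(var.values()) == 0:
        raise ValueError("a model with zero variance has no finite weight")
    total = sum(var.values())
    inverse_relative = {m: total / var[m] for m in models}
    norm = sum(inverse_relative.values())
    return {m: inverse_relative[m] / norm for m in models}


def weighted_probabilities(probs, weights):
    """Weighted average probability for every node."""
    n_nodes = len(next(iter(probs.values())))
    return [sum(weights[m] * probs[m][i] for m in probs)
            for i in range(n_nodes)]


if __name__ == "__main__":
    toy = {"model_a": [0.9, 0.1], "model_b": [0.8, 0.2], "model_c": [0.7, 0.6]}
    w = model_weights(toy)
    print(w)
    print(weighted_probabilities(toy, w))
```

Under this reading, a model that stays close to the consensus of the classifiers (here `model_b`) receives the largest weight, so outlying predictions are down-weighted rather than discarded.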
Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to this article.
Online content
Any methods, additional references, Nature Research reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at 10.1038/s41586-022-05110-4.
Acknowledgements
E.O.-P. was supported by a predoctoral FPI grant from MINECO (BES-2015-072241) and by ESF Investing in your future. E.O.-P., D.L.-E., A.S.A. and I.R.-T. received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP7-2007-2013) (Grant agreement No. 616960) and also from grants (BFU2014-57779-P and PID2020-120609GB-I00) by MCIN/AEI/10.13039/501100011033 and ‘ERDF A way of making Europe’. E.O.-P. and G.J.Sz. received funding from the European Research Council under the European Union’s Horizon 2020 research and innovation programme (Grant agreement No. 714774). T.A.W. was supported by a Royal Society University Research Fellowship (URF\R\201024) and NERC standard grant NE/P00251X/1. This work was supported by the Gordon and Betty Moore Foundation through grant GBMF9741 to T.A.W. and G.J.Sz. J.S.P. and E.B. received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP7-2007-2013) (Grant agreement No. 615274). D.V.T. and cell culturing were supported by the Russian Science Foundation grant no. 18-14-00239, https://rscf.ru/project/18-14-00239/. The culture of P. vietnamica was obtained as the result of field work in Vietnam as part of the project ‘Ecolan 3.2’ of the Russian–Vietnam Tropical Center. P.J.K. is supported by an Investigator Award from the Gordon and Betty Moore Foundation (10.37807/GBMF9201). We thank the CRG/UPF FACS Unit, the CRG Genomics Unit and M. Antó-Subirats for technical assistance; D. J. Richter, M. M. Leger and I. Patten for the feedback provided on the manuscript; and M. J. Greenacre for the feedback provided on multivariate statistics.
Author contributions
E.O.-P. conceptualized the study and wrote the draft of the manuscript under the supervision of I.R.-T. E.O.-P., D.L.-E. and A.S.A. generated the material for sequencing. E.O.-P. and A.S.A. made the figures for the manuscript. E.O.-P. performed all bioinformatic analyses (except those specified below). T.A.W. performed the gene tree–species tree reconciliation analyses and the Bayesian species tree reconstruction, and provided feedback about the project. G.J.Sz. contributed substantially to reviewing the manuscript and providing feedback about the project. J.S.P. and E.B. adapted the CompositeSearch software and provided feedback about the project. D.V.T. and P.J.K. provided polyxenic cultures of P. vietnamica, P. chileana and P. caudatus. All authors contributed to the review of the manuscript before submission for publication and approved the final version.
Peer review
Peer review information
Nature thanks Maja Adamska, James McInerney and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.
Data availability
The raw sequence data and assembled genomes generated in this study have been deposited in the European Nucleotide Archive (ENA) at EMBL-EBI under accession number PRJEB52884 (https://www.ebi.ac.uk/ena/browser/view/PRJEB52884). The genome assemblies are also available in figshare (10.6084/m9.figshare.19895962.v1). Protein sequences of the species used in this study were downloaded from the GenBank public databases (https://www.ncbi.nlm.nih.gov/protein/), Uniprot (https://www.uniprot.org/), JGI genome database (https://genome.jgi.doe.gov/portal/) and Ensembl genomes (https://www.ensembl.org). The following specific databases were also used in this study: Pfam A v29 (https://pfam.xfam.org/), EggNOG emapperdb-4.5.1 (http://eggnog5.embl.de) and UniProt reference proteomes release 2016_02 (https://www.uniprot.org/). The supporting data files of this study are available in the following repository: 10.6084/m9.figshare.13140191.v1.
Code availability
The most relevant custom code developed for this study (the MAPBOS pipeline and the machine learning classifiers) is available at 10.5281/zenodo.6586559.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Contributor Information
Eduard Ocaña-Pallarès, Email: ed3716@gmail.com.
Iñaki Ruiz-Trillo, Email: inaki.ruiz@ibe.upf-csic.es.
Extended data
Extended data is available for this paper at 10.1038/s41586-022-05110-4.
Supplementary information
The online version contains supplementary material available at 10.1038/s41586-022-05110-4.
References
- 1. Adl SM, et al. Revisions to the classification, nomenclature, and diversity of eukaryotes. J. Eukaryot. Microbiol. 2019;66:4–119. doi: 10.1111/jeu.12691.
- 2. Torruella G, et al. Phylogenomics reveals convergent evolution of lifestyles in close relatives of animals and fungi. Curr. Biol. 2015;25:2404–2410. doi: 10.1016/j.cub.2015.07.053.
- 3. Andersson JO. Lateral gene transfer in eukaryotes. Cell. Mol. Life Sci. 2005;62:1182–1197. doi: 10.1007/s00018-005-4539-z.
- 4. Keeling PJ, Palmer JD. Horizontal gene transfer in eukaryotic evolution. Nat. Rev. Genet. 2008;9:605–618. doi: 10.1038/nrg2386.
- 5. Doolittle WF. Phylogenetic classification and the universal tree. Science. 1999;284:2124–2128. doi: 10.1126/science.284.5423.2124.
- 6. Wainright P, Hinkle G, Sogin ML, Stickel SK. Monophyletic origins of the metazoa: an evolutionary link with fungi. Science. 1993;260:340–342. doi: 10.1126/science.8469985.
- 7. Del Campo J, et al. The others: our biased perspective of eukaryotic genomes. Trends Ecol. Evol. 2014;29:252–259. doi: 10.1016/j.tree.2014.03.006.
- 8. Richards TA, Leonard G, Wideman JG. What defines the “kingdom” Fungi? Microbiol. Spectr. 2017;5:3. doi: 10.1128/microbiolspec.FUNK-0044-2017.
- 9. Torruella G, et al. Global transcriptome analysis of the aphelid Paraphelidium tribonemae supports the phagotrophic origin of fungi. Commun. Biol. 2018;1:231. doi: 10.1038/s42003-018-0235-z.
- 10. Galindo LJ, López-García P, Torruella G, Karpov S, Moreira D. Phylogenomics of a new fungal phylum reveals multiple waves of reductive evolution across Holomycota. Nat. Commun. 2021;12:4973. doi: 10.1038/s41467-021-25308-w.
- 11. Tong SM. Heterotrophic flagellates and other protists from Southampton Water, U.K. Ophelia. 1997;47:71–131. doi: 10.1080/00785236.1997.10427291.
- 12. Hehenberger E, et al. Novel predators reshape Holozoan phylogeny and reveal the presence of a two-component signaling system in the ancestor of animals. Curr. Biol. 2017;27:2043–2050. doi: 10.1016/j.cub.2017.06.006.
- 13. López-Escardó D, López-García P, Moreira D, Ruiz-Trillo I, Torruella G. Parvularia atlantis gen. et sp. nov., a Nucleariid Filose Amoeba (Holomycota, Opisthokonta). J. Eukaryot. Microbiol. 2018;65:170–179. doi: 10.1111/jeu.12450.
- 14. Tatusov RL. The COG database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 2000;28:33–36. doi: 10.1093/nar/28.1.33.
- 15. Suga H, et al. The Capsaspora genome reveals a complex unicellular prehistory of animals. Nat. Commun. 2013;4:2325. doi: 10.1038/ncomms3325.
- 16. Ros-Rocher N, Pérez-Posada A, Leger MM. The origin of animals: an ancestral reconstruction of the unicellular-to-multicellular transition. Open Biol. 2021;11:200359. doi: 10.1098/rsob.200359.
- 17. King N, et al. The genome of the choanoflagellate Monosiga brevicollis and the origin of metazoans. Nature. 2008;451:783–788. doi: 10.1038/nature06617.
- 18. Grau-Bové X, et al. Dynamics of genomic innovation in the unicellular ancestry of animals. eLife. 2017;6:e26036. doi: 10.7554/eLife.26036.
- 19. Richter DJ, Fozouni P, Eisen MB, King N. Gene family innovation, conservation and loss on the animal stem lineage. eLife. 2018;7:e34226. doi: 10.7554/eLife.34226.
- 20. Paps J, Holland PWH. Reconstruction of the ancestral metazoan genome reveals an increase in genomic novelty. Nat. Commun. 2018;9:1730. doi: 10.1038/s41467-018-04136-5.
- 21. Fernández R, Gabaldón T. Gene gain and loss across the metazoan tree of life. Nat. Ecol. Evol. 2020;4:524–533. doi: 10.1038/s41559-019-1069-x.
- 22. Stajich JE, et al. The Fungi. Curr. Biol. 2009;19:R840–R845. doi: 10.1016/j.cub.2009.07.004.
- 23. Szöllősi GJ, Davín AA, Tannier E, Daubin V, Boussau B. Genome-scale phylogenetic analysis finds extensive gene transfer among fungi. Phil. Trans. R. Soc. B. 2015;370:20140335. doi: 10.1098/rstb.2014.0335.
- 24. Ocaña-Pallarès E, Najle SR, Scazzocchio C, Ruiz-Trillo I. Reticulate evolution in eukaryotes: origin and evolution of the nitrate assimilation pathway. PLoS Genet. 2019;15:e1007986. doi: 10.1371/journal.pgen.1007986.
- 25. Boto L. Horizontal gene transfer in the acquisition of novel traits by metazoans. Proc. R. Soc. B. 2014;281:20132450. doi: 10.1098/rspb.2013.2450.
- 26. Irwin NAT, Pittis AA, Richards TA, Keeling PJ. Systematic evaluation of horizontal gene transfer between eukaryotes and viruses. Nat. Microbiol. 2021;7:327–336. doi: 10.1038/s41564-021-01026-3.
- 27. Bock R. The give-and-take of DNA: horizontal gene transfer in plants. Trends Plant Sci. 2010;15:11–22. doi: 10.1016/j.tplants.2009.10.001.
- 28. Martin WF. Too much eukaryote LGT. BioEssays. 2017;39:1700115. doi: 10.1002/bies.201700115.
- 29. Leger MM, Eme L, Stairs CW, Roger AJ. Demystifying eukaryote lateral gene transfer. BioEssays. 2018;40:1700242. doi: 10.1002/bies.201700242.
- 30. Roger AJ. Reply to ‘Eukaryote lateral gene transfer is Lamarckian’. Nat. Ecol. Evol. 2018;2:755. doi: 10.1038/s41559-018-0522-6.
- 31. Leonard G, Richards TA. Genome-scale comparative analysis of gene fusions, gene fissions, and the fungal tree of life. Proc. Natl Acad. Sci. USA. 2012;109:21402–21407. doi: 10.1073/pnas.1210909110.
- 32. Richards TA, Talbot NJ. Osmotrophy. Curr. Biol. 2018;28:R1179–R1180. doi: 10.1016/j.cub.2018.07.069.
- 33. Nagy LG, Kovács GM, Krizsán K. Complex multicellularity in fungi: evolutionary convergence, single origin, or both? Biol. Rev. 2018;93:1778–1794. doi: 10.1111/brv.12418.
- 34. Hoff KJ, Lange S, Lomsadze A, Borodovsky M, Stanke M. BRAKER1: unsupervised RNA-seq-based genome annotation with GeneMark-ET and AUGUSTUS. Bioinformatics. 2015;32:767–769. doi: 10.1093/bioinformatics/btv661.
- 35. Haas BJ, et al. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 2003;31:5654–5666. doi: 10.1093/nar/gkg770.
- 36. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J. Mol. Biol. 1990;215:403–410. doi: 10.1016/S0022-2836(05)80360-2.
- 37. Emms DM, Kelly S. OrthoFinder: solving fundamental biases in whole genome comparisons dramatically improves orthogroup inference accuracy. Genome Biol. 2015;16:157. doi: 10.1186/s13059-015-0721-2.
- 38. Landan G, Graur D. Heads or tails: a simple reliability check for multiple sequence alignments. Mol. Biol. Evol. 2007;24:1380–1383. doi: 10.1093/molbev/msm060.
- 39. Landan G, Graur D. Local reliability measures from sets of co-optimal multiple sequence alignments. Pacific Symp. Biocomput. 2008;13:15–24.
- 40. Chatzou M, et al. Multiple sequence alignment phylogenetic tree reconstruction bootstrap analysis evolutionary analysis. Syst. Biol. 2018;67:997–1009. doi: 10.1093/sysbio/syx096.
- 41. Capella-Gutierrez S, Silla-Martinez JM, Gabaldon T. trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics. 2009;25:1972–1973. doi: 10.1093/bioinformatics/btp348.
- 42. Nguyen L-T, Schmidt HA, von Haeseler A, Minh BQ. IQ-TREE: a fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Mol. Biol. Evol. 2015;32:268–274. doi: 10.1093/molbev/msu300.
- 43. Kalyaanamoorthy S, Minh BQ, Wong TKF, von Haeseler A, Jermiin LS. ModelFinder: fast model selection for accurate phylogenetic estimates. Nat. Methods. 2017;14:587–589. doi: 10.1038/nmeth.4285.
- 44. Quang LS, Gascuel O, Lartillot N. Empirical profile mixture models for phylogenetic reconstruction. Bioinformatics. 2008;24:2317–2323. doi: 10.1093/bioinformatics/btn445.
- 45. Wang HC, Minh BQ, Susko E, Roger AJ. Modeling site heterogeneity with posterior mean site frequency profiles accelerates accurate phylogenomic estimation. Syst. Biol. 2018;67:216–235. doi: 10.1093/sysbio/syx068.
- 46. Lartillot N, Rodrigue N, Stubbs D, Richer J. PhyloBayes MPI: phylogenetic reconstruction with infinite mixtures of profiles in a parallel environment. Syst. Biol. 2013;62:611–615. doi: 10.1093/sysbio/syt022.
- 47. Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nat. Methods. 2015;12:59–60. doi: 10.1038/nmeth.3176.
- 48. Brown MW, et al. Phylogenomics places orphan protistan lineages in a novel eukaryotic super-group. Genome Biol. Evol. 2018;10:427–433. doi: 10.1093/gbe/evy014.
- 49. Janouškovec J, et al. A new lineage of eukaryotes illuminates early mitochondrial genome reduction. Curr. Biol. 2017;27:3717–3724. doi: 10.1016/j.cub.2017.10.051.
- 50. Parfrey LW, Lahr DJG, Knoll AH, Katz LA. Estimating the timing of early eukaryotic diversification with multigene molecular clocks. Proc. Natl Acad. Sci. USA. 2011;108:13624–13629. doi: 10.1073/pnas.1110633108.
- 51. Karnkowska A, et al. A eukaryote without a mitochondrial organelle. Curr. Biol. 2016;26:1274–1284. doi: 10.1016/j.cub.2016.03.053.
- 52. Derelle R, et al. Bacterial proteins pinpoint a single eukaryotic root. Proc. Natl Acad. Sci. USA. 2015;112:E693–E699. doi: 10.1073/pnas.1420657112.
- 53. Derelle R, López-García P, Timpano H, Moreira D. A phylogenomic framework to study the diversity and evolution of stramenopiles (=heterokonts). Mol. Biol. Evol. 2016;33:2890–2898. doi: 10.1093/molbev/msw168.
- 54. Strassert JFH, Jamy M, Mylnikov AP, Tikhonenkov DV, Burki F. New phylogenomic analysis of the enigmatic phylum Telonemia further resolves the eukaryote tree of life. Mol. Biol. Evol. 2019;36:757–765. doi: 10.1093/molbev/msz012.
- 55. Burki F, et al. Untangling the early diversification of eukaryotes: a phylogenomic study of the evolutionary origins of Centrohelida, Haptophyta and Cryptista. Proc. R. Soc. B. 2016;283:20152802. doi: 10.1098/rspb.2015.2802.
- 56. Betts HC, et al. Integrated genomic and fossil evidence illuminates life’s early evolution and eukaryote origin. Nat. Ecol. Evol. 2018;2:1556–1562. doi: 10.1038/s41559-018-0644-x.
- 57. Pathmanathan JS, Lopez P, Lapointe F-J, Bapteste E. CompositeSearch: a generalized network approach for composite gene families detection. Mol. Biol. Evol. 2017;35:252–255. doi: 10.1093/molbev/msx283.
- 58. Sonnhammer EL, Eddy SR, Durbin R. Pfam: a comprehensive database of protein domain families based on seed alignments. Proteins. 1997;28:405–420. doi: 10.1002/(SICI)1097-0134(199707)28:3<405::AID-PROT10>3.0.CO;2-L.
- 59. Kanehisa M, Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 2000;28:27–30. doi: 10.1093/nar/28.1.27.
- 60. Huerta-Cepas J, et al. Fast genome-wide functional annotation through orthology assignment by eggNOG-Mapper. Mol. Biol. Evol. 2017;34:2115–2122. doi: 10.1093/molbev/msx148.
- 61. Pandas Development Team. pandas-dev/pandas: Pandas. Zenodo 10.5281/zenodo.6702671 (2020).
- 62. Harris CR, et al. Array programming with NumPy. Nature. 2020;585:357–362. doi: 10.1038/s41586-020-2649-2.
- 63. Letunic I, Bork P. Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and annotation. Bioinformatics. 2007;23:127–128. doi: 10.1093/bioinformatics/btl529.
- 64. Wickham, H. ggplot2: Elegant Graphics for Data Analysis (Springer-Verlag, 2016).
- 65. Virtanen P, et al. Fundamental algorithms for scientific computing in Python. Nat. Methods. 2020;17:261–272. doi: 10.1038/s41592-019-0686-2.
- 66. Kanehisa M, Sato Y, Kawashima M, Furumichi M, Tanabe M. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 2016;44:D457–D462. doi: 10.1093/nar/gkv1070.
- 67. R Core Team. R: A Language and Environment for Statistical Computing. https://www.r-project.org/ (R Foundation for Statistical Computing, 2017).
- 68. Lê S, Josse J, Husson F. FactoMineR: an R package for multivariate analysis. J. Stat. Softw. 2008;25:1–18. doi: 10.18637/jss.v025.i01.
- 69. Kassambara, A. & Mundt, F. factoextra: Extract and visualize the results of multivariate data analyses. Version 1.0.6 https://CRAN.R-project.org/package=factoextra (2019).
- 70. Pedregosa F, et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 2011;12:2825–2830.
- 71. Abadi, M. et al. TensorFlow: large-scale machine learning on heterogeneous distributed systems. Preprint at https://arxiv.org/abs/1603.04467 (2015).
- 72. Chollet, F. et al. Keras. GitHub https://github.com/fchollet/keras (2015).